Issue 904: Store more spectrum and instrument information in .blib files when possible

Status:open
Assigned To:Matt Chambers
Type:Defect
Area:Skyline
Priority:2
Milestone:23.1
Opened:2022-08-24 13:00 by Brendan MacLean
Changed:2022-09-08 11:37 by Matt Chambers
Resolved:
Resolution:
Closed:
2022-08-24 13:00 Brendan MacLean
Title»Store more spectrum and instrument information in .blib files when possible
Assigned To»Matt Chambers
Notify»Kaipo Tamura (test);Brian Pratt;Rita Chupalov;Nick Shulman
Type»Defect
Area»Skyline
Priority»2
Milestone»23.1
Now that we have a property sheet associated with the spectrum graph pane that we use to show spectral library information, it seems likely that people would want to see even more information about the spectra and the instrument they came from when that is possible for us to store and show. Rita will be adding similar information to the property sheet we will be showing in the Full-Scan graph.

To do this, we will need to begin storing these values in the .blib files. Here are a few ideas:
- mass analyzer (TOF, LIT, Orbitrap)
- CE
- isolation range
- MS level (could be 2 or 3)
- Scan Info

Potentially the values shown in SeeMS that we could just calculate now:
- Data Points
- Base Peak Intensity (would this be easier to read in scientific notation?)
- Total Ion Current (would this be easier to read in scientific notation?)
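
As a rough illustration of the three calculated values above, here is a minimal Python sketch. It assumes the peak intensities of one spectrum are already in memory as a list; the function name `summarize_peaks` and the sample numbers are made up for this example and are not part of Skyline or BiblioSpec.

```python
def summarize_peaks(intensities):
    """Return (data_points, base_peak_intensity, total_ion_current) for one spectrum."""
    if not intensities:
        return 0, 0.0, 0.0
    # Data points = number of peaks, base peak = tallest peak, TIC = summed intensity.
    return len(intensities), max(intensities), sum(intensities)


# Hypothetical intensities, for illustration only.
points, bpi, tic = summarize_peaks([1200.0, 35000.0, 800.0])

# Scientific notation with three decimals, one possible display format.
bpi_text = f"{bpi:.3e}"
tic_text = f"{tic:.3e}"
print(points, bpi_text, tic_text)
```

Whether scientific notation is actually easier to read in the property sheet is still the open question raised in the list.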

File properties:
- Instrument type
- Instrument serial number
- Acquisition time

2022-08-26 20:41 Matt Chambers
Any preference on adding these properties as columns or as a new key-value table? The latter would allow adding new properties without causing schema compatibility issues (old Skyline versions could use newer blibs and just ignore any new properties).

The Bruker TDF and TSF formats have a pretty smart implementation of the key-value approach. They have property groups which are rarely varying (or run-wide) properties, and an individual property table for properties that can vary from frame to frame. The frame-level properties can override the group-level properties and, with a simple SQL query or view, both sets can be enumerated together as if they are all frame-level properties. This would avoid thousands of spectra rows having redundant values for mass analyzer, ion source, activation type, ms level, scan range, etc.
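
A minimal sketch of how that override could look in SQLite, using Python's `sqlite3` module. This is not the Bruker schema and not a proposed .blib schema: the `GroupProperties` and `SpectrumProperties` tables, the `AllSpectrumProperties` view, the `propertyGroup` column and the property names are all hypothetical, and `RefSpectra` is cut down to two columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Cut-down stand-in for RefSpectra; propertyGroup is a hypothetical new column.
CREATE TABLE RefSpectra (id INTEGER PRIMARY KEY, propertyGroup INTEGER);
-- Rarely varying (or run-wide) values, stored once per group.
CREATE TABLE GroupProperties (propertyGroup INTEGER, name TEXT, value,
                              PRIMARY KEY (propertyGroup, name));
-- Values that vary from spectrum to spectrum.
CREATE TABLE SpectrumProperties (spectrumId INTEGER, name TEXT, value,
                                 PRIMARY KEY (spectrumId, name));

-- Spectrum-level rows win; group-level rows fill in everything else.
CREATE VIEW AllSpectrumProperties AS
SELECT spectrumId, name, value FROM SpectrumProperties
UNION ALL
SELECT s.id, g.name, g.value
FROM RefSpectra s
JOIN GroupProperties g ON g.propertyGroup = s.propertyGroup
WHERE NOT EXISTS (SELECT 1 FROM SpectrumProperties p
                  WHERE p.spectrumId = s.id AND p.name = g.name);
""")

# Two spectra share group 1; spectrum 2 overrides msLevel.
conn.executemany("INSERT INTO RefSpectra VALUES (?, ?)", [(1, 1), (2, 1)])
conn.executemany("INSERT INTO GroupProperties VALUES (?, ?, ?)",
                 [(1, "massAnalyzer", "Orbitrap"), (1, "msLevel", 2)])
conn.executemany("INSERT INTO SpectrumProperties VALUES (?, ?, ?)",
                 [(1, "collisionEnergy", 27.5), (2, "msLevel", 3)])


def properties_for(spectrum_id):
    """Enumerate group-level and spectrum-level values as if all were per-spectrum."""
    rows = conn.execute(
        "SELECT name, value FROM AllSpectrumProperties WHERE spectrumId = ?",
        (spectrum_id,))
    return dict(rows)


print(properties_for(1))
print(properties_for(2))
```

With this layout the mass analyzer is stored once for the group rather than once per spectrum row.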

It's tempting to put the few file-level properties in SpectrumSourceFiles, but I'm not 100% sure that a blib input file (e.g. mzIdentML) can't have spectra from multiple runs. I guess we could consider it a bug if we had that case and weren't splitting each run into separate rows in SpectrumSourceFiles.

2022-09-06 12:02 Brian Pratt
I like where Matt's thinking leads on this. We currently treat the .blib format more like a flat file than a database. It's true that there are performance implications for fancier use of queries, but as we cache to our own binary format after an initial read these are less troublesome than they might be.

2022-09-08 08:48 Matt Chambers
I didn't consider the SLC file. Does that mean any new property Skyline reads from blib has to be added to SLC as well?

2022-09-08 08:58 Brendan MacLean
Not at all. The SLC (Skyline Library Cache) for .blib files essentially contains only the index for the library. After that, Skyline relies on the SQLite file itself to look up anything else it needs. It was just too slow doing an initial SELECT * to get the full list and IDs. For other types of libraries, especially the text versions like SPTXT and MSP, everything including the spectra ends up in a binary SLC file. I guess it is worth noting that the SLC files, which sit silently beside their source libraries and can be deleted, are not all in a consistent format, but depend on the file type they are caching.

2022-09-08 11:15 Nick Shulman
I don't like the idea of a "Key & Value" table. The problem with tables like that is that different properties have different storage types. With the "exp.ObjectProperty" table in LabKey, the table has columns "floatValue", "dateTimeValue" and "stringValue", and in any particular table row, only one of those columns has a non-null value.
In theory, with SQLite, we would not need to do that, since SQLite columns can hold any data type. I am not sure whether we should actually take advantage of this feature.
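
For reference, a small Python sketch of that SQLite behavior. The `Properties` table and the property names are hypothetical. The `value` column is declared without a type, so SQLite stores each value in whatever storage class it arrives in, and `typeof()` reports that class per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "value" has no declared type, so SQLite does not coerce what is stored in it.
conn.execute("CREATE TABLE Properties (name TEXT, value)")
conn.executemany("INSERT INTO Properties VALUES (?, ?)", [
    ("collisionEnergy", 27.5),   # Python float -> REAL
    ("msLevel", 2),              # Python int   -> INTEGER
    ("massAnalyzer", "TOF"),     # Python str   -> TEXT
    ("raw", b"\x00\x01"),        # Python bytes -> BLOB
])

# typeof() returns the storage class of each individual value.
types = dict(conn.execute("SELECT name, typeof(value) FROM Properties"))
print(types)
```

A reader of such a table would still need to know, or check, which type to expect for each property name.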

I think we should just add columns to the RefSpectra table for all the values we want to store. We should only add columns for things that we have data for, and readers of the database should have to look at the database schema in order to see which values are available.
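
A sketch of what "look at the database schema" could mean for a reader, again in Python with `sqlite3`. The helper `available_columns` and the `collisionEnergy` and `massAnalyzer` column names are hypothetical, and the `RefSpectra` table here is a tiny stand-in created only for the example.

```python
import sqlite3


def available_columns(conn, table, wanted):
    """Return the subset of `wanted` column names that exist in `table`.

    The table name is interpolated into the PRAGMA, so it must come from
    trusted code, never from user input.
    """
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    return [name for name in wanted if name in existing]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE RefSpectra (id INTEGER PRIMARY KEY, peptideSeq TEXT, collisionEnergy REAL)")

# Only ask for the optional columns this particular library actually has.
present = available_columns(conn, "RefSpectra", ["collisionEnergy", "massAnalyzer"])
print(present)
```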

I might be wrong. It might be that we should come up with logical groups of properties (probably groups of values which are likely to be shared by many rows in the RefSpectra table, such as Isolation Target, Isolation Window Lower, Isolation Window Upper), and there would be foreign keys in the RefSpectra table that point to values in other tables.

2022-09-08 11:37 Matt Chambers
The polymorphic column values seem to work well for Bruker's format. They have a crap-ton of properties: way more than we would have. This is just a few of the non-null ones for a single frame:
Frame    Property    PermanentName    Value    Type
2    387    Collision_Bias_Act    -26.091    real
2    269    Collision_GasFlushSwitch_Set    0    integer
2    95    Collision_GasFlushTime_Set    60.0    real
2    391    Collision_GasSupply_Act    55.02    real
2    389    Collision_In_Act    -69.863    real
2    390    Collision_Out_Act    -25.444    real
2    305    Collision_QuenchCycle_Set    63    integer
2    388    Collision_RF_Act    683.594    real
2    267    Configuration_SyncFeedbackActive    1    integer
2    292    Digitizer_AcquisitionTime_Set    75.0    real
2    417    Digitizer_ActualSpectraRate    13.458225667528    real
2    306    Digitizer_AppendScans    0    integer
2    416    Digitizer_CurrentTemp    49.0    real
2    415    Digitizer_Cycle_Summation    774    integer
2    307    Digitizer_DiscardScans    0    integer
2    308    Digitizer_ExtractTriggerTime    96.0    real
2    300    Energy_Ramping_Collision_Energy_Active    0    integer
2    392    FocusPreCollisionCell_Lens1_Act    -31.502    real
2    393    FocusPreCollisionCell_Lens2_Act    50.66    real
2    394    FocusPreCollisionCell_Lens3_Act    -29.292    real
2    379    FocusPreQuadrupole_Lens1_Extraction_Act    -36.492    real
2    378    FocusPreQuadrupole_Lens1_Storage_Act    -36.44    real
2    279    FocusPreQuadrupole_Lens1_Voltages_Cmd    Blob    blob
...snip

I don't like the idea of having a variable number of columns depending on the input data. That doesn't make sense when you could have different values in different input files: so you'd still have to check which values are null in every row. That isn't the case in the key-values approach where only non-null values could be selected. With the PropertyGroup optimization, clients can query in either a performance-oriented way (read and cache the property group values in memory and only read the non-group values for every spectrum row) or a convenience-oriented way (read all values for every spectrum row ignoring which ones are group values).
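
To make the performance-oriented option concrete, here is a minimal Python sketch: group values are read once and cached, and only the non-group values are read for each spectrum. All table, column and property names are hypothetical stand-ins, and `ChainMap` is just one convenient way to express "spectrum value first, group value otherwise" in memory.

```python
import sqlite3
from collections import ChainMap

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RefSpectra (id INTEGER PRIMARY KEY, propertyGroup INTEGER);
CREATE TABLE GroupProperties (propertyGroup INTEGER, name TEXT, value);
CREATE TABLE SpectrumProperties (spectrumId INTEGER, name TEXT, value);
INSERT INTO RefSpectra VALUES (1, 1), (2, 1);
INSERT INTO GroupProperties VALUES (1, 'massAnalyzer', 'TOF'), (1, 'msLevel', 2);
INSERT INTO SpectrumProperties VALUES (2, 'msLevel', 3);
""")

# Read every property group once and keep it in memory.
group_cache = {}
for group_id, name, value in conn.execute(
        "SELECT propertyGroup, name, value FROM GroupProperties"):
    group_cache.setdefault(group_id, {})[name] = value


def load_properties(spectrum_id):
    """Read only the spectrum-level rows; shared values come from the cache."""
    (group_id,) = conn.execute(
        "SELECT propertyGroup FROM RefSpectra WHERE id = ?", (spectrum_id,)).fetchone()
    own = dict(conn.execute(
        "SELECT name, value FROM SpectrumProperties WHERE spectrumId = ?",
        (spectrum_id,)))
    # Lookups try the spectrum's own values first, then the shared group values.
    return ChainMap(own, group_cache.get(group_id, {}))


print(dict(load_properties(1)))
print(dict(load_properties(2)))
```

The convenience-oriented option would skip the cache and select all values for every spectrum row, for example through a view that merges the two property tables.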

In the Bruker schema, the PropertyGroup column was added to the main Frames table instead of in a separate linker table. That's probably the right way to go unless we have a very compelling reason to not add ANY columns to the RefSpectra table.

There's also the matter of how this would be represented in the BiblioSpec memory model. If we go with property groups in the sqlite schema to reduce table size, it probably makes sense to read those into property groups in the BiblioSpec RefSpectrum class.