Issue 904: Store more spectrum and instrument information in .blib files when possible

2022-08-24 13:00

Brendan MacLean

Title	»	Store more spectrum and instrument information in .blib files when possible
Assigned To	»	Matt Chambers
Notify	»	Kaipo Tamura (test);Brian Pratt;Rita Chupalov;Nick Shulman
Type	»	Defect
Area	»	Skyline
Priority	»	2
Milestone	»	23.1

Now that we have a property sheet associated with the spectrum graph pane that we use to show spectral library information, it seems likely that people would want to see even more information about the spectra and the instrument they came from when that is possible for us to store and show. Rita will be adding similar information to the property sheet we will be showing in the Full-Scan graph.

To do this, we will need to begin storing these values in the .blib files. Here are a few ideas:
- mass analyzer (TOF, LIT, Orbitrap)
- CE
- isolation range
- MS level (could be 2 or 3)
- Scan Info

Potentially the values shown in SeeMS that we could just calculate now:
- Data Points
- Base Peak Intensity (would this be easier to read in scientific notation?)
- Total Ion Current (would this be easier to read in scientific notation?)
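
As a rough illustration of the three calculated values above, here is a minimal Python sketch. The function name `spectrum_summary` and the peak list are hypothetical, and the two-decimal scientific notation is only one possible answer to the readability question raised in the list.

```python
def spectrum_summary(mzs, intensities):
    """Return data point count, base peak intensity, and total ion current."""
    if len(mzs) != len(intensities):
        raise ValueError("mzs and intensities must have the same length")
    return {
        "data_points": len(intensities),
        # Base peak intensity is the largest single intensity in the spectrum.
        "base_peak_intensity": max(intensities, default=0.0),
        # Total ion current is the sum of all intensities in the spectrum.
        "total_ion_current": sum(intensities),
    }


# Hypothetical peak list, for illustration only.
summary = spectrum_summary([200.1, 350.2, 500.3], [1500.0, 1234567.0, 42000.0])

# Scientific notation with two decimals, as might be shown in a property sheet.
bpi_text = f"{summary['base_peak_intensity']:.2e}"
tic_text = f"{summary['total_ion_current']:.2e}"
print(summary["data_points"], bpi_text, tic_text)
```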

File properties:
- Instrument type
- Instrument serial number
- Acquisition time

2022-08-26 20:41

Matt Chambers

Any preference on adding these properties as columns or as a new key-value table? The latter would allow adding new properties without causing schema compatibility issues (old Skyline versions could use newer blibs and just ignore any new properties).

The Bruker TDF and TSF formats have a pretty smart implementation of the key-value approach. They have property groups which are rarely varying (or run-wide) properties, and an individual property table for properties that can vary from frame to frame. The frame-level properties can override the group-level properties and, with a simple SQL query or view, both sets can be enumerated together as if they are all frame-level properties. This would avoid thousands of spectra rows having redundant values for mass analyzer, ion source, activation type, ms level, scan range, etc.
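
To make the group/override idea concrete, here is a minimal sketch using Python's `sqlite3` module. Every table, column, and view name below (`PropertyDefinitions`, `GroupProperties`, `SpectrumProperties`, `AllSpectrumProperties`, `propertyGroup`) is an assumption made for illustration; it is not the actual Bruker schema or an agreed .blib schema, and `RefSpectra` is reduced to a two-column stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema: names are illustrative, not an agreed .blib layout.
CREATE TABLE PropertyDefinitions (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE RefSpectra (id INTEGER PRIMARY KEY, propertyGroup INTEGER);
CREATE TABLE GroupProperties (propertyGroup INTEGER, property INTEGER, value);
CREATE TABLE SpectrumProperties (specId INTEGER, property INTEGER, value);

-- Spectrum-level values override group-level values; properties that are
-- set at neither level are omitted from the view.
CREATE VIEW AllSpectrumProperties AS
SELECT s.id AS specId, d.name AS name, COALESCE(sp.value, gp.value) AS value
FROM RefSpectra s
CROSS JOIN PropertyDefinitions d
LEFT JOIN GroupProperties gp
       ON gp.propertyGroup = s.propertyGroup AND gp.property = d.id
LEFT JOIN SpectrumProperties sp
       ON sp.specId = s.id AND sp.property = d.id
WHERE COALESCE(sp.value, gp.value) IS NOT NULL;

INSERT INTO PropertyDefinitions VALUES
    (1, 'massAnalyzer'), (2, 'msLevel'), (3, 'collisionEnergy');
INSERT INTO RefSpectra VALUES (1, 1), (2, 1);
INSERT INTO GroupProperties VALUES (1, 1, 'Orbitrap'), (1, 2, 2);
INSERT INTO SpectrumProperties VALUES (2, 2, 3), (2, 3, 27.5);
""")


def all_properties(spec_id):
    """Enumerate group and spectrum properties together for one spectrum."""
    rows = conn.execute(
        "SELECT name, value FROM AllSpectrumProperties WHERE specId = ?",
        (spec_id,),
    )
    return dict(rows.fetchall())


print(all_properties(1))  # group values only
print(all_properties(2))  # msLevel overridden, collisionEnergy added
```

In this sketch both spectra share one group row set, so the run-wide values are stored once rather than once per spectrum.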

It's tempting to put the few file-level properties in SpectrumSourceFiles, but I'm not 100% sure that a blib input file (e.g. mzIdentML) can't have spectra from multiple runs. I guess we could consider it a bug if we had that case and weren't splitting each run into separate rows in SpectrumSourceFiles.

2022-09-06 12:02

Brian Pratt

I like where Matt's thinking leads on this. We currently treat the .blib format more like a flat file than a database. It's true that there are performance implications for fancier use of queries, but as we cache to our own binary format after an initial read these are less troublesome than they might be.

2022-09-08 08:48

Matt Chambers

I didn't consider the SLC file. Does that mean any new property Skyline reads from blib has to be added to SLC as well?

2022-09-08 08:58

Brendan MacLean

Not at all. The SLC (Skyline Library Cache) for .blib files essentially contains only the index for the library. After that, Skyline relies on the SQLite file itself to look up anything else it needs. It was just too slow doing an initial SELECT * to get the full list and IDs. For other types of libraries, especially the text versions like SPTXT and MSP, everything including the spectra ends up in a binary SLC file. I guess it is worth noting that the SLC files, which sit silently beside their source libraries and can be deleted, do not all share a consistent format; each depends on the file type it is caching.

2022-09-08 11:15

Nick Shulman

I don't like the idea of a "Key & Value" table. The problem with tables like that is that different properties have different storage types. With the "exp.ObjectProperty" table in LabKey, the table has columns "floatValue", "dateTimeValue" and "stringValue", and in any particular table row, only one of those columns has a non-null value.
In theory, with SQLite, we would not need to do that, since SQLite columns can hold any data type. I am not sure whether we should actually take advantage of this feature.
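
For reference, a small Python `sqlite3` sketch of what that SQLite feature looks like in practice. The `SpectrumProperties` table and the property names are hypothetical; the point is only that a column declared without a type keeps each value's own storage class, which `typeof()` reports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "value" is declared without a type, so SQLite applies no type conversion
# and stores each value with its own storage class.
conn.execute("CREATE TABLE SpectrumProperties (specId INTEGER, name TEXT, value)")
conn.executemany(
    "INSERT INTO SpectrumProperties VALUES (?, ?, ?)",
    [
        (1, "collisionEnergy", 27.5),     # stored as real
        (1, "msLevel", 2),                # stored as integer
        (1, "massAnalyzer", "Orbitrap"),  # stored as text
    ],
)

rows = conn.execute(
    "SELECT name, value, typeof(value) FROM SpectrumProperties ORDER BY name"
).fetchall()
for row in rows:
    print(row)
```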

I think we should just add columns to the RefSpectra table for all the values we want to store. We should only add columns for things that we have data for, and readers of the database should have to look at the database schema in order to see which values are available.
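
A minimal sketch of how a reader might inspect the schema before selecting optional columns, using Python's `sqlite3` and `PRAGMA table_info`. The `RefSpectra` table here is a reduced stand-in, and the optional column names (`collisionEnergy`, `massAnalyzer`, `msLevel`) are assumptions for illustration only.

```python
import sqlite3


def available_columns(conn, table):
    """Return the set of column names that exist in the given table."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    # The table name is a trusted constant here, not user input.
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


conn = sqlite3.connect(":memory:")
# Stand-in library in which only one of the optional columns was written.
conn.execute(
    "CREATE TABLE RefSpectra ("
    "id INTEGER PRIMARY KEY, peptideSeq TEXT, precursorMZ REAL, collisionEnergy REAL)"
)
conn.execute("INSERT INTO RefSpectra VALUES (1, 'PEPTIDE', 400.2, 27.5)")

# Hypothetical optional columns a newer reader would like to show if present.
OPTIONAL = ["collisionEnergy", "massAnalyzer", "msLevel"]
existing = available_columns(conn, "RefSpectra")
present = [name for name in OPTIONAL if name in existing]

# Select only the optional columns that this particular file actually has.
query = "SELECT id" + "".join(f", {name}" for name in present) + " FROM RefSpectra"
row = conn.execute(query).fetchone()
print(present, row)
```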

I might be wrong. It might be that we should come up with logical groups of properties (probably groups of values which are likely to be shared by many rows in the RefSpectra table, such as Isolation Target, Isolation Window Lower, Isolation Window Upper) and there would be foreign keys in the RefSpectra table which point to values in other tables.
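
One way the grouped-table alternative could look, sketched with Python's `sqlite3`. The `IsolationWindows` table, its column names, the `isolationWindowId` foreign key, and the helper `window_id` are all hypothetical, and `RefSpectra` is again a reduced stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical grouping: one row per distinct isolation window.
CREATE TABLE IsolationWindows (
    id INTEGER PRIMARY KEY,
    targetMz REAL, lowerMz REAL, upperMz REAL,
    UNIQUE (targetMz, lowerMz, upperMz)
);
CREATE TABLE RefSpectra (
    id INTEGER PRIMARY KEY,
    peptideSeq TEXT,
    isolationWindowId INTEGER REFERENCES IsolationWindows(id)
);
""")


def window_id(target, lower, upper):
    """Return the id of a matching window row, creating it if needed."""
    key = (target, lower, upper)
    conn.execute(
        "INSERT OR IGNORE INTO IsolationWindows (targetMz, lowerMz, upperMz) "
        "VALUES (?, ?, ?)",
        key,
    )
    return conn.execute(
        "SELECT id FROM IsolationWindows "
        "WHERE targetMz = ? AND lowerMz = ? AND upperMz = ?",
        key,
    ).fetchone()[0]


# Two spectra share a window, so only two window rows are stored for three spectra.
spectra = [
    ("PEPTIDEA", (500.0, 499.0, 501.0)),
    ("PEPTIDEB", (500.0, 499.0, 501.0)),
    ("PEPTIDEC", (600.0, 598.0, 602.0)),
]
for seq, window in spectra:
    conn.execute(
        "INSERT INTO RefSpectra (peptideSeq, isolationWindowId) VALUES (?, ?)",
        (seq, window_id(*window)),
    )

window_count = conn.execute("SELECT COUNT(*) FROM IsolationWindows").fetchone()[0]
joined = conn.execute(
    "SELECT s.peptideSeq, w.targetMz, w.lowerMz, w.upperMz "
    "FROM RefSpectra s JOIN IsolationWindows w ON w.id = s.isolationWindowId "
    "ORDER BY s.id"
).fetchall()
print(window_count, joined)
```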

2022-09-08 11:37

Matt Chambers

The polymorphic column values seem to work well for Bruker's format. They have a crap-ton of properties: way more than we would have. This is just a few of the non-null ones for a single frame:
Frame  Property  PermanentName                            Value            Type
2      387       Collision_Bias_Act                       -26.091          real
2      269       Collision_GasFlushSwitch_Set             0                integer
2      95        Collision_GasFlushTime_Set               60.0             real
2      391       Collision_GasSupply_Act                  55.02            real
2      389       Collision_In_Act                         -69.863          real
2      390       Collision_Out_Act                        -25.444          real
2      305       Collision_QuenchCycle_Set                63               integer
2      388       Collision_RF_Act                         683.594          real
2      267       Configuration_SyncFeedbackActive         1                integer
2      292       Digitizer_AcquisitionTime_Set            75.0             real
2      417       Digitizer_ActualSpectraRate              13.458225667528  real
2      306       Digitizer_AppendScans                    0                integer
2      416       Digitizer_CurrentTemp                    49.0             real
2      415       Digitizer_Cycle_Summation                774              integer
2      307       Digitizer_DiscardScans                   0                integer
2      308       Digitizer_ExtractTriggerTime             96.0             real
2      300       Energy_Ramping_Collision_Energy_Active   0                integer
2      392       FocusPreCollisionCell_Lens1_Act          -31.502          real
2      393       FocusPreCollisionCell_Lens2_Act          50.66            real
2      394       FocusPreCollisionCell_Lens3_Act          -29.292          real
2      379       FocusPreQuadrupole_Lens1_Extraction_Act  -36.492          real
2      378       FocusPreQuadrupole_Lens1_Storage_Act     -36.44           real
2      279       FocusPreQuadrupole_Lens1_Voltages_Cmd    Blob             blob
...snip

I don't like the idea of having a variable number of columns depending on the input data. That doesn't make sense when you could have different values in different input files: so you'd still have to check which values are null in every row. That isn't the case in the key-values approach where only non-null values could be selected. With the PropertyGroup optimization, clients can query in either a performance-oriented way (read and cache the property group values in memory and only read the non-group values for every spectrum row) or a convenience-oriented way (read all values for every spectrum row ignoring which ones are group values).
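
The performance-oriented path might look roughly like the following Python `sqlite3` sketch: group values are read once into memory, and only the per-spectrum values are queried for each spectrum. All table and column names (`GroupProperties`, `SpectrumProperties`, `propertyGroup`) and the helper `properties_for` are assumptions for illustration, not a settled schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical key-value schema with a property group per spectrum.
CREATE TABLE RefSpectra (id INTEGER PRIMARY KEY, propertyGroup INTEGER);
CREATE TABLE GroupProperties (propertyGroup INTEGER, name TEXT, value);
CREATE TABLE SpectrumProperties (specId INTEGER, name TEXT, value);
INSERT INTO RefSpectra VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO GroupProperties VALUES
    (1, 'massAnalyzer', 'Orbitrap'), (1, 'msLevel', 2),
    (2, 'massAnalyzer', 'TOF'), (2, 'msLevel', 2);
INSERT INTO SpectrumProperties VALUES
    (2, 'collisionEnergy', 27.5), (2, 'msLevel', 3);
""")

# Read every group's values once and keep them in memory.
group_cache = {}
for group_id, name, value in conn.execute(
    "SELECT propertyGroup, name, value FROM GroupProperties"
):
    group_cache.setdefault(group_id, {})[name] = value


def properties_for(spec_id):
    """Merge cached group values with the spectrum's own (overriding) values."""
    (group_id,) = conn.execute(
        "SELECT propertyGroup FROM RefSpectra WHERE id = ?", (spec_id,)
    ).fetchone()
    props = dict(group_cache.get(group_id, {}))
    # Only rows that exist are returned, so no null checks are needed here.
    props.update(
        conn.execute(
            "SELECT name, value FROM SpectrumProperties WHERE specId = ?",
            (spec_id,),
        ).fetchall()
    )
    return props


for spec_id in (1, 2, 3):
    print(spec_id, properties_for(spec_id))
```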

In the Bruker schema, the PropertyGroup column was added to the main Frames table instead of in a separate linker table. That's probably the right way to go unless we have a very compelling reason to not add ANY columns to the RefSpectra table.

There's also the matter of how this would be represented in the BiblioSpec memory model. If we go with property groups in the sqlite schema to reduce table size, it probably makes sense to read those into property groups in the BiblioSpec RefSpectrum class.
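
As a language-neutral sketch of that memory model (written in Python for brevity, not as a proposal for the actual BiblioSpec code), spectra could hold a reference to a shared group plus their own overrides. The class and field names here are hypothetical and only illustrate the lookup order.

```python
from collections import ChainMap
from dataclasses import dataclass, field


@dataclass
class PropertyGroup:
    """Values shared by many spectra (hypothetical model)."""
    id: int
    values: dict


@dataclass
class RefSpectrum:
    """Stand-in for a spectrum record; not the real BiblioSpec class."""
    id: int
    group: PropertyGroup
    overrides: dict = field(default_factory=dict)

    @property
    def properties(self):
        # ChainMap searches overrides first, then falls back to group values.
        return ChainMap(self.overrides, self.group.values)


shared = PropertyGroup(1, {"massAnalyzer": "Orbitrap", "msLevel": 2})
spec_a = RefSpectrum(1, shared)
spec_b = RefSpectrum(2, shared, {"msLevel": 3})

print(dict(spec_a.properties))
print(dict(spec_b.properties))
```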

MacCoss Lab Software
