Poor System Utilization by Skyline

chinmaya k  2019-11-27

Hello Skyline Team,

I am trying to run a DIA analysis of 2 DIA raw files against a fairly large spectral library (581,197 precursors corresponding to 477,997 peptides) using Skyline 19.1 on our Windows workstation (please see the attachment for details).

However, I find that Skyline barely uses the CPU and RAM, even for small changes, while occupying all the available disk space (the disk containing this Skyline document currently has 1.8 TB of space).

It would be great if Skyline had an option to set the number of processors/cores to use; I have not found one.

Thank You,

Brendan MacLean responded:  2019-11-29

Skyline should be making pretty good use of parallelism at this point. We have spent a lot of time on it. But the parallel functions of Skyline are mostly limited to multi-file parallelism, which is controlled by the choice you make at file import time for "Files to import simultaneously" in the "Import Results" form. It has defaulted to "One at a time", but I just changed the default to "Many", the setting I would definitely recommend on a Dell PowerEdge (at least with the current Skyline-daily, but also with just 2 files). Also, --import-threads=# at the command line should be even better, or --import-process-count=# for Skyline 19.1 (which has better multi-process performance than multi-threaded).
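As a rough illustration of the command-line route mentioned above, the sketch below assembles an argument list in Python. Only --import-threads and --import-process-count come from this thread; the executable name SkylineCmd.exe, the --in, --import-file and --save arguments, the use of one --import-file per raw file, and the file names are all assumptions that should be checked against the command-line help of the installed Skyline version.

```python
def build_import_command(document, raw_files, workers, use_processes=False,
                         executable="SkylineCmd.exe"):
    """Build (but do not run) a Skyline command line for a parallel import.

    executable, --in, --import-file and --save are assumptions for this sketch;
    --import-threads and --import-process-count are the options named in the thread.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Multi-process import was recommended for 19.1, multi-threaded for Skyline-daily.
    flag = "--import-process-count" if use_processes else "--import-threads"
    args = [executable, f"--in={document}"]
    args += [f"--import-file={path}" for path in raw_files]
    args += [f"{flag}={workers}", "--save"]
    return args


if __name__ == "__main__":
    # Hypothetical document and raw file names.
    cmd = build_import_command("dia_analysis.sky", ["run1.raw", "run2.raw"],
                               workers=2, use_processes=True)
    print(" ".join(cmd))
```

The list form can be passed to subprocess.run unchanged, which avoids quoting problems with paths that contain spaces.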

That said, I am not a huge fan of just building bigger and bigger lists of targets. If the library contains 581,197 precursors, then likely you have used the default of 581,197 precursors for decoys. Though, even with the much smaller Pan Human spectral library from ETH, I felt using 1/4 as many decoys was sufficient and had a performance benefit.

Assuming you have doubled your numbers with decoys, then you would have 1,162,394 precursors, which even with just 6 fragment ions each would mean about 7.0 million transitions. If you included precursor transitions extracted from MS1 spectra, then you likely doubled the amount of chromatogram extraction that Skyline does.

For just 2 DIA files on the system you have, this should still work out okay, but the current Skyline implementation would limit you to 2 threads for spectrum retrieval (which would be relatively inactive), 2 threads for chromatogram extractions, and 8 threads for peak scoring. So, it is unlikely to make full use of a Dell Xeon PowerEdge with... how many cores? 24, i.e. 48 logical processors after hyperthreading?
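To put those thread counts in perspective, here is a small back-of-the-envelope sketch. It simply adds up the three thread counts quoted above and divides by the guessed 48 logical processors; treating all of those threads as busy at the same time is an assumption, so the result is an optimistic upper bound and not a measurement.

```python
def thread_budget(retrieval, extraction, scoring, logical_processors):
    """Return (total threads, fraction of logical processors they could occupy)."""
    if logical_processors < 1:
        raise ValueError("logical_processors must be at least 1")
    total = retrieval + extraction + scoring
    return total, total / logical_processors


if __name__ == "__main__":
    # 2 + 2 + 8 threads for a 2-file import, on a guessed 48 logical processors.
    total, fraction = thread_budget(2, 2, 8, 48)
    print(f"{total} threads, at most {fraction:.0%} of the logical processors")
```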

For this, you really need to increase the file count to at least 6. Though, I hope you don't want to extend this many targets to large numbers of biologically important samples. I would definitely recommend starting with some refinement on replicates (more than 2, though) to establish what you can hope to detect reliably and what you can hope to measure with low enough technical variance to see important biological variance.

Anyway, it should be possible to at least import your two files in parallel, and you should get better performance out of Skyline-daily right now than out of Skyline 19.1. We made some multithreading performance improvements in the past month or so to facilitate new diaPASEF support.


P.S. - Yes, probably your exact scenario would benefit from our allowing higher single-file parallelism at the chromatogram extraction level, given the small number of files and the massive number of transitions and the number of cores on your computer, but we have not yet built that flexibility.

chinmaya k responded:  2019-12-09

Hi Brendan,

Thank you for the response.

Finally, I was able to finish the DIA data search for the 2 .raw files against the spectral library mentioned earlier. However, the system took more than 4 days to extract and import both the MS1 and MS/MS information and match it against the spectral library. As you suggested, I set the import option to several files at a time for maximum system utilization, and it did work, occupying 100% of the 128 GB of RAM and more than 1.5 TB of disk space (some Proteome Discoverer searches were also run in between).

The library I used consists of 290,000 precursors corresponding to 239,154 peptides, which, after adding an equal number of decoys, increased to 581,197 precursors corresponding to 478,304 peptides, as mentioned above. How does using 1/4 as many decoys help in calculating the FDR? Can you please elaborate?

I had 2,55,61,964 transitions with decoys.
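The grouping in this figure is consistent with the Indian numbering convention, in which the last three digits form one group and the remaining digits are grouped in pairs. If that is what was meant, the value reads as 25,561,964 in Western grouping. A minimal sketch to check this reading (the function names are made up for illustration, and only non-negative integers are handled):

```python
def parse_grouped_integer(text):
    """Drop the group separators and read the digits as one integer."""
    return int(text.replace(",", ""))


def format_indian(n):
    """Format a non-negative integer with Indian grouping (3 digits, then pairs)."""
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    return ",".join(groups + [tail])


if __name__ == "__main__":
    value = parse_grouped_integer("2,55,61,964")
    print(f"{value:,}")          # Western grouping of the same value
    print(format_indian(value))  # round trip back to the posted form
```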

The system has 22 cores with 44 logical processors and yes, the slowness was observed in spectrum retrieval when compared to other steps.

However, the new Skyline-daily hangs when I try to reopen Transition Settings after the spectral library match and decoy generation have been performed with the same spectral library, and I could not continue from there.

Thank you,

Brendan MacLean responded:  2019-12-09

I can't quite read that number "2,55,61,964". The thousands separators seem misplaced. The pan-human assay which I have definitely processed a lot on a similar system without anything like a 4-day wait has 205,233 precursors and 163,053 peptides without decoys. When I was processing it a lot, I found that I got similar results using less than 1:1 targets:decoys. To test this I created a Skyline document with no decoys and used the command-line arguments (--decoys-add=shuffle --decoys-add-count=%DECOYS%) to vary the decoy count across a range of 2x values (or incrementing by log2) and I eventually settled on "40761" or 1/4 the number of peptides.
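A sketch of how such a sweep might be scripted is below. The --decoys-add and --decoys-add-count arguments are the ones quoted above; the halving schedule starting from the peptide count and the number of steps are assumptions for illustration, since the exact counts tested are not listed beyond the final choice, and integer division may not reproduce that number exactly.

```python
def decoy_counts(peptide_count, steps=4):
    """Decoy counts at 1, 1/2, 1/4, ... of the peptide count (integer division)."""
    if peptide_count < 1 or steps < 1:
        raise ValueError("peptide_count and steps must be positive")
    return [peptide_count // (2 ** k) for k in range(steps)]


def decoy_args(count, method="shuffle"):
    """Command-line arguments for one point in the sweep."""
    return [f"--decoys-add={method}", f"--decoys-add-count={count}"]


if __name__ == "__main__":
    # 163,053 is the pan-human peptide count mentioned above.
    for count in decoy_counts(163053):
        print(" ".join(decoy_args(count)))
```

Each printed pair of arguments would be appended to a command line that opens the decoy-free document, so every run starts from the same set of targets.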

Skyline uses statistics (Storey-Tibshirani 2003) which do not require the targets:decoys ratio to be 1:1.
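To illustrate why the ratio need not be 1:1, here is a simplified q-value sketch in the spirit of that approach. It is not Skyline's implementation: it assumes the decoy scores are used only as an empirical null distribution for target p-values, and it estimates the null proportion at a single lambda instead of the smoothed estimate described in the paper. The scores in the example are made up.

```python
from bisect import bisect_left


def empirical_p_values(target_scores, decoy_scores):
    """p-value of each target: share of decoys scoring at least as high (+1 pseudo-count)."""
    decoys = sorted(decoy_scores)
    n = len(decoys)
    p_values = []
    for score in target_scores:
        at_least = n - bisect_left(decoys, score)  # decoys with score >= target score
        p_values.append((at_least + 1) / (n + 1))
    return p_values


def estimate_pi0(p_values, lam=0.5):
    """Single-lambda estimate of the null proportion, capped at 1."""
    above = sum(1 for p in p_values if p > lam)
    return min(1.0, above / (len(p_values) * (1.0 - lam)))


def q_values(p_values, pi0):
    """q-values: pi0 * m * p / rank, made monotone from the largest p-value down."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q


if __name__ == "__main__":
    decoys = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]   # made-up decoy scores
    targets = [20, 15, 12, 4.5]               # made-up target scores
    p = empirical_p_values(targets, decoys)
    print(q_values(p, estimate_pi0(p)))
```

In this sketch the number of decoys only affects how finely the null distribution is sampled, which limits the smallest p-value that can be resolved; it does not enter the q-value formula as a ratio to the number of targets.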

I guess there must just be an extra number in your transition total, and you are using close to 2.5 million transitions. That doesn't seem so bad to me, especially on the kind of system you are using, unless for some reason you are ending up doing full-gradient extraction, i.e. extracting all of your chromatograms across the entire gradient.

Thinking about this more, I think this must actually be what is happening to you. Can you please send a screenshot of your Transition Settings - Full-Scan tab and also explain how you are limiting chromatogram extraction to a subset of your gradient for each peptide? I really can't imagine anything like what you are reporting without somehow ending up extracting full-gradient chromatograms, which just isn't going to work for multiple reasons on this kind of analysis.

What you have described sounds only slightly larger than the Pan-Human assay, which I know works in Skyline on the type of system you have. So, once we get your settings worked out, this should be possible.