Skyline Regular unable to process SWATH data set

Skyline Regular unable to process SWATH data set mmidha  2017-09-21 12:58
 
Hi Brendan,

I have a SWATH data set comprising 45 WIFF files (15 conditions x 3 replicates each). To speed up the analysis, I tried importing with the "Many" simultaneous files option, but at the data analysis step the Import Results window got stuck at 40% indefinitely, and I eventually cancelled all imports. When I imported with the "Several" files option instead, the analysis completed, but it took days before I could fetch the report.

At no point during the analysis did Skyline use more than 10% CPU, so it seems Skyline was unable to use the full potential of the system. Here are the details of the system on which I am running regular Skyline:

Intel Xeon CPU E5-2683 v4 @ 2.10 GHz
Logical processors: 64
Memory: 256 GB
Disk 0 (C:): 4 TB
Disk 1 (D:): 10 TB

My question is: can regular Skyline analyze a SWATH data set of this size? If yes, how can I speed up the analysis (it would be great if Skyline could utilize the full potential of the system)? If not, what is the alternative (SkylineRunner)?

Let me know if you need any other details.

cheers

Mukul
 
 
Nick Shulman responded:  2017-09-21 16:59
Can you send us your files? We will try to figure out what is taking so long.
You can upload those files here:
https://skyline.ms/files.url

Or, if you have another way of sharing large files, such as Google Drive, Dropbox, or your own server, that works too.
 
Brendan MacLean responded:  2017-09-21 17:13
I will say, however, that you really want to be using the Skyline command-line interface (SkylineRunner or SkylineCmd) on that type of computer, with the --import-process-count argument, to take full advantage of the processing power you have there.
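As a rough sketch, a single-file multi-process import might look like the following. The document and data paths here are placeholders, and the exact argument list should be checked against the Skyline command-line documentation and the tutorial scripts mentioned below:

```shell
REM Sketch of a multi-process import from the Windows command line.
REM Paths and file names are placeholders; --in, --import-file,
REM --import-process-count, and --save are Skyline command-line arguments.
SkylineCmd --in="C:\Analysis\swath_study.sky" ^
           --import-file="C:\Data\Condition01_Rep1.wiff" ^
           --import-process-count=12 ^
           --save
```

The tutorial pages linked below include complete batch scripts that loop over many raw files.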

I have only used a NUMA server with 48 logical processors, but I can perform pretty massive processing with it using 12 parallel processes, which produces a 6-8x speed-up over single-threaded mode. In the UI, with multiple threads, I can only get about a 3x speed-up; something about overloading the garbage collector, I think. I am able to get a little more performance out of the multithreaded case on a NUMA server if I switch to server-mode garbage collection, but it is still not as good as running multi-process.

We started a project to make multi-process import work in the UI, but never quite made it to passing the chromatogram extraction progress back from the running processes. It is much easier to pass just the percentage completed, as we do for the command line.

You should review the two new large-scale DIA tutorials and start with the scripts on those pages.

https://skyline.ms/webinar14.url
https://skyline.ms/webinar15.url

Maybe someday we will either get it working in the Skyline UI or create another UI for running this multi-process import with log capture.

Sorry we haven't made it easier yet, but when you get this set up correctly, it should really scream on that system. If you need help, maybe we can set up a WebEx, after you have begun working with the command-line interface.

--Brendan
 
Brendan MacLean responded:  2017-09-21 17:47
You may also want to stick to "Several" in the UI. In a recent Skyline-daily, I have capped the number of parallel files we will use in the UI at 8, but in the released Skyline 3.7:

Many = 1/2 * logical processors
Several = 1/4 * logical processors

That means your only choices are 32 or 16 parallel files. It is unlikely you will actually benefit from that much parallelism, and it can definitely make things worse. Maybe I should even cap "Several" at 4 so that you always have a lower choice.
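On a 64-logical-processor machine like the one described above, those formulas work out as follows (a simple illustrative calculation; the variable names are mine):

```shell
# Parallel file counts in released Skyline 3.7, from the formulas above.
logical_processors=64
many=$(( logical_processors / 2 ))      # "Many"    -> 32 parallel files
several=$(( logical_processors / 4 ))   # "Several" -> 16 parallel files
echo "Many=$many Several=$several"
```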

What kind of disk drive are you working on? And what kind of files are you importing?

I just created the following tip with information about how performance for parallel file import scales on both i7 and PowerEdge NUMA systems:

https://skyline.ms/wiki/home/software/Skyline/page.view?name=perf_scaling

Hope these resources are helpful for you.
 
mmidha responded:  2017-09-22 14:16
Thank you Brendan for your help.

True, analyzing the DIA data set through parallel import in the UI is not working at all in my case, which also discourages me from using it for a really large data set (around 360 WIFF files). I will certainly try the Skyline command line, and I may reach out to you again for help with it very soon.

Regarding the kind of hard drive: they are standard disk drives, model no. HGST HUH721010ALE600. I do not have much more information about the disk drives. What kind of specification are you looking for?

I am importing 45 SWATH (100vW) WIFF files generated on a TripleTOF 6600; on average each WIFF file is around 3-4 GB.

One more thing: while using the UI, RAM shortage was never a problem. With 256 GB in total and more than 70% of RAM available at any point during processing, it is clear there is no problem related to automatic memory management.

Cheers

Mukul
 
Brendan MacLean responded:  2017-09-22 14:29
One last question, since you mention WIFF files: are you importing them using the "TOF" mass analyzer setting or "Centroided"?

While I have had good luck processing SCIEX 6600 data in Centroided mode, I have never seen it benefit 5600 data, and SCIEX does its centroiding on the fly, which is something like 10x slower than just importing in profile mode with TOF.

If you have 5600 data, just stick to TOF.
If you have 6600 data and you want to use Centroided with it, then use ProteoWizard to convert it to mzML before you process it.

If you have a spinning hard drive and you plan on using parallel import, it is important that you convert to mzML and not mz5. Parallel import actually makes things worse for mz5 on a spinning drive, while mzML does quite nicely.
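As a sketch, a conversion along these lines with ProteoWizard's msconvert would centroid the data during conversion. The file and output directory names are placeholders; check msconvert's own help for the full filter syntax:

```shell
REM Convert a 6600 WIFF file to centroided mzML with ProteoWizard msconvert.
REM The vendor peak-picking filter applies centroiding to all MS levels.
msconvert Condition01_Rep1.wiff --mzML ^
    --filter "peakPicking vendor msLevel=1-" ^
    -o centroided_mzml
```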

If you stick with TOF as your mass analyzer, then you may do slightly worse in detections and quantitative precision on 6600 data (not a lot), but profile-mode WIFF import works well with parallel file import.

Never try to import WIFF files directly in Centroided mode.
 
mmidha responded:  2017-09-22 14:55
OK, Brendan. I have been importing WIFF files using TOF as the product mass analyzer with a 30,000 resolving power setting.
 
Brendan MacLean responded:  2017-09-22 15:00
Should be fine, then. Out of curiosity, how many transitions does your Skyline document contain (both targets and decoys)?
 
mmidha responded:  2017-09-22 15:28
There are around 3 million transitions (50% targets and 50% decoys, shuffled sequences).
 
Brendan MacLean responded:  2017-09-23 15:49
I don't know that you are going to make it through 360 WIFF files with 3 million transitions. That is over 1 billion chromatograms. You wouldn't see any memory issues importing on the command line until you get to the end and try to join everything together and build the mProphet model. I would expect you to run into problems around 50-100 runs with that many transitions. I have successfully processed 20 runs with 6.5 million transitions, and about 60 runs with 1.5 million transitions, on a system with 196 GB of RAM.

You may need to process this in batches, export reports and then run statistics in R on the reports.