Thermo Raw files import speed

Thermo Raw files import speed evgeny onishchenko  2018-08-22 02:29
 

Dear Skyline Team,

We are testing our new cluster to make it perform optimally for various tasks, including Skyline (Skyline-daily). To this end, we compared the import time of the same raw
DIA Thermo files stored either on a high-speed network share or on the cluster's local SSD, and then ran the same test on a single desktop (raw files on local SSD vs. high-speed network, with fewer cores and less RAM than the cluster).
We observed that the import time was essentially independent of the processing power and the amount of available RAM, but depended dramatically on where
the files were accessed from. In both tests the local SSD performed ~10x faster than the high-speed network (~3 min vs. ~30 min).
My question is: what would be the best way to increase Skyline performance on the cluster and take advantage of its high processing power?
Would it be possible to run Skyline so that it relies on RAM when processing the raw files?

Thanks so much in advance for your help and advice!
Best,
Evgeny

 
 
Brian Pratt responded:  2018-08-22 10:28
Hi Evgeny,

You've identified the bottleneck with reading any file - it's faster on a local drive than over a network, and it's faster on a local SSD than on a local spinning-disk HDD.

Add to that basic issue the fact that we don't really have control over how the file is accessed by the Thermo reader, so we can't do anything to optimize reads, such as grabbing bigger file chunks at a time.

You can, of course, grab the biggest file chunk of all, by moving the files to local storage before feeding them into your pipeline (or, rather, making that the first step in your pipeline). That's your best move, I think.
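As a first pipeline step, that copy can be as simple as staging each raw file on the node's local scratch space before anything reads it. A minimal sketch (the scratch layout and file names are hypothetical; adjust to your cluster):

```shell
# Stage a raw file on node-local storage and print the local path,
# so later pipeline steps read from the SSD instead of the network.
stage_local() {
  local src="$1" scratch="$2"
  mkdir -p "$scratch"
  cp "$src" "$scratch/"
  echo "$scratch/$(basename "$src")"
}

# Demo with a temporary file standing in for a raw file on network storage:
demo_src="$(mktemp -d)/run01.raw"
touch "$demo_src"
stage_local "$demo_src" "$(mktemp -d)/scratch"
```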

I hope this helps!

Brian Pratt
 
Brendan MacLean responded:  2018-08-22 20:14
Hi Evgeny,
Agreed. When processing on a cluster with any software I would not advise doing any heavy processing across a network. Instead, I would always copy the files to the local drive of the computer doing the processing and then do the processing there.

Skyline is actually relatively friendly to large-scale cluster processing (say, importing 100 files), because you can import each file on a separate machine and then join the imported files in one final process. You do this with the command-line argument --import-no-join. This performs the import up to the point of producing a single Skyline .skyd file for the imported raw data file, with the same root name as the raw data file, in the same folder as the .sky file. You can then copy that file back onto the network drive as the output of the processing on that node.
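On each node, the per-file step might be sketched like this. The function only assembles the SkylineCmd invocation described above, so a scheduler script can log or dry-run it before executing; the document and raw file paths are hypothetical, and --in, --import-file and --import-no-join are the arguments named in this thread:

```shell
# Build the per-node Skyline import command for one raw file.
# Echoing rather than executing makes the assembled command easy to
# inspect in a job script before it is submitted.
node_import_cmd() {
  local sky_doc="$1" raw_file="$2"
  echo "SkylineCmd --in=$sky_doc --import-file=$raw_file --import-no-join"
}

node_import_cmd analysis.sky /local/scratch/run01.raw
```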

Finally, you copy all the .skyd files produced onto a single node in your cluster and repeat the import. Now I am remembering that this final step has an issue: the .skyd file names contain a hash of the path from which they came, and you would want to avoid needing to copy all the original data files locally again for this final step. I feel like we solved this for a cluster pipeline in the MacCoss lab, but maybe not. It may be that our own pipeline is still jumping through hoops to handle this last step. I will take another look and work out a solution to this final join step, if need be. I will report back to this thread.

Thanks for bringing up the topic of importing raw data into a Skyline document on an HPC cluster. We are very interested in this use case, and we are working on adapting the Skyline command-line interface to run on Linux.

--Brendan
 
evgeny onishchenko responded:  2018-08-23 06:15
Dear Skyline team,

Thanks a lot for the info and suggestions, this is very helpful!
We’ll try to parallelise the import for many files and get back to you in case we get stuck somewhere.

Evgeny
 
Brendan MacLean responded:  2018-08-26 14:30
Hi Evgeny,
Checking with the people in our lab implementing the cluster-enabled pipeline, it seems that our current solution for the join step is just to pass in the same paths that were originally supplied to Skyline on the separate nodes where the raw data files were copied locally (i.e. their local paths on those computers). This will allow Skyline to generate file names for the partial .skyd files copied locally to the same directory as the .sky file, and when it finds them, it skips looking for the original raw data files.

It would probably still be better to allow pointing directly at the .skyd files, since they contain their original paths and this would make the join step a bit simpler, but re-importing with the original paths, without the --import-no-join argument, is what is implemented and working now.
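Put together, the join step described above might be sketched as follows: re-run the import on one node, passing the same per-node local paths but dropping --import-no-join, so Skyline finds the partial .skyd files already sitting next to the .sky document and skips the raw data. Paths are hypothetical, and --save is assumed here as the way to persist the joined document:

```shell
# Assemble the final join command: the same original file paths as the
# per-node imports, but without --import-no-join, so Skyline picks up
# the partial .skyd files already copied next to the .sky document.
join_cmd() {
  local sky_doc="$1"; shift
  local cmd="SkylineCmd --in=$sky_doc" f
  for f in "$@"; do
    cmd="$cmd --import-file=$f"
  done
  echo "$cmd --save"
}

join_cmd analysis.sky /local/scratch/run01.raw /local/scratch/run02.raw
```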

Let us know how it goes. Thanks for your interest in using Skyline command-line on a cluster.

--Brendan