Large dataset processing, skyd silently deleted

Large dataset processing, skyd silently deleted Phil Charles  2019-07-31

Hiya,

We are processing a fairly large SWATH-type DIA experiment through Skyline. Based on a fractionated spectral library, the .sky file contains 100,197 precursors and 970,600 transitions. We have 207 Lumos raw files to add to the project. I'm aware this is probably somewhat outside usual operating bounds. My ultimate goal is to export a custom report containing the XICs of each transition for further analysis and comparison with other processing workflows.

Because the Skyline interface becomes unresponsive when dealing with this number of transitions, after setting up the .sky file and importing from the spectral library, I saved and closed the document, then used SkylineRunner to do the raw file import on a fast desktop with --import-process-count=32 (making sure to also add --save!). The machine thought about this for quite a while (10 days, although I think it would have been faster if I had used fewer processes to avoid cache thrashing) and finally generated a corresponding .skyd that's just over a terabyte in size. The processing finished with no error messages logged to the console.
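For reference, the invocation was along these lines (the document name and raw-file folder are placeholders, not my actual paths; only --import-process-count=32 and --save are quoted from my real run):

```shell
# import every raw file in the folder into the document, 32 processes at a time,
# and save the document when done
SkylineRunner.exe --in=BigDIA.sky --import-all=D:\RawFiles --import-process-count=32 --save
```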

I backed up the whole project and then tried to open the .sky file. Skyline considered this request for about 15 minutes, then opened the file but without any results. In the background, I later discovered, it silently deleted the .skyd, which was Not Helpful. Thank goodness for backups!

I have restored from backup and tried again, with the same result. I've also tried setting the .skyd to read-only / write access denied, in which case Skyline is unable to "assist" me by deleting the terabyte of cached chromatograms (representing quite a lot of processing time), but it still reports no results attached to the document. When I try to export the report with SkylineRunner, there's also no replicate information.

Is there any way to persuade Skyline to use the data in the .skyd, and salvage the progress so far? Also, the silent deletion thing ...isn't ideal.

Many thanks and best wishes,

Phil

 
Nick Shulman responded:  2019-07-31
Can you send us the .sky file and the .skyd file? I realize those are going to be huge files, but it's the best way for us to figure out what exactly happened.

You can upload the files here:
https://skyline.ms/files.url

Or, if you have another way of sharing large files, such as Google Drive or Dropbox, that works too.

It sounds like the replicates in the .sky file ended up not matching what was in the .skyd file. That could have happened if the .sky file never got saved after the replicates were added. If that's the case, then it might be straightforward to recover. What you would do is open your .sky file in Skyline and add all of your replicates, making sure that they have exactly the same full paths as when you first made Skyline extract chromatograms. Skyline will begin to extract chromatograms. While Skyline is still working, save your document and exit Skyline.
Then, you copy over the .skyd file from your backup, giving it the same base filename as your other .sky file.
Then you start Skyline up. Skyline will notice that there is a .skyd file there that has chromatograms for all of the replicates in the .sky file, and Skyline will know that it's done.
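Concretely, the copy step might look like this (paths and the "BigDIA" name are hypothetical; the key point is that the restored .skyd gets the same base filename as the .sky file):

```shell
# overwrite the partial .skyd that Skyline just started writing with the
# backed-up full cache; BigDIA.sky must sit next to BigDIA.skyd
cp -f /backups/BigDIA.skyd /project/BigDIA.skyd
```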
-- Nick
 
Phil Charles responded:  2019-08-01
Hi Nick,

Thank you for the fast response and suggestions.

I copied the .sky file to a new location, then opened it in Skyline, confirmed there were no replicates currently associated, and re-saved it into the original location. I then went to File > Import and imported all the replicate files. Interestingly, I did not have to quit: after thinking for a little while, Skyline recognised the existing .skyd (which I still have set to read-only, as it is rather time-consuming to keep replacing a 1 TB file from backup!) and went straight to importing the data from the cache. So far so good. Unfortunately, the import process does not complete; instead I get an 'Index was outside the bounds of the array.' error - see attached 'err.PNG' and 'loadingSkydError.txt'. I also get the same error if I try to import just one replicate rather than all 207.

I tried saving the file anyway (since, despite the error message, it does at least list the replicates in the .sky file) and then tried again with SkylineRunner. Exporting a report with the names and paths of all replicates returned the expected result, but exporting a report which draws on chromatogram data (e.g. transition Raw Times and Intensities) just does not include any chromatographic data columns at all. There are no error messages in the console, and it reports that it's opening the .skyd cache, although I'm unsure whether the subsequent progress percentages refer instead to the opening progress for the spectral library. See attached 'exportingTransitionsViaSkylineRunnerError.txt'.
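For context, the export command was along these lines (document path, report name, and output file here are placeholders; the real report is a custom one drawing on transition Raw Times and Intensities):

```shell
# export a saved custom report from the document to a CSV file
SkylineRunner.exe --in=BigDIA.sky --report-name="Transition XICs" --report-file=xics.csv
```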

I am happy to share the project files but would prefer to send them via a non-public route. Could you email me (philip [dot] charles [at] ndm [dot] ox [dot] ac [dot] uk) so I can reply with a weblink? I will also have to split the 1 TB .skyd into many sub-25 GB files with 7-Zip in order to avoid file size upload limits - is it safe to assume you will be able to reconstitute it at your end?
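In case it helps, the split and rejoin would be along these lines (archive and file names are arbitrary):

```shell
# split the cache into 25 GB volumes: produces skyd.7z.001, skyd.7z.002, ...
7z a -v25g skyd.7z BigDIA.skyd

# at the receiving end, extracting the first volume pulls in the rest automatically
7z x skyd.7z.001
```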

Cheers,

Phil
 
Nick Shulman responded:  2019-08-09
Phil,

Thanks for sending me your files.
The reason that Skyline was having a problem was that there were more than 2 billion candidate peaks in the .skyd file (these come from one million transitions and 200 replicates).
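As a back-of-the-envelope check (the figure of roughly ten candidate peaks per chromatogram is an approximation, not an exact per-document count), the arithmetic shows why this document lands right at a signed 32-bit boundary:

```shell
# ~970,600 transitions x 207 replicates = ~200.9 million chromatograms
chromatograms=$((970600 * 207))
echo "chromatograms: $chromatograms"

# at roughly ten candidate peaks per chromatogram, the total peak count
# comes out around 2 billion, brushing up against Int32.MaxValue
peaks=$((chromatograms * 10))
int32_max=$((2**31 - 1))   # 2,147,483,647
echo "peaks: ~$peaks (Int32.MaxValue: $int32_max)"
```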
I sent you a special build of Skyline that is able to handle up to 4 billion peaks. You can use that to go to "Edit > Manage Results" and create smaller Skyline documents that will be able to be opened with regular builds of Skyline.

I am going to try to make it so that future versions of Skyline never try to create a .skyd file with too many peaks in them.
That is, when you are extracting chromatograms Skyline first puts the chromatograms into individual .skyd files containing the chromatograms for each .raw file. Then, at the end of it all Skyline joins all of those individual .skyd files together into one big .skyd. I will make it so that this joining never happens if any of the limits would be exceeded.

Longer term, it would be nice to remove this limitation for .skyd files, but that will probably have to wait until we come up with a new format for these files.
-- Nick
 
Tobi responded:  2019-08-16
Dear Nick,

thank you for the extensive explanation. I'm not sure how frequently you give out the special build for 4 billion peaks, but if possible I would be happy to take a look, as it is likely to be needed in one of our ongoing projects.

If you have a link or a small installer:

toby37678@freenet.de

Looking forward to hearing from you.

Best regards,
tobi
 
Nick Shulman responded:  2019-08-16
Tobi,

Here's the special Skyline installer that can handle more than 2 billion candidate peaks without getting that "Index was outside the bounds of the array" error.
https://skyline.ms/_webdav/home/support/%40files/20190816_SkylineInstallerHigherPeakLimit.msi

It was my intention that this would only be used to do "Manage Results" on a .skyd file that was getting that error in order to shrink it down to a more manageable size.
If you want to routinely have .skyd files that are that big, then we will need to go forward with another plan of ours which is to keep large .skyd files unmerged.
-- Nick