Skyline-daily and Skyline 20.1 crashed when loading pepXML from MSFragger fcyu  2020-01-30
 

When loading pepXML or interact.pep.xml from MSFragger/FragPipe using "Import DDA Peptide Search", Skyline crashed with the following error. It looks like it can find the mzML file, which is in the same folder, but it crashed somehow. You can find the mzML, pepXML, interact.pep.xml, and FASTA files in dev2.zip in file sharing. Could you please take a look?

---------------------------
Skyline-daily
---------------------------
ERROR: No spectra were found for the new library.

Command-line: C:\Users\yufe\AppData\Local\Apps\2.0\KG99EWD8.1KH\L9HYCN5B.KLC\skyl..tion_e4141a2a22107248_0014.0000_e9276d4fd631b09e\BlibBuild -s -A -H -o -c 0.95 -i test -S "C:\Users\yufe\AppData\Local\Temp\tmp6570.tmp" "E:\dev\msfragger\dev2\test.redundant.blib"
Working directory: E:\dev\msfragger\dev2
---------------------------
OK More Info
---------------------------
System.IO.IOException: ERROR: No spectra were found for the new library.

Command-line: C:\Users\yufe\AppData\Local\Apps\2.0\KG99EWD8.1KH\L9HYCN5B.KLC\skyl..tion_e4141a2a22107248_0014.0000_e9276d4fd631b09e\BlibBuild -s -A -H -o -c 0.95 -i test -S "C:\Users\yufe\AppData\Local\Temp\tmp6570.tmp" "E:\dev\msfragger\dev2\test.redundant.blib"
Working directory: E:\dev\msfragger\dev2
   at pwiz.Common.SystemUtil.ProcessRunner.Run(ProcessStartInfo psi, String stdin, IProgressMonitor progress, IProgressStatus& status, TextWriter writer) in C:\proj\pwiz_x64\pwiz_tools\Shared\Common\SystemUtil\ProcessRunner.cs:line 62
   at pwiz.BiblioSpec.BlibBuild.BuildLibrary(LibraryBuildAction libraryBuildAction, IProgressMonitor progressMonitor, IProgressStatus& status, String& commandArgs, String& messageLog, String[]& ambiguous) in C:\proj\pwiz_x64\pwiz_tools\Shared\BiblioSpec\BlibBuild.cs:line 201
   at pwiz.Skyline.Model.Lib.BiblioSpecLiteBuilder.BuildLibrary(IProgressMonitor progress) in C:\proj\pwiz_x64\pwiz_tools\Skyline\Model\Lib\BiblioSpecLiteBuilder.cs:line 160
---------------------------

Thanks,

Fengchao

 
 
Brendan MacLean responded:  2020-01-30

While this is a failure, it is not a "crash". You are given a useful error message. "No spectra were found for the new library."

If you feel this was in error, please supply the input file "test" after checking that you feel confident it contains peptide spectrum matches that meet the 0.95 cutoff.

To me, the input file path "test" looks highly suspicious.

Thanks for supplying the full text of the error message.

--Brendan

 
fcyu responded:  2020-01-30

Thanks for your prompt reply. I am almost sure that the data I used here had many high-confidence PSMs. You can find my test.sky in test.sky.zip in "file sharing". This time, I set the cutoff to 0, but there are still errors:

---------------------------
Skyline-daily
---------------------------
ERROR: No spectra were found for the new library.

Command-line: C:\Users\yufe\AppData\Local\Apps\2.0\KG99EWD8.1KH\L9HYCN5B.KLC\skyl..tion_e4141a2a22107248_0014.0000_e9276d4fd631b09e\BlibBuild -s -A -H -o -c 0 -i test -S "C:\Users\yufe\AppData\Local\Temp\tmpF15E.tmp" "E:\dev\msfragger\dev2\test.redundant.blib"
Working directory: E:\dev\msfragger\dev2
---------------------------
OK More Info
---------------------------
System.IO.IOException: ERROR: No spectra were found for the new library.

Command-line: C:\Users\yufe\AppData\Local\Apps\2.0\KG99EWD8.1KH\L9HYCN5B.KLC\skyl..tion_e4141a2a22107248_0014.0000_e9276d4fd631b09e\BlibBuild -s -A -H -o -c 0 -i test -S "C:\Users\yufe\AppData\Local\Temp\tmpF15E.tmp" "E:\dev\msfragger\dev2\test.redundant.blib"
Working directory: E:\dev\msfragger\dev2
   at pwiz.Common.SystemUtil.ProcessRunner.Run(ProcessStartInfo psi, String stdin, IProgressMonitor progress, IProgressStatus& status, TextWriter writer) in C:\proj\pwiz_x64\pwiz_tools\Shared\Common\SystemUtil\ProcessRunner.cs:line 62
   at pwiz.BiblioSpec.BlibBuild.BuildLibrary(LibraryBuildAction libraryBuildAction, IProgressMonitor progressMonitor, IProgressStatus& status, String& commandArgs, String& messageLog, String[]& ambiguous) in C:\proj\pwiz_x64\pwiz_tools\Shared\BiblioSpec\BlibBuild.cs:line 201
   at pwiz.Skyline.Model.Lib.BiblioSpecLiteBuilder.BuildLibrary(IProgressMonitor progress) in C:\proj\pwiz_x64\pwiz_tools\Skyline\Model\Lib\BiblioSpecLiteBuilder.cs:line 160
---------------------------

Thanks,

Fengchao

 
Brendan MacLean responded:  2020-01-30

My mistake. The "-i test" in the command line is indeed referring to the basename of your "test.sky" file and is the library ID, not the input file.

What we will really need is your MSFragger output pepXML and mz[X]ML to look at why BlibBuild.exe is not finding any peptide spectrum matches in your files with a 0.95 cutoff.

Thanks. I will also try to get the BlibBuild developer, Matt Chambers, to take over this discussion.

--Brendan

 
Nick Shulman responded:  2020-01-30
Fengchao,

I see that this used to work in Skyline 19.1.
It looks like we might have changed the way that we interpret MSFragger pepXML files. We used to look up spectra by number, but now we seem to be trying to look them up by name, and we are not finding any spectra with the names that we are looking for.
I do not know how this code is supposed to work but I will ask around and see if I can figure out what is going on.
-- Nick
 
nesvi responded:  2020-02-03
Nick, Brendan,

Thanks for looking into this. I can confirm the error. Note that with timsTOF data (where Skyline goes into MGF files), it is fine. The problem, as you said, seems to be that no spectra are found in the mzML files using the information in the pep.xml file.

Thanks again,
Alexey Nesvizhskii
 
matt.chambers42 responded:  2020-02-03
Alexey, could MSFragger add an attribute to spectrum_query to store the nativeID when it is known? I know the pepXML schema does not have it, but looking up spectra in mzML from pepXML attributes without the nativeID is ambiguous: is it the index? Is it the scan number? If the latter, what about formats that don't have unique one-dimensional scan numbers like WIFF and Waters RAW? MSFragger seems to use scan number, so I'd be interested to see what it uses for mzMLs from WIFF and Waters RAW. I can probably "fix" this for scan-number-only formats, but I can't see how it'd work with those multi-dimensional formats.
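
To illustrate the ambiguity described above, here is a minimal sketch. The nativeID strings are made up for illustration (they only follow the Thermo-style pattern that appears later in this thread), and the helper names by_index and by_scan are hypothetical; this is not how BiblioSpec performs the lookup.

```python
# Made-up nativeIDs for three consecutive spectra in one mzML file.
native_ids = [
    "controllerType=0 controllerNumber=1 scan=1",
    "controllerType=0 controllerNumber=1 scan=2",
    "controllerType=0 controllerNumber=1 scan=3",
]

def by_index(ids, n):
    """Interpret n as a 0-based position in the spectrum list."""
    return ids[n]

def by_scan(ids, n):
    """Interpret n as a scan number and search the nativeID terms for it."""
    wanted = "scan={}".format(n)
    for native_id in ids:
        if wanted in native_id.split():
            return native_id
    return None

# The same integer resolves to different spectra under the two readings.
as_index = by_index(native_ids, 2)
as_scan = by_scan(native_ids, 2)
```

With only a bare number in the pepXML, a reader has to guess which of the two interpretations the search engine meant, and neither works for formats whose nativeIDs carry more than one varying term.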

Also you might consider having MSFragger change the basename when it exports the _calibrated or _uncalibrated MGF files. I've added a workaround in BiblioSpec in the meantime, but the basename should really point at the immediate, available, source file used for generating IDs.

Thanks,
-Matt
 
matt.chambers42 responded:  2020-02-04
Another thing I found. In the MGF MSFragger writes scan numbers without padding, like:
20180819_TIMS2_12-2_AnBr_SA_200ng_HeLa_50cm_120min_100ms_11CT_1_A1_01_2767.165.165.1

But in the pepXML these scan numbers have padding:
20180819_TIMS2_12-2_AnBr_SA_200ng_HeLa_50cm_120min_100ms_11CT_1_A1_01_2767.00165.00165.1

I know the padding is a historical TPP artifact, but it's especially silly in this case when there are 6 digit scan numbers like:
20180819_TIMS2_12-2_AnBr_SA_200ng_HeLa_50cm_120min_100ms_11CT_1_A1_01_2767.146636.146636.1

It's important in the MGF case that the spectrum attribute match the MGF title string, otherwise we won't find the spectrum.
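
A reader that has to cope with both forms could strip the padding before comparing titles. The following is only a sketch under the assumption that titles follow the <run name>.<scan number>.<scan number>.<charge> pattern shown above; the function name strip_scan_padding is hypothetical and this is not the BiblioSpec workaround itself.

```python
import re

# Trailing ".<start scan>.<end scan>.<charge>" of a TPP-style spectrum title.
_TITLE_RE = re.compile(r"^(?P<run>.+)\.(?P<start>\d+)\.(?P<end>\d+)\.(?P<charge>\d+)$")

def strip_scan_padding(title):
    """Remove leading zeros from both scan numbers in a spectrum title.

    Titles that do not match the expected pattern are returned unchanged.
    """
    m = _TITLE_RE.match(title)
    if m is None:
        return title
    return "{}.{}.{}.{}".format(
        m.group("run"), int(m.group("start")), int(m.group("end")), m.group("charge")
    )

RUN = "20180819_TIMS2_12-2_AnBr_SA_200ng_HeLa_50cm_120min_100ms_11CT_1_A1_01_2767"
padded = RUN + ".00165.00165.1"      # form seen in the pepXML
unpadded = RUN + ".165.165.1"        # form seen in the MGF title
six_digit = RUN + ".146636.146636.1"  # already has no padding

normalized = strip_scan_padding(padded)
```

Normalizing both the pepXML spectrum attribute and the MGF title this way lets them be compared as plain strings, and six-digit scan numbers pass through untouched.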
 
matt.chambers42 responded:  2020-02-04
Oh, never mind, I think it's PeptideProphet adding that padding. What a pain...
 
fcyu responded:  2020-02-05

Hi Matt,

Thanks for your suggestion. We made two changes accordingly:

  1. We added a "native_id" attribute to the "spectrum_query" tag. For pepXML that does not come from an mzML, mzXML, or raw file, we used the scan name (<run name>.<scan number>.<scan number>.<charge>) as the native ID.
    Here is an example:
    <spectrum_query start_scan="1873" uncalibrated_precursor_neutral_mass="1186.6248" assumed_charge="3" native_id="controllerType=0 controllerNumber=1 scan=1873" spectrum="b1906_293T_proteinID_01A_QE3_122212.1873.1873.3" end_scan="1873" index="1" precursor_neutral_mass="1186.6260" retention_time_sec="1588.305">
    <search_result>
    <search_hit peptide="SPSAVAMQAGPR" massdiff="0.0497" calc_neutral_pep_mass="1186.5763" peptide_next_aa="A" num_missed_cleavages="0" num_tol_term="2" num_tot_proteins="1" tot_num_ions="44" hit_rank="1" num_matched_ions="6" protein="sp|Q92794|KAT6A_HUMAN Histone acetyltransferase KAT6A OS=Homo sapiens OX=9606 GN=KAT6A PE=1 SV=2" peptide_prev_aa="R" is_rejected="0">
    <modification_info>
    <mod_aminoacid_mass mass="147.0354" position="7"/>
    </modification_info>
    <search_score name="hyperscore" value="14.496"/>
    <search_score name="nextscore" value="13.130"/>
    <search_score name="expect" value="2.715e+00"/>
    </search_hit>
    </search_result>
    </spectrum_query>

  2. We now write the original extension, rather than "mzBIN_calibrated", in the "raw_data_type" and "raw_data" attributes of "msms_run_summary".
    Here is an example:
    <msms_run_summary base_name="b1906_293T_proteinID_01A_QE3_122212" raw_data_type="mzML" comment="This pepXML was from calibrated spectra." raw_data="mzML">
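
For anyone who wants to read the new attribute, here is a minimal sketch using Python's standard xml.etree.ElementTree. SNIPPET is a trimmed, namespace-free version of the two examples above; real pepXML files typically declare an XML namespace, so the tag matching would need adjusting. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed from the examples above; no namespace, most attributes dropped.
SNIPPET = """<msms_run_summary base_name="b1906_293T_proteinID_01A_QE3_122212" raw_data_type="mzML" raw_data="mzML">
<spectrum_query start_scan="1873" assumed_charge="3" native_id="controllerType=0 controllerNumber=1 scan=1873" spectrum="b1906_293T_proteinID_01A_QE3_122212.1873.1873.3" end_scan="1873" index="1"/>
</msms_run_summary>"""

def parse_native_id(native_id):
    """Split 'key=value key=value' terms into a dict; terms without '=' are skipped."""
    return dict(term.split("=", 1) for term in native_id.split() if "=" in term)

def spectrum_keys(xml_text):
    """Return one (kind, value) lookup key per spectrum_query element.

    Prefers native_id when present and falls back to start_scan otherwise.
    """
    root = ET.fromstring(xml_text)
    keys = []
    for query in root.iter("spectrum_query"):
        native_id = query.get("native_id")
        if native_id is not None:
            keys.append(("native_id", native_id))
        else:
            keys.append(("start_scan", query.get("start_scan")))
    return keys

keys = spectrum_keys(SNIPPET)
terms = parse_native_id(keys[0][1])
```

A scan-name style native ID (the <run name>.<scan number>.<scan number>.<charge> form) contains no "=" terms, so parse_native_id returns an empty dict for it and the caller would have to treat it as a title string instead.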

Regarding the suggestion to change <run name>_calibrated.mgf, we prefer to keep it to prevent overwriting the user's original MGF file. You are more than welcome to suggest a better solution.

I uploaded a zip file with name dev4.zip containing one set of files (pepXML, interact.pep.xml, and mzML) from Thermo instrument and another set of files (pepXML, interact.pep.xml, and _calibrated.mgf) from timsTOF for you to test.

Thanks,

Fengchao

 
fcyu responded:  2020-02-07

Hi Matt,

I have one follow-up question: given a raw format file, what does the native ID look like? We took the native ID string from the mzML, which I guess was generated by msconvert, but there is no such string in the raw file. Could you please suggest what to use?

Thanks,

Fengchao

 
matt.chambers42 responded:  2020-02-14

Sorry this fell off the radar. For your follow-up question: if by "raw format file" you mean a Thermo RAW file, then: the nativeID formats are basically chosen by the PSI mzML working group based on how the spectra are addressed in the vendor's software and API. Thermo's API has separate controllers and each may have its own set of scan numbers, so scan number by itself isn't always unique.

Great news about the native_id attribute. I'll add support for it in BiblioSpec soon. For sources where you don't have a true native id, use either 0-based "index=xxx" (e.g. for MGF) or 1-based "scan=xxx" (e.g. for mzXML; you can't assume an mzXML is from Thermo and use Thermo nativeIDs). These formats have their own nativeID formats in the controlled vocabulary.
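
The fallback rule above could be sketched as follows. The function name fallback_native_id and its arguments are hypothetical; only the "index=xxx" (0-based, MGF) and "scan=xxx" (1-based, mzXML) conventions come from this post.

```python
def fallback_native_id(source_format, number):
    """Build a nativeID string for a source that has no vendor nativeID.

    source_format: "mgf" or "mzxml" (case-insensitive).
    number: 0-based spectrum index for MGF, 1-based scan number for mzXML.
    """
    fmt = source_format.lower()
    if fmt == "mgf":
        if number < 0:
            raise ValueError("MGF index is 0-based and cannot be negative")
        return "index={}".format(number)
    if fmt == "mzxml":
        if number < 1:
            raise ValueError("mzXML scan numbers are 1-based")
        return "scan={}".format(number)
    # Other formats are deliberately not guessed at here.
    raise ValueError("no fallback nativeID rule for {!r}".format(source_format))

first_mgf = fallback_native_id("MGF", 0)
first_mzxml = fallback_native_id("mzXML", 1)
```

Raising for unknown formats rather than defaulting to one of the two patterns keeps a writer from emitting, for example, a Thermo-style ID for an mzXML file of unknown origin.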

Regarding the mzML->MSFragger->interact.pep.xml pipeline, the fix (not using new native_id attribute) will be in the next Skyline-daily or you can get the bleeding edge from BiblioSpec.zip or SkylineTester.zip: https://teamcity.labkey.org/viewLog.html?buildId=lastSuccessful&buildTypeId=bt209&tab=artifacts

 
fcyu responded:  2020-02-14

Thanks for your reply. I tested the bleeding edge BiblioSpec.zip with my interact.pep.xml files (without native_id) from Thermo and timsTOF. It looks good. I am wondering if the "native_id" is necessary? If not, I prefer not to write it because it takes more time in IO and breaks the schema more.

Thanks,

Fengchao

 
matt.chambers42 responded:  2020-02-14

It's necessary for mzML input converted from WIFF and Waters RAW. Definitely leave it in, at least for nativeID formats that contain multiple terms.

 
fcyu responded:  2020-02-14

Thanks for the suggestion. We will write native_id when the input is an mzML or mzXML file.

Thanks,

Fengchao