Support for importing results from the Sage search engine

Support for importing results from the Sage search engine mlazear  2023-05-16 09:07
 

Is it possible to add support for importing search results from Sage into Skyline? Sage is an open-source, free, and faster implementation of the fragment indexing approach popularized by MSFragger.

Sage writes a simple TSV based format - a couple notes (happy to provide more info if you decide to add support):

  • filename column may contain ".gz" or ".gzip" at the end - Sage can directly search mzMLs packaged in gzip, as exported by MSConvert, etc. ("b1906_293T_proteinID_01A_QE3_122212.mzML", "b1906_293T_proteinID_01A_QE3_122212.mzML.gz", etc). Sage only supports the mzML format at this time.
  • Search results may be concatenated from multiple files
  • Peptide sequences follow the ProForma notation: N-terminal modifications are specified as "[+42.0]-PEPT[+57.0]IDE", and C-terminal as "PEPTIDE-[+42.0]", etc.
  • Unless supplied with a fasta file already containing decoy entries, Sage will generate decoys by reversing the interior amino acids (SAMPLER becomes SELPMAR)
  • Currently, Sage is outputting unfiltered results, so decoys will be present (label = -1), and uses a built-in linear discriminant analysis PSM rescorer (with spectrum_q, peptide_q, protein_q corresponding to Q-values at each aggregation level and "posterior_error" corresponding to log10(PEP) at the spectrum level)
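For anyone writing a reader, the notes above could be handled roughly as in the sketch below. It is only an illustration built from the examples in this post: the helper names are hypothetical, and the modification parsing covers just the bracketed mass-delta forms shown here, not the full ProForma notation.

```python
import re

# Matches a bracketed mass delta such as "[+57.0]" or "[-17.03]".
_MOD = re.compile(r"\[([+-]?\d+(?:\.\d+)?)\]")


def strip_gzip_suffix(name):
    """Drop a trailing ".gz" or ".gzip" so the name matches the mzML file."""
    for suffix in (".gzip", ".gz"):
        if name.lower().endswith(suffix):
            return name[: -len(suffix)]
    return name


def parse_modified_peptide(seq):
    """Return (bare_sequence, mods).

    mods maps a position to a mass delta: 0 is the N-terminus,
    1..n are residues, and n + 1 is the C-terminus.
    """
    n_term = c_term = None
    m = re.match(r"\[([+-]?\d+(?:\.\d+)?)\]-", seq)
    if m:  # leading "[+42.0]-" is an N-terminal modification
        n_term = float(m.group(1))
        seq = seq[m.end():]
    m = re.search(r"-\[([+-]?\d+(?:\.\d+)?)\]$", seq)
    if m:  # trailing "-[+42.0]" is a C-terminal modification
        c_term = float(m.group(1))
        seq = seq[: m.start()]

    bare, mods, i = [], {}, 0
    while i < len(seq):
        m = _MOD.match(seq, i)
        if m:
            # A bracket applies to the residue just before it.
            mods[len(bare)] = float(m.group(1))
            i = m.end()
        else:
            bare.append(seq[i])
            i += 1
    if n_term is not None:
        mods[0] = n_term
    if c_term is not None:
        mods[len(bare) + 1] = c_term
    return "".join(bare), mods


def reverse_interior(seq):
    """Build a decoy by reversing everything except the first and last residue."""
    if len(seq) < 3:
        return seq
    return seq[0] + seq[1:-1][::-1] + seq[-1]
```

With these helpers, reverse_interior("SAMPLER") gives "SELPMAR", matching the decoy example above, and parse_modified_peptide("[+42.0]-PEPT[+57.0]IDE") gives the bare sequence "PEPTIDE" with a delta at position 0 (N-terminus) and at position 4 (the T).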

I have uploaded an example results file to https://skyline.ms/_webdav/home/support/file sharing/%40files/results.sage.tsv

The raw data corresponds to (b1906_293T_proteinID_01A_QE3_122212.mzXML) from the paper An Ultra-tolerant Database Search Identifies more than 100,000 Modified Peptides (PXD001468). I can upload the mzML to S3 if needed.

Please let me know if there is more information, files, etc I can provide!
Best,
Mike

 
 
Brian Pratt responded:  2023-05-17 08:30

Hi Mike,

That looks pretty straightforward. There'd be a little work handling the .gz* business, and the scan location column values (not saying this is problematic in this case, but it's often where things get tricky). Otherwise pretty straightforward, I think. It looks like you're the developer, so you could actually take a run at this yourself if you wanted, with our help.

Best,

Brian Pratt

 
mlazear responded:  2023-05-17 09:12

I would be happy to try and contribute, if you could provide some guidance!

 
Brian Pratt responded:  2023-05-17 09:22

Probably the best bang for your buck would be creating a .pepXML output - that would get you into Skyline and a number of other packages as well. And you wouldn't have to spelunk into our codebase. :)

 
Brendan MacLean responded:  2023-05-17 10:01

Another standard format option is mzIdentML, which is arguably the modern standard to replace pepXML. If you have a look at the web page where we have detailed what is currently supported:

https://skyline.ms/wiki/home/software/Skyline/page.view?name=building_spectral_libraries

You can see that ByOnic, MSGF+, PeptideShaker, and Scaffold all support mzIdentML, though certainly a lot more tools support pepXML. Once you have either, we would still need to make a minor modification to our scoring lists to recognize Sage as a valid score provider. And then we could add a line for you to the page linked to above.

Thanks for your interest in integrating with Skyline.

 
mlazear responded:  2023-05-17 10:24

Unfortunately, supporting mzIdentML is not trivial (~900 lines of code). I would strongly prefer to avoid supporting any XML based formats if possible (instead preferring to support modern data storage formats like parquet that can be dropped into the existing scientific software stack [e.g. pandas, polars]), given that we frequently use Sage for searching >100s of files at once and the resulting XML based file formats are space hogs.

It seems like MaxQuant, X!Tandem, Mascot, Proteome Discoverer have their native output formats supported, presumably this is also acceptable for Sage?

 
Brian Pratt responded:  2023-05-17 12:49

Certainly doable, yes.

 
Brendan MacLean responded:  2023-05-17 15:23

Typically, the search results files are not big compared with the data files. I am not a huge fan of the proliferation of TSV formats. Yes, it seems people have an easier time writing them than any form of standardized XML, but it does get a bit tricky for us trying to recognize them all by the columns they contain. MaxQuant is different because it uses a clear filename, msms.txt. The others you list are XML, DAT, and SQLite with the extension pdResult.
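To make the "recognize them by the columns they contain" problem concrete, here is a minimal sketch of header sniffing for a Sage TSV, followed by dropping decoys and low-confidence PSMs. The column names label, spectrum_q, peptide_q, protein_q and posterior_error come from the original post; "filename" as the exact header name, the function names, and the 0.01 cutoff are assumptions for illustration, and this is not how Skyline or BiblioSpec actually detects formats.

```python
import csv
import io

# Columns named in the thread; "filename" as a header name is an assumption.
SAGE_COLUMNS = {
    "filename", "label", "spectrum_q", "peptide_q", "protein_q", "posterior_error",
}


def looks_like_sage(header_line):
    """True if every expected Sage column appears in the tab-separated header."""
    columns = set(header_line.rstrip("\r\n").split("\t"))
    return SAGE_COLUMNS <= columns


def confident_targets(tsv_text, max_q=0.01):
    """Keep target PSMs (label != -1) whose spectrum-level q-value passes the cutoff."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row for row in reader
        if int(row["label"]) != -1 and float(row["spectrum_q"]) <= max_q
    ]
```

A check like this is brittle if columns are renamed between versions, which is part of why a distinctive file name or a standard format is easier to support.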

 
mlazear responded:  2023-05-18 09:45

Luckily, Sage also uses a clear filename, "results.sage.tsv". I also write a JSON file containing the engine version, search params, input files, and output file locations, e.g.

{
  "version": "0.13.0",
  "database": {
    "enzyme": {
      "missed_cleavages": 1,
      "min_len": 7,
      "max_len": 40,
      "cleave_at": "KR",
      "restrict": "P",
      "c_terminal": true
    },
    "static_mods": {
      "C": 57.0215
    },
    "variable_mods": {
      "M": [
        15.994
      ]
    },
    "fasta": "fasta/human_yeast.fasta"
  },
  "precursor_tol": {
    "da": [
      -2.0,
      2.0
    ]
  },
  "fragment_tol": {
    "ppm": [
      -50.0,
      50.0
    ]
  },
  "deisotope": true,
  "chimera": true,
  "wide_window": true,
  "min_peaks": 15,
  "max_peaks": 250,
  "max_fragment_charge": 2,
  "min_matched_peaks": 2,
  "report_psms": 4,
  "predict_rt": true,
  "mzml_paths": [
    "path/A1.mzML.gz",
    "..."
  ],
  "output_paths": [
    "PRM/results.sage.pin",
    "PRM/results.sage.tsv",
    "PRM/results.json"
  ]
}
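As a rough sketch of how an importer might use this file, the snippet below reads the JSON and picks out the engine version, the results TSV, and the spectrum files. The keys "version", "output_paths" and "mzml_paths" are the ones shown above; the function name and the returned dictionary layout are hypothetical.

```python
import json
from pathlib import PurePosixPath


def summarize_sage_run(json_text):
    """Pull the version, results TSV path, and spectrum files out of a Sage results.json."""
    config = json.loads(json_text)
    # Assumes forward-slash paths as in the example above.
    tsv_paths = [
        p for p in config.get("output_paths", [])
        if PurePosixPath(p).name == "results.sage.tsv"
    ]
    return {
        "version": config.get("version"),
        "results_tsv": tsv_paths[0] if tsv_paths else None,
        "spectrum_files": config.get("mzml_paths", []),
    }


# Trimmed-down version of the example above.
example = """
{
  "version": "0.13.0",
  "mzml_paths": ["path/A1.mzML.gz"],
  "output_paths": ["PRM/results.sage.pin", "PRM/results.sage.tsv", "PRM/results.json"]
}
"""
summary = summarize_sage_run(example)
print(summary["version"], summary["results_tsv"])
```

Reading the version from the JSON would also let an importer adapt if the TSV columns change between Sage releases.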
 
mlazear responded:  2023-05-19 13:48

What's the best way to get started on this?

 
Brian Pratt responded:  2023-05-22 08:44

I still think your best plan would be mzIdentML, but if we're not going that route then you'll want to get the Skyline and BiblioSpec code from GitHub and look at how we handle the other formats you mentioned.

That's a fairly steep hill to climb, though. We can probably help you with this after ASMS (or if it's me personally, not until the end of June due to travel plans).