guest 2024-10-14 |
BiblioSpec is a suite of software tools for creating and searching MS/MS peptide spectrum libraries.
BiblioSpec 2.0 stores spectrum libraries as sqlite3 files. Sqlite3 is a lightweight, open-source database format that can be read and manipulated with any sqlite3 tool in addition to BiblioSpec. For more information about the library format, see the file formats page. The new format is a departure from version 1.0, which used its own binary format, so tools and libraries from the two versions are not compatible. There is, however, a conversion tool for turning a version 1.0 library into a sqlite3 library.
The BiblioSpec package contains the following programs:
BiblioSpec is freely available under the BSD license. Click here to go to the Download and build page.
Several reference libraries will be available soon for download.
An overview of all file formats including a list of all the database search files that can be used to build libraries.
Creates a library of spectra with known peptide and/or small molecule identifications. Typically, these identifications come from a database search such as SEQUEST or Mascot, sometimes followed by an evaluation step such as Percolator or PeptideProphet. BlibBuild accepts files from a variety of database search programs, as well as some other spectral library formats. File formats are identified by file extension; the recognized extensions are given in the table below. In many cases, the peptide identification (peptide sequence, charge state and optional score) is in a separate file from the spectrum information. Unless noted, it is assumed that both files will be in the same directory.
Database search | Peptide ID file extension | Spectrum file extension | Score Used | Notes |
Generic SSL | .ssl | | score column | A generic format for encoding spectrum library entries. |
ByOnic | .mzid | .MGF, .mzXML, .mzML | AbsLogProb | |
Comet/SEQUEST/Percolator | .perc.xml, .sqt | .cms2, .ms2, .mzXML | q-value | Percolator v1.17 does not include sequence modification information therefore the .sqt file from the SEQUEST search must be present in the same directory, the directory containing the cms2/ms2 spectrum files, or the current working directory. |
DIA-NN | .speclib | | none | No separate spectrum file. In the current implementation, no score is imported from the library, so all spectra are imported. |
IDPicker | .idpXML | .mzXML, .mzML | FDR | The name(s) of the spectrum file(s) are given in the .idpXML file. |
MS Amanda | .pep.xml, .pepXML | .mzML, .mzXML, .MGF, RAW* | q-value | |
MSFragger | .pep.xml, .pepXML | .mzML, .mzXML, .MGF, RAW* | q-value | |
MSGF+ | .mzid, .pepXML | .mzML, .mzXML, .MGF, RAW* | expectation value | |
Mascot | .dat | | expectation value | No separate spectrum file. |
MaxQuant Andromeda | msms.txt + evidence.txt + mqpar.xml + modifications.xml | .mzML, .mzXML, .MGF, RAW* | PEP | It is possible to use peaks embedded in the msms.txt, but external spectra files are preferred because the embedded peaks are charge deconvoluted. mqpar.xml must be located in the grandparent, parent, or same directory. A custom modifications.xml, modifications.local.xml, or modification.xml can be placed in the same directory as the search results (or specified using the -x option). |
Morpheus | .pep.xml, .pepXML | .mzXML, .mzML | q-value | The names of the .mzXML files are given in the .pep.xml file and may be in the parent or grandparent directory. Spectra are looked up by index, which is calculated using (scan number - 1). |
OMSSA | .pep.xml, .pepXML | .mzXML, .mzML | expectation value | The names of the .mzXML files are given in the .pep.xml file and may be in the parent or grandparent directory. |
OpenSWATH | .tsv | | m_score column | No separate spectrum file. |
PEAKS DB | .pep.xml, .pepXML | .mzXML, .mzML | confidence score | The names of the .mzXML files are given in the .pep.xml file and may be in the parent or grandparent directory. |
PLGS MSe | final_fragment.csv | | score column | There need not be a . before 'final_fragment'. |
PRIDE | .pride.xml | | various | No separate spectrum file. |
PeptideProphet/iProphet | .pep.xml, .pepXML | .mzML, .mzXML, .MGF, RAW* | probability score | The names of the .mzXML files are given in the .pep.xml file and may be in the parent or grandparent directory. |
PeptideShaker | .mzid | .MGF | confidence score | |
Protein Pilot | .group.xml | | confidence score | No separate spectrum file. |
Protein Prospector | .pep.xml, .pepXML | .mzML, .mzXML, .MGF, RAW* | expectation value | |
Proteome Discoverer | .msf, .pdResult | | q-value | No separate spectrum file. Libraries cannot be built from databases that do not contain q-values, unless a cutoff score of 0 is explicitly specified. |
Proxl XML | .proxl.xml | .mzML, .mzXML, .MGF, RAW* | q-value | |
Scaffold | .mzid | .MGF, .mzXML, .mzML | peptide probability | |
Spectronaut | .csv | | none | Spectronaut Assay Library export. No separate spectrum file. |
Spectrum Mill | .pep.xml, .pepXML | .mzXML, .mzML | expectation value | The names of the .mzXML files are given in the .pep.xml file and may be in the parent or grandparent directory. |
X! Tandem | .xtan.xml | | expectation value | No separate spectrum file. |
*RAW includes vendor formats like RAW, WIFF, .D, etc.
BlibBuild [options] <peptide id file>[+] <library name>
<peptide id file> – A file containing peptide spectrum matches to be included in the library. The associated spectrum files should be in the same directory as the peptide id file but should not be given on the command line. See the above table for recognized formats. Multiple files may be listed together.
<library name> – The name of the library being created. An existing library may be overwritten or added to.
The output is a spectrum library in sqlite3 format.
Create a library from an existing one such that the new library has only one spectrum for each peptide ion. The representative spectrum is chosen by taking the dot product of all pairs of spectra for a peptide and selecting the one with the highest average score.
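The selection rule described above can be sketched in a few lines of Python. This is only an illustration of "highest average pairwise dot product", not BlibFilter's actual code: the function names, the 1.0 m/z binning, and the unit-length normalization are all assumptions made for the example, and BlibFilter's own peak preprocessing may differ.

```python
import math
from itertools import combinations


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities into fixed-width m/z bins and scale to unit length.

    The bin width and normalization are assumptions for this sketch.
    """
    binned = {}
    for mz, intensity in peaks:
        key = int(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    norm = math.sqrt(sum(v * v for v in binned.values()))
    if norm == 0.0:
        return binned
    return {k: v / norm for k, v in binned.items()}


def dot_product(a, b):
    """Dot product of two sparse binned spectra."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def pick_representative(spectra, bin_width=1.0):
    """Return the index of the spectrum with the highest average
    dot product against all other spectra of the same peptide ion."""
    if not spectra:
        raise ValueError("need at least one spectrum")
    if len(spectra) == 1:
        return 0
    vectors = [bin_spectrum(s, bin_width) for s in spectra]
    totals = [0.0] * len(vectors)
    # Score every pair once and credit both members of the pair.
    for i, j in combinations(range(len(vectors)), 2):
        score = dot_product(vectors[i], vectors[j])
        totals[i] += score
        totals[j] += score
    averages = [t / (len(vectors) - 1) for t in totals]
    return max(range(len(vectors)), key=averages.__getitem__)
```

Each spectrum is passed as a list of (m/z, intensity) pairs. A spectrum that resembles most of its siblings scores high against all of them, so it wins over an outlier.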
BlibFilter [options] <redundant-library> <filtered-library>
<redundant-library> – A library file with multiple spectra for all or some peptide ions.
<filtered-library> – The name to be given to the resulting library.
The output is a library of spectra for the same peptides as the initial library, but with only one spectrum per peptide ion.
-m [ --memory-cache ] <size> – SQLite memory cache size in Megs. Default 250M.
-n [ --min-peaks ] <num> – Only include spectra with at least this many peaks. Default 20.
-s [ --min-score ] <score> – Best spectrum must have at least this average score to be included. Default 0.
-p [ --parameter-file ] <file> – File containing search parameters. Command line values override file values.
-v [ --verbosity ] <level> – Control the level of output to stderr. (silent, error, status, warn, debug, detail, all) Default status.
-h [ --help ] – Print help message.
Search a spectrum library for matches to query spectra.
BlibSearch [options] <spectrum filename> <library filename>[+]
<spectrum filename> – A file containing spectra to search. File formats accepted are .ms2, .cms2, .mzXML, .mzML, .MGF, and .wiff (Windows only).
<library filename> – The library to be searched for matches to the query. Libraries may be filtered (the output of BlibFilter) or redundant (the output of BlibBuild). More than one library can be listed on the command line.
Results are printed to a report file (tab-delimited text). The file may be named with the --report-file option; by default it is named after the spectrum file with the extension replaced with .report. A separate report file is written for any decoy spectra searched. An optional sqlite .psm file may also be produced.
-c [ --clear-precursor ] <true|false> – Remove the peaks in an X m/z window around the precursor from the query and library spectrum. Default true.
--topPeaksForSearch <num> – Use this many of the highest intensity peaks. Default 100.
-w [ --mz-window ] <size> – Compare query to library spectra with precursor m/z +/- size. Default 3.
-L [ --low-charge ] <charge> – Search only spectra with charge no less than this. Default 1.
-H [ --high-charge ] <charge> – Search only spectra with charge no higher than this. Default 5.
-m [ --report-matches ] <num> – Return this number of the best matches for each query. Use -1 to report all. Default 5.
--psm-result-file <name> – Return results in a .psm file of the given name. Default no .psm file.
-R [ --report-file ] <name> – Return results in a report file of the given name. Default is the spectrum file name with the extension replaced with .report.
--preserve-order – Search spectra in the order they appear in the file. Default is to search as sorted by precursor m/z.
-p [ --parameter-file ] <name> – File containing search parameters. Command line values override file values.
-v [ --verbosity ] <level> – Control the level of output to stderr. (silent, error, status, warn, debug, detail, all) Default status.
-h [ --help ] – Print help message.
Write an MS2 file that contains all spectra in a library.
BlibToMS2 [options] <library>
<library> – A spectrum library file, filtered or redundant.
The spectra are printed to a file named <library>.ms2 in the MS2 format. The scan number is replaced with the library ID number. Two 'D' lines contain the peptide sequence with and without modifications.
-f [ --file-name ] <ms2 file> – Use this name for the output MS2 file rather than the default name, <library>.ms2.
-m [ --mz-precision ] <num> – Write the peak m/z values with this many digits of precision. Default 2.
-i [ --intensity-precision ] <num> – Write the peak intensity values with this many digits of precision. Default 1.
-p [ --parameter-file ] <file> – Specify parameters in a separate file. Command line values override the file.
-v [ --verbose ] <silent|error|status|warn> – Set the verbosity level of the output to stderr. The default level is status.
-h [ --help ] – Print the help message.
Converts a BiblioSpec 1.0 library to a 2.0 library in sqlite3 format.
LibToSqlite3 <old version lib> <new lib name>
<old version lib> – A BiblioSpec 1.0 library file.
<new lib name> – The name to be given to the converted library.
The output is a spectrum library in sqlite3 format.
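The usage lines above suggest a natural build, filter, search sequence. The following Python sketch assembles the three command lines following the documented argument order. The helper names and all file names are hypothetical, the only option used is the documented --report-file, and the sketch assumes the executables are on the PATH under the names shown.

```python
import shutil
import subprocess


def build_commands(psm_files, redundant_lib, filtered_lib, query_file,
                   report_file=None):
    """Return argv lists for BlibBuild, BlibFilter and BlibSearch.

    Argument order follows the usage lines in this document;
    all file names are supplied by the caller.
    """
    build = ["BlibBuild", *psm_files, redundant_lib]
    filt = ["BlibFilter", redundant_lib, filtered_lib]
    search = ["BlibSearch"]
    if report_file:
        search += ["--report-file", report_file]
    search += [query_file, filtered_lib]
    return [build, filt, search]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(f"{argv[0]} was not found on PATH")
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for argv in build_commands(["search.perc.xml"], "redundant.blib",
                               "filtered.blib", "query.ms2"):
        print(" ".join(argv))
```

Passing argv lists rather than a single shell string avoids quoting problems with paths that contain spaces.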
BiblioSpec makes use of several file formats for input and output. Below are descriptions of these along with links to additional information.
In most cases libraries are built from database search result files. Supported formats are listed on the BlibBuild page.
For peptide or small molecule identifications that do not come from one of the supported database searches, BiblioSpec supports a generic tab-delimited text file format referred to as ssl (spectrum sequence list). Here is a small example file. An ssl file must end with the '.ssl' extension and have a header line with the following column names in it (the score-type, score, and retention-time columns are optional):
file scan charge sequence score-type score retention-time start-time end-time
Additional columns for small molecule use may be included (the sequence column should be omitted for small molecule libraries; here is a small example file):
adduct precursorMZ moleculename inchikey otherkeys
Additional columns for ion mobility information may be included. CCS units are square Angstroms; ion mobility units can be 'ms' (drift time in msec), 'V' (FAIMS voltage), '1/K0' (inverse K0), or 'none':
ion-mobility ion-mobility-units ccs
Each of the following lines contains information for one spectrum. The first column contains a full or relative path to a file containing the spectrum (e.g. vendor formats like .raw, .wiff, etc. or .ms2, .mzML, .mzXML, .mgf).
In an .ms2 file there are four types of lines. Lines beginning with 'H' are header lines and contain information about how the data was collected as well as comments. They appear at the beginning of the file. Lines beginning with 'S' are followed by the scan number and the precursor m/z. Lines beginning with 'Z' give the charge state followed by the mass of the ion at that charge state. Lines beginning with 'D' contain information relevant to the preceding charge state. BlibToMs2's output will include D-lines with the sequence and modified sequence. The file is arranged with these S, Z and D lines for one spectrum followed by a peak list: a pair of values giving each peak's m/z and intensity. Here is an example file.
The second column has an id for that spectrum, typically a scan number or index number. The third column is the charge state of the spectrum. The fourth column contains the peptide sequence, with the addition of any modifications given as a mass shift (the difference between the modified and unmodified residue) following the modified residues. For example,
TASEFDC[+57.0]SAIO[+16.0]AQDK
Peptides with N-terminal modifications should have the mass shift follow the first residue.
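To make the modified-sequence notation concrete, here is a minimal Python sketch that splits such a string into residues and their mass shifts. The function name is hypothetical, and the sketch assumes at most one signed, bracketed shift per residue, as in the example above; it is not the parser BiblioSpec itself uses.

```python
import re

# One uppercase residue letter, optionally followed by a signed mass shift
# in square brackets, e.g. "C[+57.0]". Assumes at most one shift per residue.
_RESIDUE = re.compile(r"([A-Z])(?:\[([+-]\d+(?:\.\d*)?)\])?")


def parse_modified_sequence(sequence):
    """Return a list of (residue, mass_shift) tuples.

    Unmodified residues get a shift of 0.0. Raises ValueError if any part
    of the string does not fit the residue[shift] pattern.
    """
    residues = []
    position = 0
    for match in _RESIDUE.finditer(sequence):
        if match.start() != position:
            break  # unparsable text between two residues
        shift = float(match.group(2)) if match.group(2) else 0.0
        residues.append((match.group(1), shift))
        position = match.end()
    if not residues or position != len(sequence):
        raise ValueError(f"cannot parse modified sequence: {sequence!r}")
    return residues


def strip_modifications(sequence):
    """Return the bare residue string without mass shifts."""
    return "".join(residue for residue, _ in parse_modified_sequence(sequence))
```

Running the parser over the sequence column before building a library is one way to catch malformed entries, such as an unsigned shift or an unclosed bracket, early.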
The score-type column can be any of the following:
UNKNOWN
PERCOLATOR QVALUE
PEPTIDE PROPHET SOMETHING
SPECTRUM MILL
IDPICKER FDR
MASCOT IONS SCORE
TANDEM EXPECTATION VALUE
PROTEIN PILOT CONFIDENCE
SCAFFOLD SOMETHING
WATERS MSE PEPTIDE SCORE
OMSSA EXPECTATION SCORE
PROTEIN PROSPECTOR EXPECTATION SCORE
SEQUEST XCORR
MAXQUANT SCORE
and the score column is a floating point value representing the spectrum's score of that type. The retention time column can be used to specify retention times in minutes; otherwise the values from the spectrum file will be used. Scores fall into three categories: probability that identification is correct, probability that identification is incorrect, or not a probability score. This information can be found in the ScoreTypes table.
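As an illustration of the layout described above, the following Python sketch writes a small ssl file with the standard csv module. The function name is hypothetical, only the file, scan, charge, sequence, score-type and score columns are written, and the score value used in the example row is made up for illustration.

```python
import csv

# Column names as listed in this document (optional columns included).
SSL_COLUMNS = ["file", "scan", "charge", "sequence", "score-type", "score"]


def write_ssl(rows, handle):
    """Write rows (dicts keyed by SSL_COLUMNS) as tab-delimited ssl text."""
    writer = csv.DictWriter(handle, fieldnames=SSL_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


if __name__ == "__main__":
    example_rows = [
        # The score below is an illustrative value, not real data.
        {"file": "demo.ms2", "scan": 1806, "charge": 2,
         "sequence": "LAESITIEQGK",
         "score-type": "PERCOLATOR QVALUE", "score": 0.01},
    ]
    # The output file name must end with the '.ssl' extension.
    with open("example.ssl", "w", newline="") as out:
        write_ssl(example_rows, out)
```

Opening the file with newline="" lets the csv module control line endings, which keeps the output identical across platforms.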
Library files
BiblioSpec library files are in the sqlite3 format, usually with a ".blib" filename extension. Each library is a small database that you can search and manipulate with standard SQL commands using, for example, the sqlite3 command line tools or SQLite Expert Personal.
Details on the BiblioSpec SQLite schema can be found here.
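Because a library is an ordinary sqlite3 database, its layout can be inspected without knowing the schema in advance. The Python sketch below lists every table and its columns using only generic SQLite facilities (sqlite_master and PRAGMA table_info); the function name and the library path are hypothetical, and no BiblioSpec table names are assumed.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def describe_library(library_path):
    """Return {table name: [column names]} for a sqlite3 library file.

    The file is opened read-only, so a mistyped path raises an error
    instead of silently creating an empty database.
    """
    uri = Path(library_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        layout = {}
        for table in tables:
            quoted = table.replace('"', '""')
            columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
            layout[table] = [column[1] for column in columns]
        return layout


if __name__ == "__main__":
    # "demo.blib" is a placeholder path.
    for table, columns in describe_library("demo.blib").items():
        print(table, columns)
```

Once the table and column names are known, ordinary SELECT statements can be run against the same connection.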
file	scan	charge	sequence
demo.ms2	8	3	VGAGAPVYLAAVLEYLAAEVLELAGNAAR
demo.ms2	1806	2	LAESITIEQGK
demo.ms2	2572	2	ELAEDGC[+57.0]SGVEVR
demo.ms2	3088	2	TTAGAVEATSEITEGK
demo.ms2	3266	2	DC[+57.0]EEVGADSNEGGEEEGEEC[+57.0]
demo.ms2	9734	3	IWELEFPEEAADFQQQPVNAQ[-17.0]PQN
demo.ms2	20919	3	VHINIVVIGHVDSGK
../elsewhere/spec.mzXML	00497	2	LKEPAQNTADNAK
../elsewhere/spec.mzXML	00680	2	ALEGPGPGEDAAHSENNPPR
../elsewhere/spec.mzXML	00965	2	FFSHEAEQK
../elsewhere/spec.mzXML	01114	2	C[+57.0]GPSQPLK
../elsewhere/spec.mzXML	01382	2	AVHVQVTDAEAGK
H CreationDate Mon Apr 12 15:12:14 2010
H Extractor BlibToMs2
H Library /home/me/research/search/demo.blib
S 1 1 636.34
Z 2 1253.36
D seq FKNGFQTGSASK
D modified seq FKNGFQTGSASK
187.40 12.5
193.10 19.5
194.30 13.7
198.30 29.8
199.10 12.2
208.30 23.1
208.90 11.4
210.30 11.8
213.00 3.3
214.50 4.3
216.10 32.8
219.10 11.2
221.00 14.3
222.10 64.0
225.10 16.6
226.00 31.6
228.30 7.2
229.10 8.5
230.50 58.2
231.20 236.1
232.20 75.8
233.60 2.4
234.20 51.4
235.10 5.6
236.30 30.2
239.70 14.4
241.30 34.8
242.30 14.2
244.30 9.0
S 2 2 745.3
Z 2 1471.7
D seq NFLETVELQVGLK
D modified seq NFLETVELQVGLK
1224.60 7.9
1228.70 468.9
1230.40 658.5
1231.50 144.2
1240.00 11.7
1242.70 45.9
1243.80 16.8
1253.80 17.2
1255.00 7.9
1255.80 14.4
1259.70 15.5
1273.10 5.9
1275.90 10.5
1277.10 7.8
1283.30 4.7
1296.50 19.2
1299.50 13.0
1307.40 6.1
1308.40 21.3
1313.00 1.7
1313.80 5.5
1315.40 3.6
1316.80 22.3
1323.90 1.5
1325.50 40.5
1326.30 75.9
S 3 3 732.1
Z 2 1444.7
D seq NEVSAMPTLLLFK
D modified seq NEVSAMPTLLLFK
209.00 62.5
210.30 12.8
216.00 87.0
220.10 58.0
224.90 4.9
226.10 418.2
227.00 68.3
227.90 46.7
229.20 13.3
231.10 12.7
238.10 209.1
239.20 15.0
244.10 953.8
245.20 90.0
245.90 20.4
252.30 8.8
255.30 38.8
260.20 9.4
262.10 35.0
270.00 10.9
275.80 21.8
277.40 6.3
279.10 12.7
280.20 49.8
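The four MS2 line types (H, S, Z, D) plus peak lines are simple enough to read with a short Python sketch. The function name and the dictionary keys are hypothetical; the sketch assumes whitespace-separated fields and takes the last field of each S line as the precursor m/z, which matches the example file shown here.

```python
def parse_ms2(lines):
    """Parse MS2 text into (headers, spectra).

    headers is a list of H-line contents; spectra is a list of dicts with
    scan, precursor_mz, charges, details and peaks entries.
    """
    headers = []
    spectra = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split()
        tag = fields[0]
        if tag == "H":
            headers.append(line[1:].strip())
        elif tag == "S":
            # First field after the tag is the scan number; the last
            # field is the precursor m/z.
            current = {"scan": int(fields[1]),
                       "precursor_mz": float(fields[-1]),
                       "charges": [], "details": [], "peaks": []}
            spectra.append(current)
        elif current is None:
            raise ValueError(f"line found before the first S line: {line!r}")
        elif tag == "Z":
            # Charge state followed by the mass of the ion.
            current["charges"].append((int(fields[1]), float(fields[2])))
        elif tag == "D":
            current["details"].append(line[1:].strip())
        else:
            # Anything else is a peak: m/z and intensity.
            current["peaks"].append((float(fields[0]), float(fields[1])))
    return headers, spectra
```

The function accepts any iterable of lines, so an open file handle can be passed directly, for example `with open("demo.ms2") as handle: headers, spectra = parse_ms2(handle)`, where the file name is a placeholder.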
Note that the fields in the small molecule example below are tab separated, and the otherkeys field itself is tab separated.
file	scan	charge	adduct	inchikey	chemicalformula	moleculename	otherkeys
dexcaf_051017.mzML	01369	-1	[M-H]	ZXPLRDFHBYIQOX-BTBVOZEKSA-N	C24H44O21N0	Glc04Reduced
dexcaf_051017.mzML	01639	-1	[M-H]	NBVGBCYERZIRIP-JAMOUWTMSA-N	C30H54O26N0	Glc05Reduced
dexcaf_051017.mzML	01855	-1	[M-H]	PNHJKLJIDNHXFR-ZGJYWSOBSA-N	C36H64O31N0	Glc06Reduced
dexcaf_051017.mzML	02029	-1	[M-H]	NVKJDLBVRSXYRE-BMFDHOHESA-N	C42H74O36N0	Glc07Reduced
dexcaf_051017.mzML	02179	-1	[M-H]	YMRGEPQWJZHXFF-MGQBKJSVSA-N	C48H84O41N0	Glc08Reduced
dexcaf_051017.mzML	01079	-1	[M-H]	RYYVLZVUVIJVGH-UHFFFAOYSA-N	C8H10N4O2	Caffeine	"InChI:1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3	HMDB:01847	CAS:58-08-2	SMILES:Cn1cnc2n(C)c(=O)n(C)c(=O)c12"
The BiblioSpec source code is available as part of the ProteoWizard project on GitHub:
A Mascot Parser license should also be requested if you intend to parse Mascot .dat files; once you receive the download instructions:
C:\Program Files (x86)\Matrix Science\Mascot Parser (32-bit) or C:\Program Files\Matrix Science\Mascot Parser (64-bit)
/usr/local/msparser/gnu
You will need Visual Studio 2017 to build.
trunk/pwiz directory from the ProteoWizard git repository.
(%HOMEDIR%\Documents\pwiz)
-j<number of threads to build with> --hash optimization=space address-model=<32|64> pwiz_tools/BiblioSpec > path-to-build-log-file. For example, a 64-bit Windows build using four threads may be done with the command "quickbuild.bat -j4 --hash optimization=space address-model=64 pwiz_tools\BiblioSpec > build.log". Note that "optimization=space" may be omitted for a debug build.
The resulting build will be located in the build-<os>-<architecture>/pwiz_tools/BiblioSpec directory.
64-bit Windows binaries (BlibBuild.exe, BlibFilter.exe and BlibToMs2.exe) are available from the ProteoWizard automated build and test system. (If you encounter a "Log in to TeamCity" page, just click the "log in as guest" link to proceed.) Note that this download does not provide a proper installer: it is intended for updating existing installations only, and it likely will not function if unzipped to a bare directory.