how to improve mProphet model for difference between targets and decoys?

how to improve mProphet model for difference between targets and decoys? Matthias  2019-07-03

Dear Skyline Team,

I was analyzing a DIA data set following the very nice and super helpful Webinar 14 (DIA large scale).

However, when I came to the point of training the mProphet model, I observed that in my data the separation between targets and decoys is not as clean as in the video or in the DIA tutorial that was handed out during a course last year in Seattle.

Please see the attached PDF with some of the plots.

I was wondering how one could improve the target-decoy discrimination.
Would this be achievable by, e.g., requiring 6 transitions per peptide rather than 4 before adding the decoy peptides?
(In the webinar, Brendan uses Edit => Refine => min 4 transitions per peptide.)

Best regards and thanks for any help.

Brendan MacLean responded:  2019-07-03

Hi Matthias,
Thanks for the screenshots!

Both in Webinar 14 and the course DIA materials, the library is what is known as a "sample specific" library, created without fractionation (as outlined in Bruderer, MCP 2015) by simply running DDA on the same sample prep as the DIA. This technique then simply asks DIA and mProphet to find the same peptides that were found with DDA under the same sample prep conditions.

The library you are using, however, contains targets for 8,660 proteins with over 1 million fragment ion transitions. This is a deep proteome-wide library (possibly Rosenberger, Sci. Data 2014, PXD000954), likely created with fractionation and possibly including sample preparation different from your own samples (as outlined in Selevsek, MCP 2015). You should expect a very different mProphet model in this case, because it is far less likely that you can detect all of the targets in DIA under your sample prep.

In fact, the large distribution of targets that form a shape very similar to your decoys should give you confidence that the model is working as expected, because this is what is expected of libraries like this one. You will not be able to detect a lot of the peptides in this kind of library, and those failures are expected to look a lot like your decoys.
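To make the reasoning above concrete, here is a minimal sketch of plain target-decoy counting. It is not Skyline's or mProphet's actual implementation; the function name, the scores, and the assumption of equally sized target and decoy sets are all hypothetical and only for illustration. It shows how targets that score like decoys end up with high q values while well-separated targets keep low ones.

```python
def estimate_q_values(target_scores, decoy_scores):
    """Return one q value per target score (higher score = better).

    Assumption: simple counting of decoys above each target score,
    with targets and decoys present in roughly equal numbers.
    """
    labeled = [(s, False, None) for s in decoy_scores]
    labeled += [(s, True, i) for i, s in enumerate(target_scores)]
    # Best scores first; on ties, decoys come first (conservative).
    labeled.sort(key=lambda item: (-item[0], item[1]))

    fdrs = []  # (index into target_scores, estimated FDR at that score)
    n_targets = 0
    n_decoys = 0
    for _score, is_target, idx in labeled:
        if is_target:
            n_targets += 1
            fdrs.append((idx, min(1.0, n_decoys / n_targets)))
        else:
            n_decoys += 1

    # q value = lowest FDR at which this target would still be accepted.
    q_values = [0.0] * len(target_scores)
    running_min = float("inf")
    for idx, fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values[idx] = running_min
    return q_values


# Hypothetical scores: three well-separated targets, three decoy-like ones.
example_targets = [5.0, 4.0, 3.0, 0.5, 0.2, -0.1]
example_decoys = [0.6, 0.3, 0.0, -0.2, -0.5, -1.0]
example_q = estimate_q_values(example_targets, example_decoys)
print(sum(q < 0.01 for q in example_q), "of", len(example_q), "targets at q < 0.01")
```

With a deep library, a large decoy-like target population of this kind is expected and does not by itself indicate a broken model.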

If you used the Pan Human library (PXD000954), then the targets should already be limited to 6 transitions per precursor.

So, what was the source of your library? What type of sample are you searching against (e.g. HeLa cell lysate, plasma, or ...)? And what instrument did you use, with what isolation scheme? How many proteins and peptides are detected at a q value < 0.01?

You should not expect this model to ever look like a model from a "sample specific" library, but there may still be things you could do to improve.


Matthias responded:  2019-07-03

Hi Brendan,

First of all, thank you very much for the fast and comprehensive answer! =)

I must have missed the fact that those were the same samples measured twice (DDA & DIA).

We generated the library in house, measuring 16 high pH fractions of a HEK sample on a Q E plus instrument.
The DIA files are also from a HEK sample, and we tested different settings using 27 variable windows (3 replicates each).

We were mainly interested in how many points across the peak we would get using slightly different methods (e.g. 80 ms max IT vs 60 ms max IT for each DIA window).

Furthermore, we want to evaluate how many false positives one would find using e.g. an E. coli library and importing an HEK DIA file.

For this purpose, what would be the best way to see how many proteins & peptides are detected with a q value < 0.01?
Is there a fast way to see how many peptides were identified & quantified with a q value < 0.01 and at least 10 points per peak / transition on average?

Best regards and many thanks again for the great support

Brendan MacLean responded:  2019-07-03

Hi Matthias,
I guess that looks like a somewhat lower find rate than I would expect from 16 HpH fractions versus unfractionated DIA. How confident are you in your DDA error rates for the HpH fractions? That much fractionation requires careful control of FDR with a tool like Mayu.

You might also try a side-by-side test with targets where you include precursor ion extraction from MS1 (3 isotopes: monoisotopic, [M+1] and [M+2]). I have found that mProphet models that include MS1 precursor information do slightly better, and they certainly increase my confidence in the identifications when precursors with small-mass PTMs (oxidation, deamidation) may end up in the same DIA isolation window. This will, however, greatly increase your transition count, to the point where it might become a bit unwieldy in Skyline with > 1.5 million transitions.
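As a rough illustration of the three MS1 precursor isotopes mentioned above, the sketch below approximates the [M+1] and [M+2] m/z values by adding multiples of the 13C-12C mass difference divided by the charge. This is a simplification with a hypothetical function name; Skyline itself derives isotope peaks from the full isotope distribution of the molecule.

```python
C13_C12_DELTA = 1.0033548  # Da, mass difference between 13C and 12C


def precursor_isotope_mzs(mono_mz, charge, n_isotopes=3):
    """Approximate m/z of the monoisotopic, [M+1], [M+2], ... precursor peaks.

    Assumption: isotope spacing is taken as the 13C-12C mass difference,
    which is only an approximation of the true isotope envelope.
    """
    return [mono_mz + k * C13_C12_DELTA / charge for k in range(n_isotopes)]


# Hypothetical doubly charged precursor at m/z 500.0
for label, mz in zip(["M", "M+1", "M+2"], precursor_isotope_mzs(500.0, 2)):
    print(f"{label}: {mz:.4f}")
```

Each precursor therefore adds three MS1 transitions on top of its fragment ions, which is where the growth in transition count comes from.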

While it is not as easy as I would like, it is possible to create a report in Skyline which will give you some information on q value detection. You would need to use the pivot editor, described here:

You could start by creating a per-precursor pivot that calculates the min and max q value (Detection Q Value) for each precursor. Filtering by min < 0.01 would then give you all precursors where at least one peak got a q value < 0.01, and doing the same with max would give you all precursors detected in all of your runs. Or you could export a report with Detection Q Value and use R or even Excel to get more fine-tuned results and make queries like detection in at least half of your samples. The analogous protein-level question might just be a pivot on your peptide pivot, again calculating min and max.
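For the export route, the min/max logic described above could be sketched in Python roughly as follows. The column names "Peptide Modified Sequence", "Precursor Charge" and "Replicate" are assumptions about how the exported report was set up (only "Detection Q Value" is named above), and the demo rows are made up; adjust the names to match the header of your own export.

```python
import csv
import io
from collections import defaultdict


def summarize_detections(rows, q_cutoff=0.01):
    """Per precursor: min/max q value and detection in any/all/half of the runs."""
    q_by_precursor = defaultdict(list)
    for row in rows:
        # Assumed column names; one row per precursor per replicate.
        key = (row["Peptide Modified Sequence"], row["Precursor Charge"])
        try:
            q = float(row["Detection Q Value"])
        except (TypeError, ValueError):
            q = float("inf")  # missing value (e.g. "#N/A") counts as not detected
        q_by_precursor[key].append(q)

    summary = {}
    for key, qs in q_by_precursor.items():
        n_detected = sum(q < q_cutoff for q in qs)
        summary[key] = {
            "min_q": min(qs),
            "max_q": max(qs),
            "detected_any": min(qs) < q_cutoff,   # at least one run
            "detected_all": max(qs) < q_cutoff,   # every run
            "detected_half": 2 * n_detected >= len(qs),
        }
    return summary


# Made-up demo data standing in for an exported report file.
demo_csv = """Peptide Modified Sequence,Precursor Charge,Replicate,Detection Q Value
PEPTIDEK,2,run1,0.001
PEPTIDEK,2,run2,0.004
ELVISK,2,run1,0.002
ELVISK,2,run2,0.2
LIVESK,3,run1,0.5
LIVESK,3,run2,#N/A
"""
demo_summary = summarize_detections(csv.DictReader(io.StringIO(demo_csv)))
print(sum(s["detected_any"] for s in demo_summary.values()), "precursors detected in at least one run")
print(sum(s["detected_all"] for s in demo_summary.values()), "precursors detected in all runs")
```

For a real export, replace the `io.StringIO(demo_csv)` part with a file opened via `open(path, newline="")`. Grouping by a protein column instead of the precursor key would give the corresponding protein-level counts.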

We are working on peptide and protein level q values as described in this paper:

Which was really designed for this type of proteome-wide fractionation library, where error control becomes very important. Error control is less of an issue with the sample specific libraries, because they limit the search space to the point where error accumulation is less of an issue.

Hope this helps. Good luck with your exploration.


Matthias responded:  2019-07-08

Dear Brendan,

What do you mean by a lower find rate than you would expect from 16 HpH fractions versus unfractionated DIA?
Would you expect even more than 8000 proteins to be found in a DIA run when searching against a library built from 16 fractions?

I did the analysis with MaxQuant, using the fractions option in the experimental design and setting the FDR to 1% at the PSM & protein level.

I also tried to export the MSstats Input report, but this seems to have over 4 million lines, which makes it not that easy to process further.

Maybe a simpler, less complex report schema containing only proteins / peptides which were found with a q value < 0.01 would be helpful to get a fast overview of the performance and the data quality with the current Skyline settings.
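Until such a report exists, a multi-million-line export can be reduced outside Skyline by streaming it row by row and keeping only rows below the q value cutoff. The sketch below is one way to do that; the function name is hypothetical, and the default column name "Detection Q Value" is an assumption that should be checked against the header of the exported file, since it may be spelled differently in the MSstats Input report.

```python
import csv
import io


def filter_report_by_q(in_handle, out_handle, q_column="Detection Q Value", q_cutoff=0.01):
    """Copy only rows with q value < q_cutoff; returns the number of rows kept.

    Rows are processed one at a time, so memory use stays small even for
    reports with millions of lines.
    """
    reader = csv.DictReader(in_handle)
    writer = csv.DictWriter(
        out_handle,
        fieldnames=reader.fieldnames,
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        try:
            q = float(row[q_column])
        except (TypeError, ValueError):
            continue  # missing q value (e.g. "#N/A"): treat as not detected
        if q < q_cutoff:
            writer.writerow(row)
            kept += 1
    return kept


# Made-up demo input; for real files use open(path, newline="") for both handles.
demo_in = io.StringIO("Protein,Detection Q Value\nP1,0.001\nP2,0.5\nP3,#N/A\n")
demo_out = io.StringIO()
print(filter_report_by_q(demo_in, demo_out), "row(s) kept")
```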

Best regards