Missing Values in DIA data analysis

klemens froehlich  2021-12-09

Dear Skyline developer team,

We are currently in the review process of a manuscript in which we compare multiple DIA data analysis suites, spectral library generation methods, and data analysis strategies using a large-scale benchmark dataset (92 samples with a human background and an E. coli spike-in).

We have one condition where no E. coli proteins are present.

Following the tutorial "Analysis of DIA/SWATH data in Skyline", we exported our data at the protein level and found that, for most E. coli proteins, values are reported in the samples where no E. coli is present.

My question is this: Which additional filtering options would you recommend to avoid such background quantifications? These values are lower than the intensities in the spike-in conditions, so at this point we would simply advise Skyline users to be aware of this "background" reporting of protein intensities.

Best, Klemens

Nick Shulman responded:  2021-12-09
Skyline tries really hard to find a peak for the peptide, even if the peptide is not present in the sample.

One thing that you can do is train an mProphet model, and assign q-values to the results, and tell Skyline not to integrate the peaks with bad q-values.
You can learn about training an mProphet model here:

I know in Skyline-Daily there are some changes to make it easier to get q-values without having to train an mProphet model. In Skyline-Daily, if you have decoy peptides in your document when you extract chromatograms, those decoys will be used to calibrate Skyline's default peak picking model (i.e. to quantify what qualifies as a "good-looking peak"), and your results will be assigned q-values.

I know we have plans for future versions of Skyline to have a more robust way of deciding that a peptide is not present, but, for now, the "Detection Q Value" column in the Document Grid is the only indication Skyline can give you that the peptide is probably not present.
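
For readers unfamiliar with how decoys lead to q-values, here is a minimal sketch of the generic target-decoy calculation. It is not Skyline's or mProphet's actual implementation; the function name `q_values`, the scores, and the assumption that a higher score means a better-looking peak are all illustrative.

```python
def q_values(target_scores, decoy_scores):
    """Return one q-value per target score (higher score = better peak).

    Generic target-decoy estimate: FDR at a score threshold is taken as
    (# decoys at or above it) / (# targets at or above it).
    """
    labeled = [(s, False) for s in target_scores]
    labeled += [(s, True) for s in decoy_scores]
    labeled.sort(key=lambda pair: pair[0], reverse=True)  # best score first

    # Walk from best to worst and record the estimated FDR at each target.
    fdr_at_target = []
    targets = decoys = 0
    for score, is_decoy in labeled:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
            fdr_at_target.append((score, min(1.0, decoys / targets)))

    # The q-value is the smallest FDR at this score or any worse score,
    # which makes q-values non-increasing as the score improves.
    q_by_score = {}
    running_min = 1.0
    for score, fdr in reversed(fdr_at_target):
        running_min = min(running_min, fdr)
        q_by_score[score] = running_min
    return [q_by_score[s] for s in target_scores]


# Made-up scores: three clearly good target peaks, two that score like decoys.
targets = [9.0, 8.0, 7.0, 3.0, 2.0]
decoys = [6.0, 2.5, 1.0]
qs = q_values(targets, decoys)
integrate = [q <= 0.01 for q in qs]  # keep only peaks passing a 1% cut-off
print(qs, integrate)
```

With a 0.01 cut-off, only the first three peaks in this toy example would be integrated; the other two would be left as missing values.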
-- Nick
Brendan MacLean responded:  2021-12-09
Hi Klemens,
Can we have a look at this case? If you followed the DIA/SWATH tutorial, then I think you already trained an mProphet model and used a 0.01 q value cut-off for peak detection, as Nick was suggesting. So, it sounds like you are saying that the model still reported E. coli peptides and proteins as detected and quantified, though hopefully not as changing at an adjusted p value of less than 0.05.

If you feel this is quite different from other tools, then likely there is some issue not yet identified in your settings. In every prior comparison we have been involved in, we have found that Skyline, OpenSWATH, and Spectronaut perform quite similarly with comparable settings.

Thanks for seeking our input on your comparison.

klemens froehlich responded:  2021-12-12
Dear Nick and Brendan,
I think I found the mistake: I trained the mProphet model but forgot to go back into the reintegrate settings and apply the q value filtering.

I would be happy to share the document with you, but it is 22GB in size. I could send you a PM / Email with a drive link if that is okay for you.

I now face another challenge: When I export protein abundances, I get a lot of NAs, even for proteins where some peptides still have quantifications after q value filtering with the mProphet model.
I do not know how Skyline calculates protein abundances when some peptides have NAs in some samples because of the q value filtering of integration. Can you comment on that, please?

Best, Klemens
Nick Shulman responded:  2021-12-13
If any of the transitions have missing values, then the Protein Abundance value for that particular replicate will be #N/A.
This is because we did not think that it would make sense to compare protein abundance values across replicates if those abundances were obtained by summing across a different set of transitions.
The Protein Abundance value will also be #N/A if any of the transition peaks are "truncated", which is when the edge of the peak coincides with the edge of the extracted chromatogram. A truncated peak's area cannot be relied on, so in terms of the Protein Abundance calculation, a truncated peak is treated the same as a missing peak.
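
The rule described above can be sketched in a few lines. This is only an illustration of the stated behaviour, not Skyline's code; the function name and the use of `None` to stand in for #N/A are assumptions.

```python
from typing import Optional, Sequence


def protein_abundance_sum(areas: Sequence[Optional[float]],
                          truncated: Sequence[bool]) -> Optional[float]:
    """Sum the transition areas of one protein in one replicate.

    Returns None (standing in for #N/A) as soon as any transition is
    missing or truncated, so that every reported sum covers the same
    set of transitions.
    """
    for area, is_truncated in zip(areas, truncated):
        if area is None or is_truncated:
            return None
    return float(sum(areas))


# Made-up peak areas for three transitions of one protein.
complete = protein_abundance_sum([100.0, 50.0, 25.0], [False, False, False])
one_missing = protein_abundance_sum([100.0, None, 25.0], [False, False, False])
one_truncated = protein_abundance_sum([100.0, 50.0, 25.0], [False, True, False])
print(complete, one_missing, one_truncated)
```

A single transition removed by q value filtering is therefore enough to turn the whole protein value for that replicate into #N/A, which is why filtering produces so many NAs at the protein level.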

It sounds like that part of Skyline is behaving as expected.

I will send you an email directly in case you still want to send us a file link.
You are also welcome to upload files of any size (even 22GB) here:
-- Nick
klemens froehlich responded:  2021-12-13
Hi Nick,
Sent you the drive link.

We specifically wanted to compare protein-level quantification between different tools, and especially how different combinations of spectral library and DIA analysis tool influence downstream analysis, in particular statistical testing tools.

Would you then recommend using a different protein quantification approach instead of summing up the transitions? Can Skyline do that? I think MSstats can do something like this.

Best Klemens
Nick Shulman responded:  2021-12-13
If you want to compare the amount of proteins between different groups of replicates, the way that MSstats does, you should use the Group Comparison feature in Skyline.

You can learn about the Group Comparison feature in this tutorial:

The way to configure a group comparison in Skyline is with the menu item:
View > Other Grids > Group Comparison > Add
If you push the "Advanced" button in the "Edit Group Comparison" dialog, you can tell Skyline to use Tukey's Median Polish and you can also tell Skyline to assume that missing values are equal to zero.

I am not sure exactly what combination of those two settings we recommend for dealing with missing values in the group comparison (I would try starting with just Tukey's Median Polish and see if that works). There might be something in this article that says what the best thing to do is:

-- Nick
klemens froehlich responded:  2021-12-16
Hi Nick,

Thank you for your input on this matter!

As we want to test different statistical approaches we would need the protein abundance matrix in R.

Would this be an option:
perform the DIA analysis,
train the mProphet model,
apply the integration q value filtering,
export an MSstats report (as this is at the fragment intensity level, if I remember correctly),
and then do protein inference / summarization in MSstats using R?

But does this mean that Skyline itself does not allow meaningful protein abundance estimation when q value filtering from an mProphet model is applied? Because, as you said, summing over a different set of transitions in each replicate would not yield good results?

Best, Klemens
Nick Shulman responded:  2021-12-16
Yes, exporting the "MSstats Input" report and then doing calculations in R would be a fine thing to do.
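
As a rough sketch of the first step after export, the snippet below turns a transition-level report into a features-by-runs matrix for one protein, keeping missing areas as missing. It is written in Python purely for illustration (the thread's plan is to do this in R), and the column names, the sample rows, and both function names are assumptions; a real report exported from Skyline may use different headers.

```python
import csv
import io

# Made-up rows in the shape of a transition-level report.
# The column names are assumptions, not a guaranteed Skyline layout.
REPORT = """ProteinName,PeptideSequence,FragmentIon,FileName,Area
P1,PEPTIDEA,y5,run1,1200
P1,PEPTIDEA,y6,run1,800
P1,PEPTIDEA,y5,run2,#N/A
P1,PEPTIDEA,y6,run2,640
"""


def parse_area(text):
    """Convert an Area cell to float, or None for blank or #N/A cells."""
    text = text.strip()
    if text in ("", "#N/A", "NA", "NaN"):
        return None
    return float(text)


def feature_matrix(report_text, protein):
    """Return (features, runs, matrix) for one protein.

    Rows are (peptide, fragment) features, columns are runs, and cells
    that were not integrated stay None instead of being set to zero.
    """
    areas = {}
    for row in csv.DictReader(io.StringIO(report_text)):
        if row["ProteinName"] != protein:
            continue
        feature = (row["PeptideSequence"], row["FragmentIon"])
        areas[(feature, row["FileName"])] = parse_area(row["Area"])
    features = sorted({feature for feature, _ in areas})
    runs = sorted({run for _, run in areas})
    matrix = [[areas.get((f, r)) for r in runs] for f in features]
    return features, runs, matrix


features, runs, matrix = feature_matrix(REPORT, "P1")
print(features, runs, matrix)
```

Keeping the missing cells explicit, rather than dropping rows or filling in zeros, leaves the choice of how to handle them to the summarization step.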

"Summarization" is the process of taking the multiple transition peak areas that you have for a protein and turning that into one number. When doing a group comparison in MSstats or in Skyline, there are a couple of options for how to do the summarization. One option is "summing", which is just adding up the transition peak areas. A different summarization option is "Tukey's median polish".
Yes, I think that "summing" is not a good summarization technique if there are missing values. I think "Tukey's median polish" is designed to yield meaningful results when there are missing values.
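
To make the difference concrete, here is a minimal sketch of Tukey's median polish on a small made-up matrix of log2 transition areas (rows are transitions, columns are replicates, `None` is a missing value). It follows the textbook algorithm and is not the MSstats or Skyline implementation; taking the replicate-level value as the overall effect plus the column effect is an assumption about how the result is typically used.

```python
from statistics import median


def _median_or_zero(values):
    """Median of the non-missing values, or 0.0 if all are missing."""
    present = [v for v in values if v is not None]
    return median(present) if present else 0.0


def median_polish(matrix, max_iter=10, tol=1e-9):
    """Tukey's median polish; returns (overall, row_effects, col_effects)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    resid = [list(row) for row in matrix]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep the row medians out of the residuals.
        row_delta = [_median_or_zero(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_delta[i]
            resid[i] = [None if v is None else v - row_delta[i]
                        for v in resid[i]]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep the column medians out of the residuals.
        col_delta = [_median_or_zero([resid[i][j] for i in range(n_rows)])
                     for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_delta[j]
            for i in range(n_rows):
                if resid[i][j] is not None:
                    resid[i][j] -= col_delta[j]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if all(abs(d) <= tol for d in row_delta + col_delta):
            break  # nothing left to sweep out
    return overall, row_eff, col_eff


# Made-up log2 areas: each value is (transition effect + replicate effect),
# with one cell removed as if it had failed the q value cut-off.
log2_areas = [
    [10.0, 11.0, None],
    [12.0, 13.0, 15.0],
    [14.0, 15.0, 17.0],
]
overall, row_eff, col_eff = median_polish(log2_areas)
run_summary = [overall + c for c in col_eff]
print(run_summary)
```

In this toy example the replicate-level values come out as 12, 13, and 15, the same as they would with the missing cell present, whereas a plain sum or mean of the third column would shift because it is computed over a different set of transitions.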

It sounds like Skyline should offer to do Tukey median polish when calculating the Protein Abundance number. We could probably release a Skyline-Daily in January which had that feature in it.

I am not sure what sort of calculations you are going to be doing in R, but if they yield good results, it might be helpful if you could send us the code so that we could try to implement the same thing in Skyline.
-- Nick