Training a peak scoring model with large-scale DIA data

Training a peak scoring model with large-scale DIA data lihaikuo  2018-04-29 00:53
 
Hi, when I follow the procedure (reintegrate--add--train) and try to edit the peak scoring model, I get the figure shown in the attachment.
As you can see, I cannot make out the detailed distribution: the horizontal axis range is too large, and I cannot distinguish targets from decoys. This is my question.
Perhaps there are some targets or decoys with very low scores? If so, how can I delete those targets?

Here, more than 200 DIA raw files were imported, with more than 200,000 transitions.
 
 
Nick Shulman responded:  2018-04-29 06:46
I have not seen anything that looks like that.
You can right-click on that graph and use the menu item "Copy Data" and paste all those numbers into something like Microsoft Excel. I suspect there is a really low number there, but I have no idea what could cause a number that low.

If you apply the model, then you can see the scores for all of the peptides in the Document Grid. The column is called "Detection Z Score" and it is on the Precursor Result.

I imagine Skyline is doing something wrong. It would help if we could see your Skyline document.
In Skyline, you can use the menu item:
File > Share > (complete)
to create a .zip file containing your Skyline document and supporting files including extracted chromatograms.

You can upload that .zip file here:
https://skyline.ms/files.url
 
Brendan MacLean responded:  2018-04-29 11:55
It does seem like we should use max() and min() to limit that particular graph to some reasonable number of standard deviations from the decoy mean score (normalized to zero). But, yes, it would also be useful to understand what exactly is causing such an extreme value and whether it is a decoy or target.
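(A hypothetical sketch of that axis-limiting idea, using only the standard library: clip composite scores to k standard deviations around the decoy mean before plotting, so a single extreme outlier cannot stretch the x-axis. The function name and the example numbers are mine, not Skyline's.)

```python
from statistics import fmean, pstdev

def clip_scores(scores, decoy_scores, k=5.0):
    """Clip scores to within k standard deviations of the decoy mean."""
    center = fmean(decoy_scores)   # decoy mean (zero after normalization)
    spread = pstdev(decoy_scores)
    lo, hi = center - k * spread, center + k * spread
    return [min(max(s, lo), hi) for s in scores]

scores = [-86.0, -2.1, 0.3, 1.8, 4.2]
decoys = [-1.5, -0.5, 0.0, 0.4, 1.6]
print(clip_scores(scores, decoys))  # the -86 outlier is pulled in to about -5.1
```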

But let's just consider the potential features and weights. It seems like Intensity and RT difference are the only candidates. The only other negative coefficient is on Coelution, which should vary between 0 and 1, so it couldn't possibly have this impact. Similarly, intensity is actually log(intensity), and with a weight of -0.037 it would require a log(intensity) of around 3000 to achieve a score close to -90. I'm not sure I have seen a log(intensity) higher than 12. Even RT difference, with a weight of -0.3, would require a delta of about 300 minutes to cause this score.
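(For what it's worth, the back-of-envelope arithmetic above can be checked in a few lines; the helper name is mine, and the weights and the -90 target are the values quoted in this thread.)

```python
def required_feature_value(target_score, weight):
    """Feature value needed for a single weighted feature to reach target_score."""
    return target_score / weight

# log(intensity) needed to reach -90 on its own with weight -0.037:
print(required_feature_value(-90, -0.037))  # far larger than typical values (< 12)

# RT difference (minutes) needed to reach -90 on its own with weight -0.3:
print(required_feature_value(-90, -0.3))    # 300 minutes
```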

Could you possibly be using full gradient chromatogram extraction on a very long gradient?

This is really why we have the Feature Scores tab. Can you switch to that tab and look at the graphs for Intensity and RT difference? Does either show any evidence of a peptide or peptides that would cause a score this extreme?

Thanks for posting this to the support board.

--Brendan
 
lihaikuo responded:  2018-05-08 00:20
Hi Nick and Brendan,
Thanks a lot for your responses, and sorry for my slow reply.

I reintegrated and trained the model again. I used 'Copy Data' in the Model Scores tab and found ~1,000 counts with a score lower than -8.6, out of ~530,000 counts in total, as the attachments 'model score data.png' and 'model_training data.xlsx' show. (I used Copy Data in all of the tabs; all of the data is in the Excel file.)

Brendan, regarding your questions: the Feature Scores tabs are attached. The Intensity and RT difference graphs look normal to me.
I do not think I used a very long gradient. In our experiment, the gradient of Buffer B goes from 3% to 80% over 75 minutes.

Nick, regarding your response: what is the meaning of the Z Score? Is it the same as the 'Composite Score'?
Can Skyline help me filter out the targets with a composite score lower than -8.6? I want to delete these targets before I reintegrate and train again; it is not practical to delete so many targets one by one.

I do not think anything is wrong with our experiment, but I admit that several of the imported DIA raw files are of poor quality (due to the nature of our samples). Could this be what produced the targets with very low scores?
Should I import only the good-quality DIA raw files?

Thank you very much; I look forward to your reply.
Haikuo
 
Nick Shulman responded:  2018-05-08 09:06
Yes, the Detection Z Score is the same as the composite score. It is the number obtained by summing the feature scores multiplied by the feature weights. The Detection Z Score is also meant to represent a number of standard deviations, such that there should be very few peptides with a Z score above 4 or below -4.
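(A minimal sketch of how such a composite score is assembled: a dot product of per-peak feature scores with the trained weights. The feature names, values, and weights here are illustrative, not the actual model.)

```python
# Illustrative trained weights (assumed values, not Skyline's real model)
weights = {
    "log_intensity": -0.037,
    "rt_difference": -0.3,
    "coelution": 0.8,
}

def composite_score(features, weights):
    """Sum each feature score multiplied by its weight."""
    return sum(weights[name] * value for name, value in features.items())

# One hypothetical peak's feature scores:
peak = {"log_intensity": 10.5, "rt_difference": 0.7, "coelution": 0.9}
print(composite_score(peak, weights))  # -> 0.1215
```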

It would be interesting to see what is causing that score of -86 on that one peptide of yours.
 
lihaikuo responded:  2018-05-11 05:31
Hi Nick,
Thanks for your reply.

I tried to export the report with the Z Score.
But in the exported .csv file, most of the peptide scores are #N/A, and the remaining scores are mostly around 3.
What does #N/A mean? Is it a high score or a low score?
How can I find out which peptide has the Z Score of -70?

Thank you very much.
Haikuo
 
Nick Shulman responded:  2018-05-11 10:35
In your report, it looks like you have checked the "Pivot Replicate Name" checkbox, so that each replicate appears in a separate column. A value of #N/A means that there is no value available for that particular peptide in that particular replicate.
If you are looking for the lowest value of the Detection Z Score, you should probably uncheck the "Pivot Replicate Name" checkbox and then just sort on the Detection Z Score column.
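(If it helps, once the report is exported unpivoted, the lowest scores can also be found outside Excel. This is a hypothetical Python sketch; the function name and the "DetectionZScore" column header are my assumptions about what the exported report looks like.)

```python
import csv

def lowest_z_scores(path, n=10, column="DetectionZScore"):
    """Return the n report rows with the lowest scores, skipping #N/A."""
    with open(path, newline="") as f:
        rows = [r for r in csv.DictReader(f) if r[column] not in ("", "#N/A")]
    rows.sort(key=lambda r: float(r[column]))
    return rows[:n]
```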

You might want to take a look at the Skyline custom reports tutorial:
https://skyline.ms/wiki/home/software/Skyline/page.view?name=tutorial_custom_reports