Standard deviation in quant experiments julius fuersch  2019-10-16 01:55
 

Hi guys,

I have a very general question regarding error distribution in quant proteomics experiments. Often the standard deviation, and therefore the p-value, is calculated using log values of the extracted ion chromatogram areas. By using the log values instead of the original values you get a completely different distribution and a much lower relative standard error! Why do people use the log areas instead of the original values? In my very limited understanding of statistics, this artificially improves the test statistics and makes the results appear better than they are! I would be very happy about an explanation, because we are using an in-house developed cross-link quant software which also does it that way, and I would really like to understand the background!

Thanks a lot in advance

Julius

 
 
Nick Shulman responded:  2019-10-16 03:08
Julius,

It sounds like you might be asking about the Group Comparison feature in Skyline. That is the only place in Skyline that I know of where Skyline takes logarithms and calculates p-values.

With the Group Comparison feature, Skyline is trying to calculate the fold change between your two groups of replicates: that is, the amount that the observed values in one group need to be multiplied by in order to be equal to the observed values in the other group. The reason that Skyline operates on the logarithms in this case is so that the fold change can be calculated using a linear regression.

In order for Skyline to calculate a p-value in a Group Comparison, you need to have multiple replicates in both groups. The p-value is a measure of how much greater the variance between the two groups of replicates is compared to the variance within the groups.
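As a rough illustration of "variance between the groups compared to the variance within the groups", here is a minimal Python sketch that computes that ratio (an F statistic) on log2 values. The peak areas and the variable names are made up for illustration, and this is not claimed to be Skyline's actual code. Turning the statistic into a p-value requires the F (or Student's t) distribution, which is left out here.

```python
import math
from statistics import mean

# Hypothetical peak areas, three replicates per group (illustration only).
control = [4000.0, 5000.0, 5200.0]
treated = [9000.0, 10000.0, 12500.0]

groups = [[math.log2(v) for v in control], [math.log2(v) for v in treated]]
all_values = [y for g in groups for y in g]
grand_mean = mean(all_values)

# Between-group mean square: spread of the group means around the grand mean.
# With two groups there is 1 degree of freedom.
ms_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / (len(groups) - 1)

# Within-group mean square: spread of the replicates around their own group mean.
ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
ms_within = ss_within / (len(all_values) - len(groups))

f_statistic = ms_between / ms_within
print(f"between-group mean square: {ms_between:.4f}")
print(f"within-group mean square:  {ms_within:.4f}")
print(f"F statistic:               {f_statistic:.2f}")
```

The larger this ratio, the smaller the p-value. With only one replicate per group the within-group term cannot be computed, which is why multiple replicates are needed.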

You can learn more about the Group Comparison feature in this tutorial:
https://skyline.ms/wiki/home/software/Skyline/page.view?name=tutorial_grouped

I hope this helps,
-- Nick
 
julius fuersch responded:  2019-10-17 06:50
Hi Nick,

thanks a lot, this is exactly the case I am asking about. We are also calculating log2 ratios of peptides (in our case cross-linked peptides, but this doesn't matter for the calculation). That is, the average of three replicates vs. the average of three replicates = ratio. To do this you can use the original MS areas or you can take the log2 of the original MS areas. Therefore you can calculate a standard deviation based on the original values or on the log2 values; the same is true for the p-value calculation. But the relative standard deviation of the log2 values will always be much lower than that of the original values. However, the log2 ratio will be the same regardless of the point at which you take the log2. I have attached a tiny example calculation. I am confused about how I should judge the reproducibility of my dataset: based on which relative error?

Thanks a lot

Julius
 
Nick Shulman responded:  2019-10-17 07:41
Julius,

You should not divide your logarithms. When you want to calculate the ratio using logarithms, you are supposed to subtract one logarithm from the other.

So, with your data you got log2(group a) was 13.21 and log2(group b) was 12.13.
The fold change is 2 to the power of (13.21 - 12.13) which ends up being 2.27. That's close to the other answer (2.11) that you got without the logs.
I think that the difference between 2.27 and 2.11 comes from the difference between taking the average of the logarithms versus the logarithm of the average.
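To make the "average of the logarithms versus logarithm of the average" point concrete, here is a small Python sketch. The peak areas are made up (they are not the numbers from the attached spreadsheet), so the two results will not reproduce 2.27 and 2.11; the point is only that the two routes give different fold changes.

```python
import math
from statistics import mean

# Hypothetical peak areas for three replicates per group (illustration only).
group_a = [9000.0, 10000.0, 12500.0]
group_b = [4000.0, 5000.0, 5200.0]

# Route 1: divide the arithmetic means of the original values.
ratio_of_means = mean(group_a) / mean(group_b)

# Route 2: subtract the mean log2 values, then raise 2 to that difference.
# This is the ratio of the geometric means, not of the arithmetic means.
mean_log_a = mean([math.log2(v) for v in group_a])
mean_log_b = mean([math.log2(v) for v in group_b])
fold_change_from_logs = 2 ** (mean_log_a - mean_log_b)

print(f"ratio of means:               {ratio_of_means:.3f}")
print(f"2^(difference of mean logs):  {fold_change_from_logs:.3f}")
```

The two numbers agree only when the replicates within each group are identical; the more the replicates scatter, the further apart they can drift.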

That's not how Skyline calculates fold changes.

Skyline plots the logarithms of the observed values as y-values on a graph. The x-values are zero or one depending on whether the replicate is in the control group or not.
After you have done that, the slope ends up being the logarithm of the fold change between the two groups.
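Here is a minimal Python sketch of that regression, assuming made-up peak areas and a hand-written least-squares helper (fit_line is a hypothetical name, not a Skyline function). It shows that with x-values of zero and one, the slope is just the difference between the mean log2 values of the two groups.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical peak areas (illustration only).
control = [4000.0, 5000.0, 5200.0]
treated = [9000.0, 10000.0, 12500.0]

# x is 0 for control replicates and 1 for the other group; y is log2(area).
xs = [0.0] * len(control) + [1.0] * len(treated)
ys = [math.log2(v) for v in control + treated]

slope, intercept = fit_line(xs, ys)
fold_change = 2 ** slope

print(f"slope (log2 fold change): {slope:.4f}")
print(f"fold change:              {fold_change:.3f}")
```

The intercept is the mean log2 value of the control group, and the slope is how far the other group sits above it on the log2 scale.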

I have attached my spreadsheet where I use the Excel function "LINEST" to do the linear regression.
The Excel documentation says that there's some way to get the standard error of the slope out of the LINEST result, but I could not figure that out. (something involving the function called "Index").

You multiply the error in the slope by the T-distribution value from here:
https://en.wikipedia.org/wiki/Student%27s_t-distribution#Table_of_selected_values
and that gives you the confidence interval around the slope. If you raise two to the extremes of the slope confidence interval you end up with the confidence interval of the fold change.
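A sketch of that confidence interval calculation in Python, again with made-up peak areas. The value 2.776 is the two-sided 95% entry for 4 degrees of freedom in the table linked above (6 observations minus 2 fitted parameters); a different number of replicates or confidence level needs a different table entry. The variable names are only for illustration.

```python
import math

# Hypothetical peak areas (illustration only).
control = [4000.0, 5000.0, 5200.0]
treated = [9000.0, 10000.0, 12500.0]

xs = [0.0] * len(control) + [1.0] * len(treated)
ys = [math.log2(v) for v in control + treated]

# Least-squares fit of log2(area) against the 0/1 group indicator.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
intercept = y_bar - slope * x_bar

# Standard error of the slope from the residual variance.
dof = n - 2
residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
slope_se = math.sqrt(residual_ss / dof / sxx)

# Two-sided 95% t-value for 4 degrees of freedom, taken from a t-table.
T_95_DF4 = 2.776
slope_low = slope - T_95_DF4 * slope_se
slope_high = slope + T_95_DF4 * slope_se

# Raise two to the ends of the slope interval to get the fold change interval.
fold_change = 2 ** slope
fold_change_ci = (2 ** slope_low, 2 ** slope_high)

print(f"log2 fold change: {slope:.3f} ({slope_low:.3f} to {slope_high:.3f})")
print(f"fold change:      {fold_change:.3f} ({fold_change_ci[0]:.3f} to {fold_change_ci[1]:.3f})")
```

Note that the interval is symmetric around the slope on the log2 scale but not symmetric around the fold change after raising two to each end.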

-- Nick
 
julius fuersch responded:  2019-10-21 08:43
Hi Nick,

thanks again. Now I fully understand the calculation. But the most crucial point of my question is still unclear, and it relates to your sentence:

" I think that the difference between 2.27 and 2.11 comes from the difference between taking the average of the logarithms versus the logarithms of the average"

The relative error when taking the logarithm of the original intensity values is much smaller than the relative error of the non-log original intensities. That is: after taking the log of the original values, a dataset always appears to be very reproducible (very low error), but the error of the non-log intensities hints at a dataset of lower reproducibility.
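The observation can be reproduced with a few made-up numbers. The Python sketch below (hypothetical areas, hypothetical helper name cv) compares the relative standard deviation of the original values with that of the log2 values. It also rescales the areas by a factor of 1000, as if they were reported in different units: this leaves the relative standard deviation of the original values unchanged, but changes the one computed on the logs, because a constant factor only shifts the mean of the logs and not their spread.

```python
import math
from statistics import mean, stdev

def cv(values):
    """Relative standard deviation: sample SD divided by the mean."""
    return stdev(values) / mean(values)

# Hypothetical peak areas for three replicates (illustration only).
areas = [9000.0, 10000.0, 12500.0]
log_areas = [math.log2(v) for v in areas]

# The same measurements expressed in units that are 1000 times smaller.
scaled = [v * 1000 for v in areas]
log_scaled = [math.log2(v) for v in scaled]

print(f"relative SD, original values:       {cv(areas):.4f}")
print(f"relative SD, rescaled values:       {cv(scaled):.4f}")
print(f"relative SD, log2 values:           {cv(log_areas):.4f}")
print(f"relative SD, log2 of rescaled:      {cv(log_scaled):.4f}")
print(f"plain SD, log2 values:              {stdev(log_areas):.4f}")
print(f"plain SD, log2 of rescaled values:  {stdev(log_scaled):.4f}")
```

In this sketch, the quantity that stays the same under rescaling on the log scale is the plain standard deviation of the log2 values, not that standard deviation divided by the mean of the logs.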

So, the most important question is: Which is the correct way to calculate the relative error?

Thanks a lot again

Kind regards

Julius
 
Nick Shulman responded:  2019-10-29 10:46
If you want to calculate the fold change of an observation between two groups, my understanding is the best way to do this is to plot them all on a graph, where the Y value is the logarithm of the observed values, and the X value is zero or 1 depending on which group the observation belongs to.
Then, the slope of that linear regression is the logarithm of the fold change between the two groups.

Whenever you do a linear regression, there is a standard way to calculate the confidence interval in the slope, which is then the confidence interval around the logarithm of the fold change.

I am not sure why doing this linear regression is better than just dividing the averages of the two groups. The statisticians who developed MSstats told me this is the best way to calculate the ratio of the observations of two groups.

You could certainly calculate the fold change by dividing the mean of one group by another. In that case, you might have a confidence interval around the numerator and a different confidence interval around the denominator. Then, if you wanted to figure out the confidence interval around the ratio, you would apply the formula for propagating errors when dividing.
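A minimal Python sketch of that propagation, assuming made-up peak areas, independent errors in the two groups, and relative errors small enough for the usual first-order formula (relative errors add in quadrature when dividing). The standard error of the mean is used as the error of each group; the helper name sem is only for illustration.

```python
import math
from statistics import mean, stdev

def sem(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return stdev(values) / math.sqrt(len(values))

# Hypothetical peak areas for three replicates per group (illustration only).
group_a = [9000.0, 10000.0, 12500.0]
group_b = [4000.0, 5000.0, 5200.0]

ratio = mean(group_a) / mean(group_b)

# Relative error of the numerator and of the denominator.
rel_a = sem(group_a) / mean(group_a)
rel_b = sem(group_b) / mean(group_b)

# For a quotient of independent quantities, relative errors add in quadrature.
rel_ratio = math.sqrt(rel_a ** 2 + rel_b ** 2)
ratio_se = ratio * rel_ratio

print(f"fold change (ratio of means): {ratio:.3f}")
print(f"relative error of the ratio:  {rel_ratio:.3f}")
print(f"absolute error of the ratio:  {ratio_se:.3f}")
```

To turn this error into a confidence interval you would still need to multiply it by an appropriate t-distribution value, as with the slope of the regression.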

Here's a document which describes how to propagate errors when you are doing complicated calculations such as division etc.
http://ipl.physics.harvard.edu/wp-uploads/2013/03/PS3_Error_Propagation_sp13.pdf

-- Nick