Q-value

support
Q-value Tobi  2019-10-22 06:39
 

Dear Skyline Team,

filtering for 1% FDR by mProphet model is easy in Skyline by filtering for Detetction Q value <= 0.01. However, we have some issues with that and would like to export data of target and decoy peptides, filter the data ourselves and then perform the 1 % FDR cutoff.

What would be the most feasable way to perform the 1% FDR filtering (Q value calculation) outside of skyline with only z-scores available? Is it a simple formular or more complex? In case it requires complex code, could you please link it? (Noncoder here and I already checked the tutorials and cited paper but did not find q value mentioned seperately)

A short comment is highly appreciated.

With best regards,
tobi

 
 
Brendan MacLean responded:  2019-10-22 09:08

Hi Tobi,
It is non-trivial, but it starts with pnorm(z-score) to convert the z-score to a p value. Then we use the Storey-Tibshirani, 2003 algorithm (https://www.pnas.org/content/100/16/9440) to convert the single hypothesis test p values into q values (a.k.a. adjusted p values, or FDR estimates).

You don't even need the decoys at this point. They have exhausted their usefulness during the model training (for training the relative weights of the scoring function coefficients and then estimating scaling and bias to make the final scoring function produce z scores based on the assumption of a normally distributed decoy score population).

Unless that is you want to pursue some more complex method of estimating p values than pnorm(z-score), like:

p value raw = countDecoys(z-score)/countDecoysTotal
p value = p value raw != 0 ? p value raw : Math.Min(1/countDecoysTotal, pnorm(z-score))

This is an alternative implementation, which I believe is at least close to what Percolator uses. It is what is called non-parametric p value estimation (i.e. it doesn't rely on the assumption that the decoy scoring population is normally distributed). People using the non-parametric approach initially found, however, that it had a tendency to estimate a pvalue of zero too quickly (what I call above p value raw) because it becomes zero for any target score that is greater than all seen decoy scores. So, it is still best to decide on some parametric distribution to use as back-up beyond the end of the sampled decoy scoring distribution.

Implementing Storey-Tibshirani is non-trivial, but most of the code I have seen to do it doesn't actually fit a spline to estimate pi zero at p value = 1.0. If you use a fixed "lambda" (Skyline uses 0.4 - the original mProphet used 0.5 - https://static-content.springer.com/esm/art%3A10.1038%2Fnmeth.1584/MediaObjects/41592_2011_BFnmeth1584_MOESM181_ESM.pdf#page=39) then PiZero is not that hard to calculate (from Skyline's Statistics class below):

    /// <summary>
    /// Returns a single point Pi-zero (proportion of features that are truly null)
    /// estimation, given a specific p value cut-off, where values in the list are
    /// assumed to be well behaved (randomly distributed for false-postives) p values.
    /// See Storey and Tibshirani 2003
    /// </summary>
    /// <param name="lambda">P value cut-off</param>
    /// <returns>Pi-zero estimation</returns>
    public double PiZero(double lambda)
    {
        int aboveLambda = 0;
        for (int i = 0; i < _list.Length; i++)
        {
            if (_list[i] > lambda)
                aboveLambda++;
        }
        return aboveLambda/(_list.Length*(1 - lambda));
    }

Once you have a PiZero (proportion of features which are truly null) then you calculate q values on your set of p values like this:

    public double[] QvaluesNorm(Statistics s)
    {
        return new Statistics(PvaluesNorm(s)).Qvalues();
    }

    /// <summary>
    /// Returns q values for a set of p values, optionally using a p value cut-off
    /// used for calculating Pi-zero.
    /// See Storey and Tibshirani 2003
    /// CONSIDER: Use spline fitting described in the paper instead of a supplied
    ///           cut-off.
    /// </summary>
    /// <returns>New statistics containing q values for the p values in this set</returns>
    public double[] Qvalues(double? lambda = null, double minPiZero = 0)
    {
        double pi0 = Math.Max(PiZero(lambda), minPiZero);
        var ordered = OrderedIndexedList();
        int n = _list.Length;
        var qlist = new double[n];
        for (int i = 0; i < n; i++)
        {
            double pVal = ordered[i].Key;
            int iOrig = ordered[i].Value;
            double qVal = pi0*n*pVal / (i + 1);
            qlist[iOrig] = qVal;
        }
        // Enforce that q values are monotonically increasing and never greater than 1
        double last = 1.0;
        for (int i = n - 1; i >= 0; i--)
        {
            int iOrig = ordered[i].Value;
            double qVal = Math.Min(qlist[iOrig], last);
            qlist[iOrig] = qVal;
            last = qVal;
        }
        return qlist;
    }

Sorry to get a bit code-heavy at the end here. Ignore the code if you want and return to the base citation of Storey-Tibshirani, but this is how it gets implemented in practice and inside Skyline.

Good luck. Can you provide anymore detail on why you would want to attempt this outside Skyline based on our z scores? Seems highly prone to error to me, unless you have the time to validate your implementation as much as existing implementations in Skyline, Percolator, PyProphet and Spectronaut have been validated.

--Brendan

 
Tobi responded:  2019-10-22 10:18

Thank you very much for the fast and wholesome reply, it helps a lot and also the code does not hurt at all.

One of the issues I see is that peptides where "remove peak" was applied result in a dotp=0 as expected but still show very high z-scores and low q-values contrasting expectations. Will need some time to play around this and in case it will look like a potential bug then I will send you a testfile, but at first I want to take a look at user errors.

Thank you and best regards,
tobi