Issue 861: Implement better peptide to protein assignment strategies for target selection

Assigned To: Matt Chambers
Opened: 2022-01-09 by Brendan MacLean
Changed: 2022-01-10 by Brendan MacLean
2022-01-09 Brendan MacLean
Title: Implement better peptide to protein assignment strategies for target selection
Assigned To:
Notify: Mike MacCoss; ben.collins
Here is an attempt to capture my thoughts and what I have heard about protein grouping as it would apply to populating our Targets list when presented with a FASTA file and a spectral library.

Currently, we essentially have 3 less-than-ideal options:
1. Exclude all non-unique peptides (where the definition of non-unique may be either across FASTA sequences or across source genes)
2. Keep a unique copy of each detected peptide, but assign it to the first FASTA sequence in which it is seen
3. Allow peptide duplication and assign peptides to every FASTA sequence in which they are seen
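The three current options can be sketched as a single assignment function. This is an illustrative sketch only, not Skyline's actual code; the function and strategy names are made up for this example.

```python
def assign_peptides(fasta, detected, strategy):
    """Assign detected peptides to FASTA sequences under one of the three
    current strategies. `fasta` maps protein name -> set of peptide sequences
    it contains (in FASTA file order); `detected` is the set of detected
    peptides. Names here are hypothetical, not Skyline's API."""
    # Count how many proteins contain each detected peptide.
    counts = {}
    for peptides in fasta.values():
        for pep in peptides & detected:
            counts[pep] = counts.get(pep, 0) + 1

    result = {name: [] for name in fasta}
    seen = set()
    for name, peptides in fasta.items():
        for pep in sorted(peptides & detected):
            if strategy == "exclude_shared" and counts[pep] > 1:
                continue            # option 1: drop non-unique peptides
            if strategy == "first_sequence":
                if pep in seen:
                    continue        # option 2: keep only the first occurrence
                seen.add(pep)
            result[name].append(pep)  # option 3 ("duplicate"): no filtering
    return result
```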

As far as I know, there are at least 3 other options commonly in use that we have not captured:
1. Razor peptide - MaxQuant - similar to #2 above, but assigning each peptide to the protein with the most other matching peptides (how are ties broken?)
2. Protein grouping by parsimony (see attached PDF provided by Mike, written by Vagisha)
3. What ETH used to do: unique peptides assigned to their proteins, and non-unique peptides assigned to separate protein groups comprising all the proteins they appear in. (I am not sure about this last one, but maybe Ben Collins on the Notify list can chime in on whether he thinks it is worth replicating in Skyline.)

I think the first places to implement this are:
1. File > Import > Peptide Search
2. File > Import > FASTA
3. View > Library Spectra > Add All with Associate proteins

In #1 above, we already show a form for choosing among the 3 currently supported options. It may be simplest to adapt that form to allow for the new options we want to support, and then add a similar choice form to the flow of #2 and #3 above.

We will also need ways to support these protein grouping strategies during "refinement", when the user decides to remove peptides for any reason, e.g. after a large-scale search they want to apply a detection cut-off. But that will be for another issue post. Here we will start with the case where an upstream tool tells us which peptides were detected.
 MSDaPl - Help Topics.pdf

2022-01-10 Matt Chambers
Thanks for summarizing this, Brendan.

Are you aware of a significant difference between parsimony options #1 and #2? AFAIK, protein parsimony should be synonymous with applying Occam's razor to the protein list; the only difference I can think of is the set cover implementation. The PDF Mike provided doesn't say which implementation MSDaPl uses to calculate the parsimonious set: is it a greedy algorithm, or does it brute-force the minimum covering set (set cover is an NP-hard problem, so brute force would be rather slow for large protein/peptide sets)?

From my reading of the PDF, IDPicker behaves very similarly to MSDaPl except when combining datasets (IDPicker recalculates the parsimonious set whenever data is added or removed). How would you see Skyline handling that, when new FASTA files or searches are added to an existing document? IDPicker actually keeps all proteins in the background so it can recalculate the parsimonious set from the full set; if Skyline entirely forgets the unparsimonious protein-peptide pairs, that recalculation becomes impossible. And I don't see a way to do MSDaPl's per-dataset parsimony without a more structured organization of results in Skyline (i.e. which results are technical replicates of each other, which are fractions of the same experiment, etc.).
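For reference, the greedy set-cover heuristic mentioned above can be sketched in a few lines: repeatedly pick the protein that explains the most still-unexplained peptides. This is a generic sketch of the heuristic, not MSDaPl's or IDPicker's actual implementation.

```python
def greedy_parsimony(protein_to_peptides):
    """Greedy approximation of the minimum protein set covering all peptides.
    Exact set cover is NP-hard, so parsimony tools typically use this greedy
    heuristic instead of brute force. Illustrative sketch only."""
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Pick the protein covering the most uncovered peptides;
        # break ties by protein name for determinism.
        best = max(protein_to_peptides,
                   key=lambda p: (len(protein_to_peptides[p] & uncovered), p))
        gained = protein_to_peptides[best] & uncovered
        if not gained:
            break
        chosen.append(best)
        uncovered -= gained
    return chosen
```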

From our earlier discussions, I don't remember whether you wanted to see this feature modify the document model or simply use the existing free-text (?) peptide-node-parent to describe the protein group constituents. In any case, the parsimony algorithm should be decoupled from the presentation, so I should be able to work on it while we discuss what model changes to make and how.

2022-01-10 Brendan MacLean
Here is a page with some information about MaxQuant, which is where Mike pointed to with regard to the "razor peptide" concept.

Probably my initial summary was too simplistic.

I was thinking that we would need to create a new subclass of PeptideGroup that contains a list of FastaSequence nodes. Ideally, we would preserve as much information as possible in the document data structure and not just resort to the free text peptide list version of PeptideGroup.
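The shape of that PeptideGroup subclass might look roughly like the following. This is a hypothetical Python sketch of the idea; the names mirror Skyline concepts (FastaSequence, PeptideGroup) but the actual C# classes and their members will differ.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FastaSequence:
    """A single FASTA entry: name plus amino-acid sequence."""
    name: str
    sequence: str

@dataclass
class ProteinGroup:
    """Hypothetical PeptideGroup subclass holding multiple FastaSequence
    members, preserving full sequence information instead of collapsing
    the group to a free-text peptide list."""
    members: List[FastaSequence]
    peptides: List[str] = field(default_factory=list)

    @property
    def label(self) -> str:
        # A combined display name for the Targets tree.
        return " / ".join(m.name for m in self.members)
```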

Skyline does support a background proteome, which is expected to contain complete information about the protein sequences that may be found in the sample, similar to what you describe for IDPicker. I think it would be good to make it much more common to end up with a background proteome when you import a FASTA file. For the case we are discussing, things would get much easier in the future if the user were guided to store a background proteome DB in the process of setting up their document.

2022-01-10 Matt Chambers
From that link: "MQ will always generate the shortest proteinGroup list that is sufficient to explain all peptide IDs" - this almost certainly means they're doing the same parsimony that IDPicker and MSDaPl do. According to the description there, "razor peptides" describe special handling for peptides shared by multiple protein groups where the groups are not subsets of each other:
ProteinGroup -- Peptides
A -- 1, 2, 3, 4
B -- 2, 3, 5

Peptides 1, 4, and 5 are unique; 2 and 3 are razor peptides for A but are still included in protein group B, and B is retained because it has the unique peptide 5. (Unique peptides are also razor peptides.)
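That razor rule, as described above, can be sketched as: each shared peptide's "razor home" is the containing group with the most peptides, while the peptide stays listed under the other groups too. A hypothetical sketch, not MaxQuant's code; ties here simply fall to the first group encountered.

```python
def razor_assign(groups):
    """Map each peptide to its razor group: the containing protein group with
    the most peptides. `groups` maps group name -> set of peptides.
    Illustrative sketch of the MaxQuant-style razor rule described above."""
    # Record which groups contain each peptide.
    owners = {}
    for name, peps in groups.items():
        for p in peps:
            owners.setdefault(p, []).append(name)
    # Razor home = owning group with the largest peptide count
    # (first-seen group wins ties).
    return {p: max(names, key=lambda n: len(groups[n]))
            for p, names in owners.items()}
```

Applied to the A/B example above, peptides 2 and 3 go to A (4 peptides vs. 3), while 5 stays with B.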

The background proteome seems like it could work, but it will need to be updated to store which peptides were observed in the sample before filtering. Can't multiple documents use the same background proteome DB? How would that work?

2022-01-10 Brendan MacLean
Hmmmm... Let's talk about that. It seems you are saying this will add a new meta-level. Today a .protdb is intended to store everything that may hypothetically be in the sample, while this functionality requires knowledge of what is actually detectable in the sample (below a certain FDR), and that again may be different from what is in the targets list.