Scaffold Scoring Models and Filtered Data

Scaffold offers two options for generating probabilities from search engine scores: LFDR and PeptideProphet. This document briefly explains both.

LFDR (Local False Discovery Rate)

In this method, peptide identifications are validated by discriminant scoring using a Naïve Bayes classifier generated through iterative rounds of training and validation to optimize training data set choices. Peptide probabilities are assessed using a Bayesian approach to local FDR (LFDR) estimation. Rather than just using mass accuracy as a term in discriminant score training, peptide probabilities are modified by likelihoods calculated from parent ion delta masses.
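As a rough illustration of local FDR estimation, the sketch below bins discriminant scores and takes the ratio of decoy to target hits in each bin. It is not Scaffold's implementation: the function name, the fixed bin edges, and the assumption that target and decoy databases are the same size (so that decoy counts approximate the number of incorrect target hits) are all illustrative choices.

```python
from bisect import bisect_right

def local_fdr_by_bin(target_scores, decoy_scores, edges):
    """Estimate the local FDR in each score bin as decoys / targets, capped at 1.

    Hypothetical helper for illustration only; assumes equal-sized target and
    decoy databases.
    """
    n_bins = len(edges) - 1
    targets = [0] * n_bins
    decoys = [0] * n_bins
    for scores, counts in ((target_scores, targets), (decoy_scores, decoys)):
        for s in scores:
            i = bisect_right(edges, s) - 1
            i = min(max(i, 0), n_bins - 1)  # clamp out-of-range scores to the end bins
            counts[i] += 1
    # A bin with no target hits carries no evidence, so treat it as LFDR = 1.
    return [1.0 if t == 0 else min(1.0, d / t) for t, d in zip(targets, decoys)]
```

Under these assumptions, a peptide probability for a hit falling in a given bin would be one minus that bin's local FDR.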

Like other scoring methods, LFDR incorporates multiple scores when they are reported by a search engine. Instead of PeptideProphet’s LDA or Percolator’s SVM classifier, LFDR uses log-likelihood ratios generated by Naïve Bayes classifiers to discriminate between target and decoy hits. Naïve Bayes was chosen specifically for robustness to over-fitting, a frequently occurring problem when training a classifier on a subset of testing data.
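To make the idea of a Naïve Bayes log-likelihood ratio concrete, here is a minimal sketch that combines several search engine scores into one discriminant value. The Gaussian class-conditional densities and all function names are assumptions made for this example; the article does not state which densities Scaffold uses or how its training rounds are run.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def fit_gaussians(rows):
    """Fit a (mean, std) pair per feature column from equal-length score rows."""
    params = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        params.append((mean, max(math.sqrt(var), 1e-6)))  # floor std to avoid division by zero
    return params

def log_likelihood_ratio(features, target_params, decoy_params):
    """Sum per-feature log ratios; summing is the 'naive' independence assumption."""
    return sum(
        gaussian_log_pdf(x, *tp) - gaussian_log_pdf(x, *dp)
        for x, tp, dp in zip(features, target_params, decoy_params)
    )
```

A positive value means the scores look more like the target training hits than the decoy hits. Because each feature is modelled separately with only a few parameters, there is little room for the classifier to over-fit.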

PeptideProphet

When using PeptideProphet, Scaffold determines the distributions of the scores assigned by a search engine such as SEQUEST, Mascot, or MaxQuant. These distributions depend on the size of the database searched and on the specific characteristics of the analyzed sample; see Keller (2002). From these distributions, Scaffold translates the search engine scores into probabilities that a given identification is correct. These probabilities can then be used as threshold filters, allowing the identifications to be viewed at various confidence levels.
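The translation from score to probability can be sketched with Bayes' rule over a two-component mixture of correct and incorrect identifications. This is a simplified illustration, not Scaffold's code: both components are modelled as normal distributions here, and the default means, standard deviations, and the `prior_correct` parameter are hypothetical values that a real implementation would fit to each sample.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def probability_correct(score, prior_correct, correct=(4.0, 1.0), incorrect=(1.0, 1.0)):
    """Posterior probability that an identification with this score is correct.

    `correct` and `incorrect` are (mean, std) pairs for the two score
    distributions; the defaults are made-up illustrative values.
    """
    num = prior_correct * normal_pdf(score, *correct)
    den = num + (1.0 - prior_correct) * normal_pdf(score, *incorrect)
    return num / den
```

With these illustrative distributions, a score of 2.5 is equally likely under both components, so the result equals the prior: about 0.5 when half of the identifications in the sample are correct, but only about 0.1 when one in ten is. This is why the same search engine score can map to different probabilities in different samples.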

Scaffold’s method contrasts with SEQUEST’s, which uses an XCorr cut-off that depends on neither database size nor sample characteristics, frequently requiring ad hoc corrections for these parameters. Scaffold’s statistical approach yields more reliable estimates of the probability of a correct identification.

Scaffold’s method also supplements Mascot’s. Mascot provides a probability estimate based on database size, but not on sample characteristics. By incorporating the sample-specific distribution, Scaffold provides better estimates of the probability of a correct identification.

Note: if you select the LFDR option but the data contains no decoys, Scaffold automatically falls back to PeptideProphet, which does not require decoys. PeptideProphet is nevertheless compatible with decoy searches: if you load data with decoys and explicitly select PeptideProphet, the decoys will be taken into consideration. Finally, most data produced with today's mass spectrometers should be loaded using the PeptideProphet option for high mass accuracy instruments.

ProteinProphet

ProteinProphet groups the peptides by their corresponding protein(s) to compute probabilities that those proteins were present in the original sample, see Nesvizhskii (2003).

In Scaffold 4, modified weights for protein probability calculations are used in the ProteinProphet algorithm to model peptide assignments more accurately. The Similarity View has been modified to reflect these changes: it reports the peptide weights used, as percentages, when the user chooses to group the data using the clustering algorithm.
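The core of a ProteinProphet-style calculation can be sketched as the probability that at least one of a protein's peptide identifications is correct, with each peptide scaled by the weight with which it is assigned to that protein. The function below is a simplified illustration under that assumption; it leaves out other parts of the published algorithm and does not reproduce Scaffold 4's modified weights.

```python
def protein_probability(peptide_probs, weights=None):
    """Probability that a protein is present: 1 - prod(1 - w_i * p_i).

    `weights` are the shares of each peptide assigned to this protein
    (1.0 for a peptide unique to it). Hypothetical helper for illustration.
    """
    if weights is None:
        weights = [1.0] * len(peptide_probs)
    p_absent = 1.0
    for p, w in zip(peptide_probs, weights):
        p_absent *= 1.0 - w * p  # chance this peptide gives no support for the protein
    return 1.0 - p_absent
```

For example, two unique peptides with probabilities 0.9 and 0.8 give a protein probability of about 0.98, while halving the weight of the second peptide (a peptide shared with another protein) lowers it to about 0.94.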

Loading Data into Scaffold

Note that both peptide scoring models, LFDR and PeptideProphet, use the identified PSMs to generate probabilities. The more PSMs that are fed into the scoring model, the better the Scaffold results will be, both in terms of the proteins identified and the probabilities assigned. Search engines often export filtered data because, if you are not running additional probability models, that is what you want to see. This can, however, cause problems when the data is loaded into Scaffold. The solution is to make sure that the search engine results are not filtered. In MaxQuant, this means setting the peptide and protein FDRs to 1, which ensures that all of the PSMs, whether good, bad, or decoy, are fed into the chosen Scaffold probability model. If you do not have access to unfiltered data, we recommend loading the data using Prefiltered mode. This allows you to use Scaffold to view and analyze the data, but it bypasses the probability models.

Proteome Discoverer (PD) is the exception here. In Scaffold 4.11, we introduced the ability to read probabilities assigned by Percolator. These probabilities are automatically read into Scaffold when PD data is loaded using Prefiltered mode. When loading PD data, we recommend using the Percolator node with FDR thresholds set to a reasonable value, then loading the data using Prefiltered mode. If the Prefiltered mode option is greyed out, it is likely because condensing is turned on. Disable condensing and Prefiltered mode should become available.
