The following document describes the normalization procedure employed in Scaffold Q+S.
Raw intensities are acquired from spectra and purity corrected as appropriate. The data are then log2 transformed. Within each MS-Sample, missing values are imputed as the larger of two quantities: the minimum positive logged intensity acquired, or the value whose z-score is -4 in the distribution of all logged values in the MS-Sample. This imputed value is applied to all intensities that have a raw value of zero or fall below the "Minimum Dynamic Range".
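The imputation rule can be illustrated with a minimal sketch. This is not Scaffold's code: the function name, the treatment of "Minimum Dynamic Range" as a raw-intensity threshold, the use of the population standard deviation, and the reading of "minimum positive logged intensity" as the smallest log2 value among observed intensities are all assumptions made for illustration.

```python
import math
import statistics

def impute_missing_log2(raw_intensities, min_dynamic_range=0.0):
    """Log2-transform one MS-Sample and impute missing values (hypothetical helper)."""
    def is_observed(x):
        # Assumption: zero or below-threshold raw values count as missing.
        return x > 0 and x >= min_dynamic_range

    observed = [math.log2(x) for x in raw_intensities if is_observed(x)]
    if not observed:
        raise ValueError("no observed intensities in this MS-Sample")

    mean = statistics.fmean(observed)
    sd = statistics.pstdev(observed)   # assumption: population SD
    z_floor = mean - 4.0 * sd          # value whose z-score is -4
    fill = max(min(observed), z_floor) # the larger of the two candidates

    return [math.log2(x) if is_observed(x) else fill for x in raw_intensities]
```

For raw values 0, 8, 16 and 32, the observed logs are 3, 4 and 5, the z = -4 value is below 3, and so the zero is replaced by 3.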
Removal of Spectra from Quantitation
A spectrum can be removed for any of the following reasons:
- In precursor intensity quantitation, only one spectrum is picked to represent the precursor.
- The spectrum has no quantitative data.
- The spectrum is not exclusive to a single protein, unless the user has selected to use non-exclusive spectra.
- A spectrum quality filter has been applied. For example, the "Reference Value Required" filter removes spectra after normalization if their reference value is marked as a missing value.
Iterative Median Polish
A version of Tukey's Median Polish is applied iteratively to normalize the data. In the following steps, all calculations are performed on log2 transformed data and "average" denotes either median or mean, depending on the mode selected by the user. Scaffold Q+S uses medians by default. There are four steps to the normalization process.
- Inter-Sample Normalization: A normalization factor consisting of the global average minus the within-MS-Sample average is added to each quantitative value.
- Intra-Sample Normalization: A normalization factor consisting of the within-MS-Sample average minus the within-channel average is added to each quantitative value.
- Peptide/Spectrum Normalization: For each protein, the averages across all quantitative values in each spectrum are brought into alignment by adding the per-spectrum normalization factor. This factor is the average of these averages minus the particular spectrum's average.
- Intensity Weighting: A weight is applied to each spectrum based on a t-statistic derived from percent deviations from channel averages.
Steps one through four are repeated three times.
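As an illustration of the third step, the sketch below aligns the spectrum averages within a single protein. The function name and the list-of-lists layout are hypothetical, and an unweighted median is assumed as the "average"; the weighting and the three-fold repetition are left out.

```python
from statistics import median

def normalize_spectra_within_protein(spectra, average=median):
    """Align per-spectrum averages for one protein (hypothetical helper).

    spectra: list of spectra, each a list of log2 channel values.
    """
    # Average across all quantitative values in each spectrum.
    spectrum_avgs = [average(s) for s in spectra]
    # The average of these averages is the alignment target.
    target = average(spectrum_avgs)
    # Per-spectrum factor = target minus that spectrum's own average.
    return [
        [v + (target - avg) for v in s]
        for s, avg in zip(spectra, spectrum_avgs)
    ]
```

After this step every spectrum of the protein has the same average, while the differences between channels within a spectrum are unchanged.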
A standard deviation estimate, based on smoothed within-protein deviations of spectral quantitative values from averages binned by total intensity, is derived for each spectrum. The weight of each spectrum is divided by the total number of spectra matched to its peptide, providing a form of intermediate peptide-level averaging when subsequently computing protein-level quantitative values.
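The division of spectrum weights by the number of spectra per peptide can be sketched as follows. The function name and the (peptide, weight) tuple layout are assumptions for illustration; how the initial weights are derived is not shown.

```python
from collections import Counter

def peptide_adjusted_weights(spectra):
    """Divide each spectrum's weight by its peptide's spectrum count.

    spectra: list of (peptide, weight) tuples (hypothetical layout).
    """
    counts = Counter(peptide for peptide, _ in spectra)
    return [(peptide, weight / counts[peptide]) for peptide, weight in spectra]
```

With equal starting weights, a peptide matched by two spectra contributes the same total weight as a peptide matched by one, which is the peptide-level averaging effect described above.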
When dealing with SILAC data, as opposed to iTRAQ or TMT data, Scaffold employs ratio-based normalization. The procedure is similar to that described above, with a few key differences. Note that only means can be used for ratio-based normalization, and the "Individual Spectrum Reference" reference type is automatically applied.
First, missing values are imputed as above. Then, the ratio of each quantitative value to its reference is calculated. These ratios are log2 transformed and, for each quantitative sample, the average log-ratio is determined.
The average log2 ratio is then subtracted from each log2 ratio in the quantitative sample, with the result that the average log2 ratio for each quantitative sample is zero. This allows quantitative samples from different MS-Samples to be combined without further normalization.
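A minimal sketch of the ratio-based procedure for a single quantitative sample follows. The function name and the (value, reference) pair layout are hypothetical, and the input is assumed to be already imputed so that every intensity is positive.

```python
import math
from statistics import fmean

def ratio_normalize(pairs):
    """Center the log2 ratios of one quantitative sample on zero.

    pairs: list of (value, reference) raw intensities, all positive
    (hypothetical layout; imputation is assumed to have been done).
    """
    # Ratio of each quantitative value to its reference, in log2 space.
    log_ratios = [math.log2(value / reference) for value, reference in pairs]
    # Only means are used for ratio-based normalization.
    avg = fmean(log_ratios)
    # Shift so that the sample's average log2 ratio becomes zero.
    return [lr - avg for lr in log_ratios]
```

Because every quantitative sample ends up with a mean log2 ratio of zero, samples treated this way share a common center.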
Raw intensity values are stored as multiplexes. For example, there are four channels per spectrum in an iTRAQ 4plex experiment. These spectra are contained in MS-Samples. Note that all calculations are in log space, so each value is really log2(intensity), and that "average" here can be either the weighted median or the mean.
For Inter-sample Normalization
Log2(intensity) values are adjusted by adding (the average of all quantitative values in all MS-Samples and all channels minus the average of all quantitative values in the same MS-Sample across all channels).
For Intra-sample Normalization
Log2(intensity) values are adjusted by adding (the average of all quantitative values in the same MS-Sample across all channels minus the average of all quantitative values in the same MS-Sample in the same channel).
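The two adjustments can be sketched in a few lines. This is an illustration only: the function names and the (ms_sample, channel, log2_intensity) tuple layout are assumptions, and an unweighted median stands in for the weighted median.

```python
from statistics import median

def _group_average(values, key, average):
    """Average the log2 values in each group defined by key(ms_sample, channel)."""
    groups = {}
    for ms_sample, channel, v in values:
        groups.setdefault(key(ms_sample, channel), []).append(v)
    return {k: average(vs) for k, vs in groups.items()}

def inter_sample_normalize(values, average=median):
    """Add (global average - MS-Sample average) to each value."""
    global_avg = average([v for _, _, v in values])
    sample_avg = _group_average(values, lambda s, c: s, average)
    return [(s, c, v + (global_avg - sample_avg[s])) for s, c, v in values]

def intra_sample_normalize(values, average=median):
    """Add (MS-Sample average - channel average within that MS-Sample)."""
    sample_avg = _group_average(values, lambda s, c: s, average)
    channel_avg = _group_average(values, lambda s, c: (s, c), average)
    return [
        (s, c, v + (sample_avg[s] - channel_avg[(s, c)]))
        for s, c, v in values
    ]
```

Applying the inter-sample step first brings every MS-Sample to the global average; the intra-sample step then brings each channel to its MS-Sample's average.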
Note that since logb(x) = C * log2(x), where C = 1/log2(b), changing the base of the logarithm would simply scale all log2(intensity) values and normalization factors by the constant C, and the effect of the normalization would be the same.