The strength of the Affymetrix system is that multiple distinct oligonucleotide probes on each chip represent every gene. However, the signals from different probes for the same gene can differ, on the same chip, by as much as two orders of magnitude (a factor of 100); see Figures 1 and 2. The probe sequences are different, so the probes have different hybridization constants for their target: the most important factor in signal intensity is C:G content. How do we combine the signals from the many probes for a gene into a single estimate of the abundance of that gene?
Figure 1. Images of probes from human GAPDH probe set extracted from an Affymetrix U95A chip image. PM probes in top row; corresponding MM probes on bottom. Two probes are bright, three others are moderately bright, the rest are dim.
Figure 2. Line representation of intensities from a typical probe set in the mouse chip. PM values appear in blue; MM in green. Vertical axis height represents 30,000. Pale blue lines represent standard deviations of probes across chips. Image from dChip.
There has been considerable discussion over the appropriate algorithm for constructing single expression estimates based on multiple-probe hybridization data. To date, over a dozen different methods have been published, which aim to synthesize the different readings from the various probes for a gene, into a single estimate of transcript abundance. Affymetrix recently sponsored a conference on the topic.
Affymetrix has upgraded their Microarray Suite (MAS) software several times over the short history of their product. MAS 4 was the standard until January 2002 and is still the most commonly cited measure in published papers. The simplest way to get one number from several numbers is to take an average. MAS 4 calculates a robust average of the differences (PM – MM) over the probe pairs representing a gene. The more recent MAS 5 improves on this in three ways: first, the difference is taken between PM and an estimate of background based on MM (rather than MM itself); second, the intensities are transformed to a logarithmic scale before the average is taken; third, the average is a more sophisticated robust mean (Tukey biweight).
MAS 5.0 computes a local background in each of 16 squares on the chip, and then subtracts a weighted combination of these background estimates at each probe. For each probe set, it computes a robust average of the log probe-pair ratios, log2(PMj/MMj); call this SB. It then adjusts each probe pair as follows: if MMj < PMj, then log2(PMj/MMj) is used; if MMj > PMj, then log2(PMj) – SB is used, unless SB is too small. See the “Statistical Algorithm Description Document” from Affymetrix for more details.
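The robust average of log ratios described above can be sketched as a one-step Tukey biweight. This is only an illustration, not the Affymetrix implementation: the function name, the tuning constant c = 5, the small constant eps, and the example PM/MM ratios are all assumptions made for this sketch.

```python
import numpy as np

def tukey_biweight(x, c=5.0, eps=1e-4):
    """One-step Tukey biweight average (c and eps are assumed tuning values)."""
    x = np.asarray(x, dtype=float)
    center = np.median(x)
    mad = np.median(np.abs(x - center))       # median absolute deviation
    u = (x - center) / (c * mad + eps)        # scaled distance from the median
    # Points far from the median (|u| >= 1) get zero weight.
    w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)
    return float(np.sum(w * x) / np.sum(w))

# Invented PM/MM ratios for one probe set; the last probe pair is an outlier.
log_ratios = np.log2([1.8, 2.1, 2.0, 1.9, 2.2, 40.0])
sb = tukey_biweight(log_ratios)               # robust average, "SB" in the text
plain_mean = float(np.mean(log_ratios))       # ordinary average, for comparison
print(sb, plain_mean)
```

The outlying probe pair pulls the ordinary mean well away from the other five values, while the biweight gives it zero weight.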
The idea of averaging different probe intensities for the same gene seems quite wrong. It is like averaging the angular height of a building seen from different vantage points; or measuring a person’s height in inches, feet, cm, ells, furlongs, and meters, and taking the average; or averaging the readings from scans taken at very different settings. A second failing is that there is no 'learning' about probe characteristics, based on the performance of each probe across chips.
A chemical motivation for multi-chip models comes from reasoning that the amount of signal from one probe in a gene’s probe set, should depend both on the amount of that gene in the sample, and on the specific affinity of the probe for that gene’s mRNA. The statistical motivation for multi-chip models is observing that the signals from individual probes move in parallel across a set of chips (this is clearer with the better normalizations). See Figure 3. Another way to see this is to watch the animations of probe sets in dChip.
Figure 3. Probe signals from a spike-in experiment. The concentrations are plotted along the horizontal axis (log scale), and the probe signals are plotted on the vertical axis (log scale). Each probe is represented by one color. The different probe signals change in parallel. Image courtesy of Terry Speed
We want a statistical model that estimates both factors: probe affinity and gene abundance. Statisticians like linear two-factor models, in which the errors in each data point have similar variances and the two factors combine in a simple way. If the signal from each probe is proportional both to probe affinity and to gene abundance, then it must depend on their product. Suppose that, for one target gene, the chip has a set of probes p1,...,pk, and each probe pj binds to the target with affinity fj. Suppose that in each sample i the gene occurs in amount ai. Then the intensity of probe j on chip i should be proportional to fj ai. See Figure 4.
Figure 4. Ideal linear model relationship among intensity (height of green bars), abundance of transcript (ai), and probe affinity (fj).
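A small numerical illustration of the ideal model may help. The abundances and affinities below are invented for the sketch; the point is only that, when intensity is the product fj ai, the probe profiles are exactly parallel on the log scale, which is the pattern seen in Figure 3.

```python
import numpy as np

# Invented values: abundance a[i] on each of 4 chips, affinity f[j] of 5 probes.
a = np.array([50.0, 100.0, 400.0, 1600.0])
f = np.array([0.2, 1.0, 3.0, 20.0, 0.5])

# Ideal intensity of probe j on chip i is the product f[j] * a[i].
ideal = np.outer(a, f)                    # shape (chips, probes)

# On the log scale each probe differs from probe 0 by a constant offset,
# so the probe profiles across chips are parallel.
log_ideal = np.log2(ideal)
offsets = log_ideal - log_ideal[:, [0]]
parallel = bool(np.allclose(offsets, offsets[0]))

# Equivalently, the intensity table has rank 1.
rank = int(np.linalg.matrix_rank(ideal))
print(parallel, rank)
```

Note that the probe intensities on any one chip span a factor of 100 here, so a plain average over probes would be dominated by the brightest probe.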
In practice, the discrepancies between the data and the ideal model include frequent outliers, besides the usual random fluctuations in signal intensities. Outliers are measures that lie far beyond the typical range of 'noise' (random variation). These may be due to scratches, or uneven heating, or other artefacts. See Figure 1 in Quality Control. Typically 10-15% of probes in an Affymetrix chip are outliers. Most methods of fitting data founder badly on data with this many outliers. One approach is to try to identify the outliers and exclude them; this is the Li & Wong approach. Their method proceeds in this cycle: fit, identify outliers, throw out outliers, and fit again. Another approach is to use a robust fitting method. Robust methods try to fit the majority of data points quite well, but are willing to fit a small fraction quite badly. Two such methods are median polish and IRWLS (iteratively re-weighted least squares), which are implemented in RMA. Another approach is least median of squares, which is not implemented.
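To make the robust-fitting idea concrete, here is a minimal sketch of median polish applied to a log-scale table of chips by probes. It is a textbook version written for this illustration, not the code used in RMA; the function name, the data, and the single artificial outlier are assumptions.

```python
import numpy as np

def median_polish(y, n_iter=10, tol=1e-8):
    """Fit y[i, j] ~ overall + row[i] + col[j] by alternately removing medians."""
    resid = np.array(y, dtype=float)
    row = np.zeros(resid.shape[0])
    col = np.zeros(resid.shape[1])
    overall = 0.0
    for _ in range(n_iter):
        row_med = np.median(resid, axis=1)    # sweep out row (chip) medians
        resid -= row_med[:, None]
        row += row_med
        shift = np.median(col)                # keep column effects centred
        col -= shift
        overall += shift

        col_med = np.median(resid, axis=0)    # sweep out column (probe) medians
        resid -= col_med[None, :]
        col += col_med
        shift = np.median(row)                # keep row effects centred
        row -= shift
        overall += shift

        if max(np.abs(row_med).max(), np.abs(col_med).max()) < tol:
            break
    return overall, row, col, resid

# Invented probe set: 4 chips, 5 probes, with one cell inflated 30-fold.
a = np.array([50.0, 100.0, 400.0, 1600.0])
f = np.array([0.2, 1.0, 3.0, 20.0, 0.5])
pm = np.outer(a, f)
pm[2, 3] *= 30.0

overall, row, col, resid = median_polish(np.log2(pm))
chip_effect = overall + row                   # robust log2 expression per chip
naive = np.log2(pm).mean(axis=1)              # plain average of log2 probes
print(chip_effect[2] - chip_effect[1], naive[2] - naive[1])
```

In this example the true log2 difference between chips 2 and 1 is 2. The median polish estimate is unaffected by the outlier, which ends up in the residual table, while the plain average is shifted by about one unit.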
Li and Wong originally suggested the model PMij - MMij = fj ai + eij, following on from MAS 4. Since then they have found better fits with the PM-only model PMij = fj ai + eij. Li-Wong assumes that the noise in all the probe measures is roughly the same size. In practice all biological measures exhibit intensity-dependent noise (see Figure 4 in Distributions). The effect of their assumption is that probes with smaller variation are effectively ignored, even though this variation may be measuring real differences. Fortunately the bright probes are often the most specific, and it does little harm to ignore the majority of probes if the bright probes are good. They have tuned their fitting procedure to try to reduce the emphasis on the very bright probes, but this often results in a good bright probe being thrown out as an outlier.
Figure 5. A probe set with values (represented by red lines) fitted to actual PM-MM values (in blue).
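The fit / identify outliers / discard / refit cycle for the PM-only model can be sketched as follows. This is a deliberately simplified stand-in, not the dChip algorithm: the alternating least-squares fit, the rule of dropping one cell at a time, the MAD-based cutoff, the function names, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def als_fit(y, keep, n_iter=100):
    """Least-squares fit of y[i, j] ~ a[i] * f[j] over cells where keep is True."""
    f = np.ones(y.shape[1])
    for _ in range(n_iter):
        a = (keep * y * f).sum(axis=1) / (keep * f**2).sum(axis=1)
        f = (keep * y * a[:, None]).sum(axis=0) / (keep * a[:, None] ** 2).sum(axis=0)
        scale = np.sqrt(np.mean(f**2))        # fix the scale: mean of f^2 is 1
        f, a = f / scale, a * scale
    return a, f

def fit_with_outlier_removal(y, max_drops=3, cutoff=5.0):
    """Refit after dropping the worst cell, while it exceeds cutoff * sigma."""
    y = np.asarray(y, dtype=float)
    keep = np.ones(y.shape, dtype=bool)
    for _ in range(max_drops):
        a, f = als_fit(y, keep)
        resid = np.where(keep, y - np.outer(a, f), 0.0)
        sigma = 1.4826 * np.median(np.abs(resid[keep]))   # robust noise scale
        worst = np.unravel_index(np.argmax(np.abs(resid)), y.shape)
        if np.abs(resid[worst]) <= cutoff * sigma:
            return a, f, keep
        keep[worst] = False
    a, f = als_fit(y, keep)
    return a, f, keep

# Simulated probe set (invented numbers): constant noise plus one gross outlier.
rng = np.random.default_rng(0)
a_true = np.array([50.0, 100.0, 400.0, 1600.0])
f_true = np.array([0.2, 1.0, 3.0, 20.0, 0.5])
pm = np.outer(a_true, f_true) + rng.normal(0.0, 2.0, size=(4, 5))
pm[0, 2] += 20000.0

a_raw, _ = als_fit(pm, np.ones(pm.shape, dtype=bool))    # no outlier handling
a_hat, f_hat, keep = fit_with_outlier_removal(pm)
print(a_raw / a_raw[-1], a_hat / a_hat[-1], keep[0, 2])
```

Because the model fixes abundances only up to a common scale, the estimates are compared as ratios to the last chip. Without outlier handling the abundance on chip 0 is badly overestimated; once the outlying cell is excluded the ratios are close to the simulated truth.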
The RMA (Robust Multi-array Average) method is largely the work of Terry Speed's group at Berkeley, especially Ben Bolstad and Rafael Irizarry. They work only with PM values, and ignore MM entirely. They take a log transform of the PM-only model PMij = fj ai + eij and find
log(PMij) = log(ai) + log(fj) + eij.
With errors proportional to intensity in the original scale, the errors on the log scale have constant variance. After background subtraction and normalization they fit:
nlog(PMij) = log(ai) + log(fj) + eij,
where nlog is their terminology for ‘normalize and then take logarithm’. They fit this model by iteratively re-weighted least squares, or by median polish. Code is available in the affy package on Bioconductor, together with quantile normalization.
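Quantile normalization, mentioned above, forces every chip to have the same distribution of intensities. The following is a minimal sketch of the idea with invented numbers; it ignores ties and missing values, so it should not be read as the affy package's implementation.

```python
import numpy as np

def quantile_normalize(x):
    """Give every column (chip) the same set of values: the mean of the sorted columns."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0)             # sorting order within each chip
    ranks = np.argsort(order, axis=0)         # rank of each probe within its chip
    mean_sorted = np.sort(x, axis=0).mean(axis=1)
    return mean_sorted[ranks]                 # replace each value by the mean at its rank

# Rows are probes, columns are chips (invented intensities, no ties).
raw = np.array([[5.0, 4.0, 3.0],
                [2.0, 1.0, 4.0],
                [3.0, 5.0, 6.0],
                [4.0, 2.0, 8.0]])
normalized = quantile_normalize(raw)
print(normalized)
```

After normalization each chip contains exactly the same values, and the ranking of probes within each chip is unchanged. In the RMA recipe this step would be followed by taking logarithms and fitting the additive model.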
This appears to be the best overall method available. See Figure 6, which compares the performance of the methods on replicate arrays, where the criterion is that the noise (the variation between replicates) should be small. The genes were divided into four strata, from lowest to highest expression, and for each method the standard deviation of each gene's expression measure was computed. MAS 5 apparently does a decent job on high-abundance genes, but the multi-chip models do better on low-abundance genes, such as transcription factors and signalling proteins. Affymetrix has seen the evidence, and they are planning their own multi-chip model. However, details are not being revealed. Furthermore, the marketing people at the company want to remove information from the public domain. This will hinder further improvements to the model, and prevent people from using the best analytic tools for their data.
Figure 6. Comparison of MAS5 (green), dChip (black), RMA (blue), RMA (red): The genes have been divided into quarters based on average expression. Each boxplot represents the standard deviation of genes in one fraction. Note that the multi-chip models do almost ten times better than MAS5 on the low-abundance genes; this category includes most transcription factors and signalling proteins.
Li and Wong’s method is available through their program dChip, at www.dchip.org. Academic licenses are free.
The RMA method is available as part of the affy package in the Bioconductor tools suite: see www.bioconductor.org. There is also a Windows standalone program from Ben Bolstad’s web site. A commercial software vendor, Iobion, has incorporated RMA into their GeneTraffic product.