Normalization of Affymetrix Chips

Normalization by Scaling and its Limitations

The simplest approach to normalizing Affymetrix data is to re-scale each chip in an experiment by its total intensity, as described in the Normalization Introduction. Variants of this approach, which scale by the trimmed mean intensity or by the median intensity instead, are widely available in commercial software.
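A minimal sketch of scaling normalization, assuming each chip is a plain list of probe intensities. The function names, the 2% trim fraction and the choice of the average per-chip summary as the common target are illustrative assumptions, not taken from any particular software package.

```python
from statistics import mean, median

def trimmed_mean(values, trim=0.02):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return mean(kept)

def scale_chips(chips, summary=mean, target=None):
    """Multiply each chip by one factor so its summary statistic equals `target`.

    chips:  list of lists of probe intensities, one inner list per chip.
    target: common value; by default the average of the per-chip summaries
            (an assumption made for this sketch).
    """
    summaries = [summary(chip) for chip in chips]
    if target is None:
        target = mean(summaries)
    return [[x * target / s for x in chip] for chip, s in zip(chips, summaries)]

chips = [[100.0, 200.0, 300.0], [200.0, 400.0, 600.0]]
scaled_mean = scale_chips(chips)                    # scale by mean intensity
scaled_median = scale_chips(chips, summary=median)  # scale by median intensity
scaled_trimmed = scale_chips(chips, summary=trimmed_mean)
```

When every chip has the same number of probes, scaling by the mean is equivalent to scaling by total intensity. Each chip receives a single multiplicative factor, which is why this family of methods cannot correct a non-linear relation between chips.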

Affymetrix introduced a new approach for their 133 series chips, using a set of 100 'housekeeping genes': the chips are re-scaled so the average values of these housekeeping genes are equal across all chips. The author believes these approaches are adequate for about 80% of chips in practice.

To do better, we examine in detail the relationships among replicate chips (chips hybridized to the same sample). Figure 1 shows a scatter plot of probes from one pair of chips; the relation among probes is clearly non-linear. Figure 2 shows probe distributions from a number of replicate chips. These distributions have very different shapes, and a scaling transform, viewed on a log scale, only shifts the distribution curve to the right or left without changing its shape. Finally, Figure 5 shows R-I plots of pairs of Affymetrix replicate chips; a scaling transform shifts an R-I plot up or down without changing its configuration. For perhaps 80% of chips (perhaps 65% of pairs), the relationship is close enough to linear that a scaling transform gets results to within 20% of the best possible. In the remaining 20% or so of cases, the relationships among chips are quite non-linear, and we want to correct for that to get the best possible accuracy.
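The claim that scaling only shifts an R-I plot can be checked numerically. The sketch below assumes the common convention R = log2(a/b) and I = the average of log2(a) and log2(b); the text does not spell out its exact definition, and the probe values are made up for illustration.

```python
from math import log2

def ri_coordinates(chip_a, chip_b):
    """Return (I, R) pairs: I = mean log2 intensity, R = log2 ratio."""
    return [((log2(a) + log2(b)) / 2, log2(a) - log2(b))
            for a, b in zip(chip_a, chip_b)]

# Made-up probe signals for two replicate chips.
chip_a = [120.0, 450.0, 1800.0, 7000.0]
chip_b = [100.0, 500.0, 1500.0, 9000.0]

before = ri_coordinates(chip_a, chip_b)
after = ri_coordinates([2.0 * a for a in chip_a], chip_b)  # re-scale chip_a by 2

# Every R value moves by the same amount, log2(2) = 1, so the cloud of
# points keeps its shape and is only displaced.
r_shifts = [r2 - r1 for (_, r1), (_, r2) in zip(before, after)]
```

Because every point moves by the same offset, any curvature in the R-I plot survives scaling unchanged.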

Figure 1. Plot of probe signals from two Affymetrix chips hybridized with identical mRNA samples. The black straight line represents equality, while the blue curve is a spline fit through the scatter plot.


Figure 2. Density of PM probe signals on 23 different chips from GeneLogic spike-in experiment (Courtesy of Terry Speed)

Two-Parameter Methods

Two-parameter methods can do better, at the expense of greater complexity. MAS5 introduced a reference (baseline) chip method using linear regression. The procedure is to plot each chip's probes against the corresponding probes on the baseline chip, eliminate the highest 1% of probes (and, for symmetry, the lowest 1%), and fit a regression line to the middle 98% of probes.
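The trimmed regression just described can be sketched as follows. This is not the MAS5 code: ranking probes by their intensity on the chip being normalized, regressing baseline on chip, and applying the fitted line to every probe are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def normalize_to_baseline(chip, baseline, trim=0.01):
    """Fit a line on the middle probes and map the whole chip onto the baseline scale."""
    pairs = sorted(zip(chip, baseline))   # order probes by intensity on `chip`
    k = int(len(pairs) * trim)            # number of probes dropped at each end
    middle = pairs[k:len(pairs) - k] if k > 0 else pairs
    slope, intercept = fit_line([p[0] for p in middle], [p[1] for p in middle])
    return [slope * x + intercept for x in chip]

# Synthetic example: the chip is a linear function of the baseline,
# except for one saturated probe at the top.
baseline = [float(i) for i in range(1, 201)]
chip = [2.0 * b + 5.0 for b in baseline]
chip[-1] = 10000.0
normalized = normalize_to_baseline(chip, baseline)
```

Trimming keeps the saturated probe out of the fit, so the recovered line matches the underlying linear relation for the remaining probes.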

Another two-parameter approach is to both re-scale and shift the origin, in order to fit the mean and the standard deviation of each chip's probe distribution to the common mean and standard deviation of all the data. This seems to do somewhat better than regression in reducing noise (variation among replicate measures on the same sample), at the cost of sometimes introducing a few negative values.
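A minimal sketch of this shift-and-scale transform, assuming the common mean and standard deviation are those of the pooled probes from all chips and that the population standard deviation is used (both are assumptions of the sketch). The example values are made up and chosen so that a negative value appears.

```python
from statistics import mean, pstdev

def match_mean_sd(chips):
    """Shift and re-scale each chip to the pooled mean and standard deviation."""
    pooled = [x for chip in chips for x in chip]
    m0, s0 = mean(pooled), pstdev(pooled)
    out = []
    for chip in chips:
        m, s = mean(chip), pstdev(chip)
        out.append([m0 + (x - m) * s0 / s for x in chip])
    return out

chips = [[10.0, 20.0, 30.0], [100.0, 400.0, 700.0]]
matched = match_mean_sd(chips)
lowest = min(matched[0])   # falls below zero for this made-up data
```

The low-intensity chip is stretched so much around the pooled mean that its smallest probe lands below zero, which illustrates the cost mentioned above.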

Invariant Set Normalization

Li and Wong introduced a method in which a large number of genes are selected ad hoc as references, rather than using a standard set of 'housekeeping genes'. The method assumes that some subset of genes is unchanged between any two samples. It selects a subset of genes g1, …, gM whose probes p1, …, pK (K ~ 10000) occur in the same rank order on both chips, so that p1 < p2 < … < pK on each chip (an invariant set); it then fits a non-parametric curve (a running median) through the points { (p1(1), p1(2)), …, (pK(1), pK(2)) }, where pk(j) is the signal of probe k on chip j. Ideally one would like a common invariant set of reference genes across all chips, but in practice only a very few probes are in common rank order, or even close to it, across all chips.
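One way to illustrate the idea of an invariant set is to look for a largest group of probes that increases strictly on both chips, which is a longest-increasing-subsequence problem. This is a swapped-in technique chosen for a short sketch, not Li and Wong's actual selection procedure; the function names, window size and data are hypothetical.

```python
from bisect import bisect_left
from statistics import median

def invariant_set(chip1, chip2):
    """Indices of a largest probe subset that increases strictly on both chips."""
    # Sort by chip1; break ties by descending chip2 so that probes tied on
    # chip1 can never both be selected.
    order = sorted(range(len(chip1)), key=lambda i: (chip1[i], -chip2[i]))
    tails, tail_pos = [], []      # smallest ending value for each run length
    prev = [-1] * len(order)      # back-pointers used to rebuild the subset
    for pos, i in enumerate(order):
        k = bisect_left(tails, chip2[i])
        if k == len(tails):
            tails.append(chip2[i])
            tail_pos.append(pos)
        else:
            tails[k] = chip2[i]
            tail_pos[k] = pos
        prev[pos] = tail_pos[k - 1] if k > 0 else -1
    picked = []
    pos = tail_pos[-1] if tail_pos else -1
    while pos != -1:
        picked.append(order[pos])
        pos = prev[pos]
    return picked[::-1]

def running_median(ys, window=3):
    """Median of each value and its neighbours inside a centred window."""
    half = window // 2
    return [median(ys[max(0, j - half): j + half + 1]) for j in range(len(ys))]

chip1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chip2 = [10.0, 30.0, 20.0, 40.0, 50.0, 60.0]
keep = invariant_set(chip1, chip2)                   # probes in common rank order
curve_x = [chip1[i] for i in keep]
curve_y = running_median([chip2[i] for i in keep])   # normalization curve
```

The points (curve_x, curve_y) define the curve that would be used to map one chip onto the other, for example by interpolating between neighbouring points.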

Quantile Normalization

Terry Speed’s group introduced a non-parametric procedure that normalizes each chip to a synthetic reference chip. Their method assumes that the distribution of gene abundances is nearly the same in all samples. For convenience they take the pooled distribution of probes on all chips as the reference. To normalize a chip, they compute, for each value, its quantile in that chip's distribution of probe intensities, and then replace the value with the value at the same quantile of the reference distribution. In a formula, the transform is

x_{norm} = F_2^{-1}(F_1(x)),

where F_1 is the distribution function of the actual chip, and F_2 is the distribution function of the reference chip.
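The transform can be sketched directly from this formula. The sketch assumes the pooled probes of all chips as the reference distribution, as described above, and a mid-rank convention for quantiles with ties broken by sort order; both conventions are choices made here, not details from the original method.

```python
def quantile_normalize(chips):
    """Map each probe to the reference value at the same quantile.

    The reference distribution (F_2) is the pooled set of probes from all chips.
    """
    reference = sorted(x for chip in chips for x in chip)
    n_ref = len(reference)
    out = []
    for chip in chips:
        n = len(chip)
        order = sorted(range(n), key=lambda i: chip[i])   # probes by rank
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            # F_1(x) is taken as the mid-rank quantile (rank + 0.5) / n.
            # Integer arithmetic avoids floating-point rounding at bin edges.
            idx = ((2 * rank + 1) * n_ref) // (2 * n)
            normalized[i] = reference[min(idx, n_ref - 1)]  # F_2^{-1}(F_1(x))
        out.append(normalized)
    return out

chips = [[5.0, 1.0, 3.0], [20.0, 40.0, 60.0]]
normalized = quantile_normalize(chips)
```

After the transform every chip has the same set of values, and only the assignment of values to probes differs. Rank order within each chip is preserved.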

Figure 3. Schematic representation of quantile normalization: the value x, which is the a-th quantile on the chip, is mapped to the value y, which is the a-th quantile of the reference distribution.

This transform is non-linear, but usually not far from a straight line (see Figure 4). In practice it removes most of the apparent bias from the R-I plot (see Figure 5). It also reduces variance among replicates much more than normalization by scaling does.

Figure 4. Some typical transforms by quantile normalization. Many are nearly linear, but some are quite non-linear.

Figure 5A. Ratio-intensity (R-I) plot of all probes for four pairs of chips from the GeneLogic spike-in experiment.

Figure 5B. As in A, after normalization by matching quantiles. Both figures courtesy of Terry Speed.

Local Regression

We construct a synthetic reference chip by averaging the values of each probe across all chips.


Critical Assessment

Ideally we would like a method that is based on some understanding of the hybridization process and uses simple statistical procedures to bring all chips to a common reference. Scaling is simple but seems to be inaccurate. Methods based on multiple housekeeping genes, such as the MAS method for the 133 chip and the Li and Wong method, appear promising; however, they would work better if the reference set of genes were similar across all chips. These methods use a single chip as reference, so peculiarities in that chip are forced onto all the others. Quantile normalization uses a single standard for all chips, but it assumes that no serious change in distribution occurs. This appears to be a rather strong assumption about gene distributions; in practice, however, genes move up and down roughly equally, and several hundred genes would need to change greatly and in one direction to push quantile normalization into error by more than 20%. Such a one-sided change may well occur in studies of senescence, in interference with the basal transcriptional apparatus, in selective comparisons of RNAs attached to ribosomes, and perhaps in extremely malignant tumors.