Statistical Tests

The Purpose of Formal Statistical Tests

Microarray data is used sometimes as a guide to further, more precise studies of gene expression, and sometimes the microarray data itself is presented as evidence for changes in gene abundance. If the expression data is used as evidence, then the experimenter must present the degree of evidence – ie. the correct ‘p-value’ – for each gene selected out of the thousands in the array. Most microarray papers present p-values for individual genes more significant than they should be, because many genes are tested in parallel; the p-value means something different in the multiple-testing context.


If the aim of the microarray study is to select a few key genes for more precise study by qt-PCR or other methods, then we may want simply to rank the genes in order.  Then the goal of the statistical analysis is to provide the experimenter with an ordered list of good candidate genes to follow up, where a majority (say more than half), are really different (true positives). Another way to say this is that the expected number of false positives is some manageable fraction (say less than .5) of the genes selected.  Here statistical significance is a guide to better screening. This question leads us to specify the false discovery rate (FDR), rather than significance level (p-value).


What Tests are Appropriate to Microarray Data?

Gene expression data is not ‘reasonable data’; outliers and systematic biases are common. Many kinds of statistical tests work well with reasonably distributed  (not too far from normal) data, and but give falsely significant p-values when the distribution of values has frequent extreme values. The p-values from a t-test are within 5% most of the time under a skewed distribution, but often give erroneously small or large p-values when there are more than 10% 2-fold outliers, which is not unusual with microarray data. Furthermore errors in microarray data are correlated; very few statistical tests give accurate results in this case. What kinds of tests are reliable under these circumstances?

One approach is to use non-parametric tests throughout. Non-parametric tests deliver conservative p-values for all kinds of distributions, and their p-values are generally insensitive to outliers. However they are less likely than parametric tests to pick up regulated genes. If there is a single approach that is widely applicable and copes with all these problems it is permutation testing, which is also simple and easy to program.

To do a permutation test, you

i)                    Choose a test statistic, eg. a t-score for a comparison of two groups,

ii)                   Compute the test statistic for all the genes,

iii)                 Permute the labels on samples at random, and re-compute the test statistic for the rearranged labels. Repeat for at least 1,000 permutations, and finally

iv)                 Compare the test statistic from ii) to this distribution.

A common test statistic for comparing treatment vs. control is the t-statistic.


where SDi is the (non-pooled) within-groups standard error for gene i:


Some authors use the following test statistic s rather than the usual t statistic. The denominator is larger, which eliminates genes that change only a little.


where qa is the a-th quantile of the distribution of standard errors for all genes; usually a is .8 or .9. The significance levels of this distribution must be computed by a permutation method.


What is a P-Value?

Most scientific papers quote p-values, however few papers discuss their meaning. Let’s consider, for example, a t-test of differences between two samples. If there is no systematic (real, reproducible) difference between groups, nevertheless the t-score for differences between groups is never exactly 0. Common sense cannot decide whether a particular value provides strong evidence for a real difference. We frame the question: how often a random sampling of a single group would produce a t value as far from 0 as the t we observed.

We begin by supposing for the sake of argument that there is no real difference. If one decides that a difference occurs, whenever a test statistic passes a certain threshold, then the p-value is the answer to the question: how often would the numbers deceive us? How often would random sampling from a single group give test statistics as extreme as observed? When you decide an effect is significant at 5%, you say you are willing to be wrong once in twenty decisions. (We don’t often cross the street on a 95% confidence that there is a break in traffic.)


Multiple Testing P-Values and False Positives

Suppose you compare two groups of samples drawn from the same larger group, using a chip with 10,000 genes on it. On average 500 genes will appear ‘significantly different’ at a 5% threshold. For these genes, the variation between samples will be large relative to the variation within groups due to random, but uneven allocation of the expression values to the treatment and control groups. Therefore the p-value appropriate to a single test situation is inappropriate to presenting evidence for a set of changed genes.


Statisticians have devised several procedures for adjusting p-values to correct for the multiple comparisons problem. The oldest is the Bonferroni correction; this is available as an option in many microarray software packages. The corrected p-value, pi* for gene i is set to:

pi* = Npi, if Npi < 1,

1, if Npi > 1.

This is correct, but too conservative. In practice, few genes meet this strict criterion, including many known to be differentially expressed from other work.

Another procedure is the Sidak correction:

pi* = 1 – (1 – pi)N.

This is exactly correct if all genes are independent, however it is still too conservative if test statistics are correlated (ie. genes are co-regulated). The Bonferroni correction is a conservative approximation to the Sidak: expanding 1 – (1 – pi)N = 1 – (1 – Npi + …) gives Bonferroni.


To give some idea of how to approach the problem in the realistic case when genes are correlated, imagine an extreme case: all genes are perfectly correlated. In that case all tests are identical, and p-values for one are p-values for all; no correction for multiple-comparisons is needed. In reality typical gene data is highly correlated: one factor may account for as much as half the variance. The multiple testing correction for correlated data should be weaker than for independent genes, while stronger than that for identical data. The number of extreme test statistics, and therefore extreme p-values, for correlated genes will be more variable than for independent genes, although it will have the same long-run average.

This is illustrated below


Figure. P-values from genes under null hypothesis, under various degrees of correlation


Over many random selections the average number of genes exceeding the .05 threshold will be 5%. If genes are independent, then (roughly) 5% of genes exceed the .05 threshold, all of the time. If genes are perfectly correlated, then no genes exceed the .05 threshold, 95% of the time, and all genes exceed the .05 threshold 5% of the time. In a realistically correlated situation, fewer genes exceed the .05 threshold, most of the time; for example, 2% of genes exceed the .05 threshold, 90% of the time, and 40% of the genes exceed the .05 threshold, 10% of the time. In that case the corrected p-value when 2% of the genes exceed the .05 threshold should be 10%.

This gives us an approach to correcting for multiple testing: for a group of genes which appear to differ between sample types, we ask how often would a group this size exceed the threshold that these genes exceed? To be precise: for a specific number k and a threshold a, how often will random sampling from the same group give at least k single test p-values will fall under the threshold for significance level a?

Calculating Permutation-Based P-values

To calculate corrected p-values, first calculate single-step p-values for all genes: p1, …, pN. Then pick a set of m genes, which appear interesting to you; order the smallest m p-values: p(1), …, p(m), from least to greatest. Next permute the sample labels at random, and compute the test statistic between (randomized) groups. For each k < m keep track of how often you get k p-values less than p(k). After all permutations, for each value of k compute the fraction of permutations with k p-values less than p(k). This is the corrected p-value. This procedure is more powerful than the other corrections, in that it gives a bigger list of significant genes at any specified risk of false positives.


This is fairly straightforward for the t-test, since the significance level of the t-score is known, for some other statistics, such as the s-statistic mentioned above, the significance levels themselves need to be computed by permutations. A fast method for this has been devised recently by Yongchao Ge. It is implemented in the Bioconductor package multtest. See Ge, et al, Test, 2003.