Where to draw the line? Hypothesis Testing in Proteomics

Dr. Armel Nicolas

So, you’ve got this beautiful proteomics experiment done. You just got your data (yay!), let’s say ratios of treated vs control, 3 replicates, simple experiment. But that’s thousands of protein groups, most of which are probably not interesting at all. We want to find out which proteins are regulated. How do we do it?


Hypothesis testing

I assume, of course, that you know the basics of hypothesis testing. The general idea is that we have a Null hypothesis H0, which is essentially that nothing is happening, i.e. your treatment might as well have been water. We calculate a test statistic from the data and compare it with its known distribution under H0. Based on this, we can calculate a P-value: the probability of randomly observing a result at least as extreme as the one observed, assuming H0 is true ( = if our treatment is bogus). We are ready to accept a specific type I error rate[1], usually 1 or 5% in biology[2], and we will call a test significant if its P-value is lower than that error rate.

As an aside, I feel like I should make a confession here. Although as a student I had always been good at maths, it took me years to wrap my mind around these simple concepts, and to understand how critical that understanding was to correctly analysing my experiments. Of course, some of the blame lies with me. Still, I cannot help but feel that the extreme reluctance – some might say, loathing – with which most maths teachers and mathematicians touch statistics must have something to do with how poor we are at imparting statistical acumen to our students.

Anyhow… So these are the basic concepts. Now, how do they particularly apply to our Proteomics data analysis?

Typically, in science, a lot of data is normal, i.e. it follows a Gaussian distribution. One of the main reasons for this is the incredibly powerful Central Limit Theorem, which states that the (suitably scaled) sum of many independent random variables with finite variance tends towards a normal distribution, largely regardless of their original distribution.
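
To get a feel for this, here is a minimal Python sketch (standard library only; the function names, the choice of exponential variables and the sample sizes are my own, purely for illustration). It sums independent draws from a strongly skewed distribution and reports how skewed the sums still are: the closer the skewness is to 0, the more symmetric, and thus the more bell-like, the result.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: about 0 for symmetric (e.g. normal) data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def sums_of_exponentials(n_terms, n_samples, rng):
    """Draw n_samples sums, each of n_terms independent exponential variables."""
    return [
        sum(rng.expovariate(1.0) for _ in range(n_terms))
        for _ in range(n_samples)
    ]


rng = random.Random(42)  # fixed seed, for reproducibility only
single = sums_of_exponentials(1, 20000, rng)   # the raw, skewed distribution
summed = sums_of_exponentials(50, 20000, rng)  # sums of 50 such variables

print(f"skewness of single exponential draws: {skewness(single):.2f}")
print(f"skewness of sums of 50 draws:         {skewness(summed):.2f}")
```

The single draws should come out heavily skewed, the sums much less so. Summing more terms pushes the skewness further towards 0.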

Assumptions of normality are critical for many statistical methods. Yet not all data looks even roughly normal, and even bell-shaped data cannot always be modelled, or even reasonably approximated, by a normal distribution.

Proteomics data graph

Happiness made stats


[1] For reference, type I error means we decide the protein is regulated when it isn’t; type II error means we decide the protein is not regulated when it is. Usually the former is considered worse than the latter.

[2] I would be very much obliged if the two physicists at the back of the class could stop laughing hysterically. Thank you.


The distribution of Proteomics data

So, is Proteomics data normal? Well, the short answer is no (strictly speaking it cannot be, since in linear scale it is always positive), and the long answer is sort-of-ish. Depending on the dataset, we get more or less good approximations of normality for both expression and ratios, usually better in log scale than in linear scale. Sometimes, the deviations are quite substantial. In fact, a case has been made that the Cauchy distribution is better than the normal distribution at modelling log ratios[1]. However, modelling proteomics ratios with a Cauchy distribution does not always work well, and depending on the dataset (even for the same type of data, e.g. TMT ratios, processed with the same algorithms) the data can look closer to a Gaussian or a Cauchy distribution.

Fitting statistical distributions to log2 proteomics ratios

Example dataset where log2 Ratios (black) are better modelled by the normal (red) than the Cauchy (orange) distribution.
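
One simple way to make such a comparison yourself is to fit both distributions and compare their log-likelihoods (higher is better). The Python sketch below is only an illustration under assumptions: the data is simulated rather than real proteomics ratios, the function names are hypothetical, and the Cauchy parameters come from quick robust estimators (median and half the interquartile range) rather than a proper maximum-likelihood fit.

```python
import math
import random
import statistics


def normal_loglik(data, mu, sigma):
    """Log-likelihood of data under a normal distribution N(mu, sigma^2)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        for x in data
    )


def cauchy_loglik(data, x0, gamma):
    """Log-likelihood of data under a Cauchy distribution (location x0, scale gamma)."""
    return sum(
        -math.log(math.pi * gamma * (1 + ((x - x0) / gamma) ** 2)) for x in data
    )


def compare_fits(log_ratios):
    """Fit both distributions; return (normal log-likelihood, Cauchy log-likelihood)."""
    mu = statistics.fmean(log_ratios)
    sigma = statistics.pstdev(log_ratios)  # maximum-likelihood estimate of sigma
    # Cauchy: median for location, half the interquartile range for scale.
    q1, median, q3 = statistics.quantiles(log_ratios, n=4)
    gamma = (q3 - q1) / 2
    return normal_loglik(log_ratios, mu, sigma), cauchy_loglik(log_ratios, median, gamma)


rng = random.Random(1)  # fixed seed, for reproducibility only
# Simulated stand-ins for log2 ratios (assumption: not real data).
gaussian_like = [rng.gauss(0.0, 0.3) for _ in range(2000)]
heavy_tailed = [0.3 * math.tan(math.pi * (rng.random() - 0.5)) for _ in range(2000)]

for name, data in (("Gaussian-like", gaussian_like), ("heavy-tailed", heavy_tailed)):
    ll_normal, ll_cauchy = compare_fits(data)
    better = "normal" if ll_normal > ll_cauchy else "Cauchy"
    print(f"{name}: normal {ll_normal:.0f}, Cauchy {ll_cauchy:.0f} -> {better} fits better")
```

On real data neither model may fit well, so it is worth looking at the fitted curves over a histogram as well, as in the figure above.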

So the data is not exactly normal, and often the deviation is quite large. Is this bad? Well, not ideal. Without the assumption of normality, a lot of statistical methods stop working. Things start becoming strange…

Douglas Adams quote

Obligatory Hitchhiker’s Guide to the Galaxy Quote

For instance, if you have replicates – and I hope you do – and want to test for significance under the Null hypothesis, you would use Student’s T-test, which assumes normality. Now, it’s not so bad, because in practice the T-test is very robust to deviations from normality. In addition, for large samples (more than ~30 observations per test), non-normality ceases to be an issue. Bear in mind, though, that the sample size that counts here is the number of replicates per protein group, not the number of protein groups in the dataset, so this rarely rescues a typical proteomics experiment. In any case, this is something you have to be aware of: the T-test may not be the optimal solution for testing significance.
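
As a concrete illustration, here is a minimal sketch of a one-sample T-test on the log2 ratios of a single protein group, testing H0: mean log ratio = 0. It uses only the Python standard library; the function names and the example values are made up, and the P-value comes from a simple numerical integration of the t density, which is fine for moderate t values but is not a substitute for an established routine such as `scipy.stats.ttest_1samp`.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|) under H0, via Simpson's rule on [0, |t_stat|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| <= T <= |t|)
    return max(0.0, 1 - central_mass)


def one_sample_t_test(log_ratios):
    """Test H0: mean log ratio = 0. Returns (t statistic, two-sided P-value)."""
    n = len(log_ratios)
    mean = statistics.fmean(log_ratios)
    sem = statistics.stdev(log_ratios) / math.sqrt(n)  # standard error of the mean
    t_stat = mean / sem
    return t_stat, two_sided_p(t_stat, df=n - 1)


# Hypothetical log2(treated/control) ratios for one protein group, 3 replicates.
t_stat, p_value = one_sample_t_test([0.9, 1.1, 1.3])
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
```

Note that with 3 replicates there are only 2 degrees of freedom, so even a very consistent change needs a large t statistic to reach a small P-value.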

What solutions are there?

  • You could use a non-parametric test, such as a permutation test. In my hands, the power is usually lower than with Student’s T-test, probably because a) as said before, the latter is pretty robust, and b) the number of replicates in most experiments is too low for resampling to come into its own. However, if you have a lot of replicates, you may get good results with this.
  • Alternatively, you could at least try to use an “improved T-test”, something you may want to do regardless of the issue with normality. The moderated T-test estimates each protein group’s variance using information from all protein groups, not just that single one. Although it was originally developed for Transcriptomics datasets, it works equally well in proteomics. It has become the standard method for us.
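
To see why few replicates hurt resampling approaches, consider this minimal sketch in Python (standard library only; the function name and example values are hypothetical). It uses sign flipping, the variant of a permutation test that suits one-sample data such as log ratios; that is my choice for illustration, not necessarily the exact test referred to above. Under H0 each log ratio is as likely to be positive as negative, so we enumerate every sign assignment and count how often the mean is at least as extreme as the observed one.

```python
import itertools
import statistics


def sign_flip_test(log_ratios):
    """Exact two-sided sign-flip test of H0: log ratios are symmetric around 0."""
    observed = abs(statistics.fmean(log_ratios))
    hits = 0
    total = 0
    # Enumerate all 2^n ways of assigning a sign to each replicate.
    for signs in itertools.product((1, -1), repeat=len(log_ratios)):
        flipped = [s * x for s, x in zip(signs, log_ratios)]
        total += 1
        # Small tolerance so that ties are not lost to floating-point rounding.
        if abs(statistics.fmean(flipped)) >= observed - 1e-12:
            hits += 1
    return hits / total


three_reps = [0.9, 1.1, 1.3]                     # hypothetical, 3 replicates
ten_reps = [1.0 + 0.05 * i for i in range(10)]   # hypothetical, 10 replicates

print(f"3 replicates:  P = {sign_flip_test(three_reps):.4f}")
print(f"10 replicates: P = {sign_flip_test(ten_reps):.4f}")
```

With n replicates there are only 2^n sign assignments, so the smallest achievable two-sided P-value is 2/2^n: 0.25 for 3 replicates, however consistent the data, against about 0.002 for 10.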

[1] Example: here

Multiple hypothesis testing

A second issue is that of multiple hypothesis testing. Each P-value is the probability of observing a result “at least as extreme as the one observed” for that particular protein group, under H0. Let us say that we decide that P-values below 1% are significant. So if we test 100 proteins that do not respond to our treatment, we should expect about 1 protein group with a “significant” P-value. This is, usually, roughly the proportion we find: most proteins are not “significant”, but a handful are. So how do we know these are not just random effects?
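
A quick simulation makes the point. Under H0, and for a continuous test statistic, P-values are uniformly distributed between 0 and 1, so we can mimic a dataset in which nothing at all is regulated by drawing uniform random numbers. The sketch below is Python, standard library only; the function name and the dataset size of 5000 protein groups are arbitrary assumptions.

```python
import random


def count_false_positives(n_tests, threshold, rng):
    """Count P-values below threshold when every Null hypothesis is true."""
    # Under H0, each P-value is a uniform draw from [0, 1).
    return sum(rng.random() < threshold for _ in range(n_tests))


rng = random.Random(7)  # fixed seed, for reproducibility only
n_protein_groups = 5000
threshold = 0.01

hits = count_false_positives(n_protein_groups, threshold, rng)
print(f"'significant' protein groups with no real effect: {hits}")
print(f"expected on average: {n_protein_groups * threshold:.0f}")
```

So an uncorrected 1% threshold applied to a few thousand protein groups will hand you dozens of "hits" even if the treatment does nothing.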

This issue is called the Multiple Hypothesis Testing problem, and there is a range of solutions:

  • The Family Wise Error Rate (FWER) is the probability that the list of significantly regulated proteins includes at least one falsely rejected Null hypothesis. Controlling it is thus extremely stringent, and rarely done in proteomics. To control the FWER, one can use the Bonferroni correction. Essentially, it takes the significance threshold T, but applies it globally. Thus, for T = 0.01 (1%) and N P-values, under that correction a P-value would have to be lower than 0.01/N to be deemed significant.
  • The False Discovery Rate (FDR) approach instead tries to control the proportion of false discoveries in the results. We decide beforehand to accept a proportion α of false discoveries, then calculate a threshold such that the global FDR is expected to be at worst equal to α. The most commonly used correction for this is the Benjamini-Hochberg procedure. Using this, one can calculate new significance levels for a chosen α. Since it is less stringent than FWER control, FDR correction allows for more discoveries – but at a greater risk of false positives. Thus, while it is useful for proteomics – where we want to generate leads – one should bear in mind that validation experiments are important.
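
Both corrections are short enough to sketch in plain Python. The function names and example P-values below are made up for illustration. The Benjamini-Hochberg function follows the usual step-up rule: sort the P-values, find the largest rank k such that the k-th smallest P-value is at most (k/N)·α, and call the k smallest P-values significant.

```python
def bonferroni_significant(p_values, threshold=0.01):
    """Flag P-values below threshold / N (Bonferroni correction)."""
    cutoff = threshold / len(p_values)
    return [p < cutoff for p in p_values]


def benjamini_hochberg_significant(p_values, alpha=0.05):
    """Flag P-values deemed significant by the Benjamini-Hochberg procedure."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest P first
    # Find the largest 1-based rank k with p_(k) <= (k / n) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * alpha:
            max_rank = rank
    # Everything ranked at or below max_rank is significant, even if an
    # individual P-value missed its own rank's cutoff (the "step-up" part).
    significant = [False] * n
    for idx in order[:max_rank]:
        significant[idx] = True
    return significant


# Hypothetical P-values for 10 protein groups, in no particular order.
p_values = [0.2, 0.0001, 0.019, 0.5, 0.004, 0.016, 0.7, 0.9, 0.95, 0.99]

bonf = bonferroni_significant(p_values, threshold=0.05)
bh = benjamini_hochberg_significant(p_values, alpha=0.05)
print("Bonferroni hits:        ", sum(bonf))
print("Benjamini-Hochberg hits:", sum(bh))
```

On this toy example Bonferroni keeps 2 protein groups and Benjamini-Hochberg keeps 4, which illustrates the trade-off described above: more leads, at the price of tolerating a proportion of false ones.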

Volcano plot with significance thresholds for different FDRs (yellow)

In a future entry, I hope to discuss how to select a vertical threshold empirically, i.e. what is the smallest ratio/fold change we want to include in the results.

