So, you’ve got this beautiful proteomics experiment done. You just got your data (yay!), let’s say ratios of treated vs control, 3 replicates, simple experiment. But that’s thousands of protein groups, most of which are probably not interesting at all. We want to find out which proteins are regulated. How do we do it?
I assume, of course, that you know the basics of hypothesis testing. The general idea is that we have a Null hypothesis H0, which essentially says that nothing is happening, i.e. your treatment might as well have been water. We calculate a test statistic on the data and compare it with its known distribution under H0. Based on this, we can calculate a P-value: the probability of observing, under H0 ( = if our treatment is bogus), a result at least as extreme as the one we actually observed. We are ready to accept a specific rate of type I error[1], usually 1 or 5% in biology[2], and we will call a result significant if its P-value is lower than that error rate.
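To make the definition concrete, here is a minimal sketch in Python (all numbers are made up for illustration): we simulate the distribution of a test statistic under H0 and compute an empirical P-value for one observed result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the null distribution of our test statistic (here: the mean of
# 3 log2 ratios) by running many "experiments" in which the treatment
# does nothing, i.e. the true mean log2 ratio is 0.
null_stats = rng.normal(loc=0.0, scale=0.5, size=(100_000, 3)).mean(axis=1)

observed = 0.9  # hypothetical mean log2 ratio measured for one protein

# Two-sided empirical P-value: the fraction of null "experiments" giving
# a result at least as extreme as the one observed.
p_value = np.mean(np.abs(null_stats) >= abs(observed))
print(f"P = {p_value:.4f}")  # reject H0 at the 5% level if P < 0.05
```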
As an aside, I feel like I should make a confession here. Although as a student I had always been good at maths, it took me years to wrap my mind around these simple concepts. And to understand how critical said understanding was to correctly analysing my experiments. Of course, some of the blame lies with me. Still, I cannot help but feel that the extreme reluctance – some might say, loathing – with which most maths teachers and mathematicians touch statistics must have something to do with how poor we are at imparting statistical acumen to our students.
Anyhow… So these are the basic concepts. Now, how do they apply specifically to our proteomics data analysis?
Typically, in science, a lot of data is normal, i.e. it follows a Gaussian distribution. One of the main reasons for this is the incredibly powerful Central Limit Theorem, which states that the sum of many independent random variables tends towards a normal distribution, almost regardless of their original distribution (as long as it has finite variance).
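A quick simulation (a sketch, with arbitrary choices of distribution and sample sizes) shows the theorem in action: sums of exponential variables, a heavily skewed distribution, drift towards a Gaussian shape as more variables are added.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Sum n exponential variables (heavily skewed, decidedly non-Gaussian) and
# watch the sums drift towards normality as n grows: skewness and excess
# kurtosis both tend to 0, their values for a Gaussian.
for n in (1, 5, 30, 200):
    sums = rng.exponential(scale=1.0, size=(100_000, n)).sum(axis=1)
    print(f"n={n:3d}  skew={stats.skew(sums):+.3f}  "
          f"excess kurtosis={stats.kurtosis(sums):+.3f}")
```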
Assumptions of normality are critical for many statistical methods. Yet not all data looks even roughly normal, and even bell-shaped data cannot always be modelled, or even reasonably approximated, by a normal distribution.
[1] For reference, type I error means we decide the protein is regulated when it isn’t; type II error means we decide the protein is not regulated when it is. Usually the former is considered worse than the latter.
[2] I would be very much obliged if the two physicists at the back of the class could stop laughing hysterically. Thank you.
So, is proteomics data normal? Well, the short answer is no (strictly speaking it cannot be, since in linear scale it is always positive, whereas a normal distribution extends over all real numbers), and the long answer is sort-of-ish. Depending on the dataset, we get more or less good approximations of normality for both expression values and ratios, usually better in log scale than in linear scale. Sometimes the deviations are quite substantial. In fact, a case has been made that the Cauchy distribution, with its much heavier tails, models log ratios better than the normal distribution does[1]. However, a Cauchy model does not always work well either: depending on the dataset (even for the same type of data, e.g. TMT ratios, processed with the same algorithms), the data can look closer to a Gaussian or to a Cauchy distribution.
[Figure: example dataset where the log2 ratios (black) are better modelled by the normal (red) than by the Cauchy (orange) distribution.]
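If you want to check which of the two distributions describes your own log ratios better, a simple approach is to fit both by maximum likelihood and compare log-likelihoods. A minimal sketch (the data here is simulated as a stand-in; substitute your own column of log2 ratios):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for real data: log2 ratios for ~2000 protein groups.
log2_ratios = rng.normal(loc=0.0, scale=0.4, size=2000)

for dist in (stats.norm, stats.cauchy):
    params = dist.fit(log2_ratios)                    # maximum-likelihood fit
    loglik = dist.logpdf(log2_ratios, *params).sum()  # goodness of fit
    print(f"{dist.name:6s}  loc={params[0]:+.3f}  scale={params[1]:.3f}  "
          f"log-likelihood={loglik:.1f}")

# Both distributions have two parameters, so the higher log-likelihood
# directly indicates the better fit (AIC/BIC would agree).
```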
So the data is not exactly normal, and often the deviation is quite substantial. Is this bad? Well, it’s not ideal. Without the assumption of normality, a lot of statistical methods stop working. Things start becoming strange…
For instance, if you have replicates – and I hope you do – and want to test for significance under the Null hypothesis, you would use Student’s T-test, which assumes normality. Now, in practice it’s not so bad, because the T-test is actually very robust to deviations from normality. In addition, non-normality largely ceases to be an issue for large sample sizes (upwards of ~30 observations per test – though with only a handful of replicates per protein, we are rarely in that regime). Still, this is something you have to be aware of: the T-test may not be the optimal solution for testing significance.
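In its simplest form, the test for our 3-replicate ratio experiment is a one-sample T-test of the log2 ratios against 0 (no regulation). A minimal sketch with made-up numbers:

```python
import numpy as np
from scipy import stats

# Log2 ratios (treated vs control) for one protein group, 3 replicates.
# Under H0 the true mean log2 ratio is 0, i.e. the protein is not regulated.
log2_ratios = np.array([1.1, 0.8, 1.3])  # made-up numbers

t_stat, p_value = stats.ttest_1samp(log2_ratios, popmean=0.0)
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
```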
What solutions are there?
[1] Example: here
A second issue arises because we are testing thousands of hypotheses at once. Each P-value is the probability of observing a result “at least as extreme as the one observed” for that particular protein group, under H0. Let us say that we decide that P-values below 1% are significant. Then, if we test 100 proteins that do not respond to our treatment at all, we should still expect about 1 protein group with a “significant” P-value. And this is usually roughly the proportion we find: most proteins are not “significant”, but a handful are. So how do we know these are not just random effects?
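You can watch this happen in a quick simulation (a sketch with arbitrary noise parameters): 100 proteins that truly do not respond, 3 replicates each, tested at the 1% level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 100 proteins that do NOT respond to the treatment: their true mean log2
# ratio is 0, and we measure 3 noisy replicates for each.
null_data = rng.normal(loc=0.0, scale=0.5, size=(100, 3))

# One-sample T-test per protein (row), against a mean of 0.
p_values = stats.ttest_1samp(null_data, popmean=0.0, axis=1).pvalue
print(f"'significant' at 1%: {(p_values < 0.01).sum()} of 100 null proteins")
# On average, about 1 null protein clears the threshold by pure chance.
```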
This issue is called the Multiple Hypothesis Testing problem, and there is a range of solutions. The two classic strategies are to control the family-wise error rate, i.e. the probability of even one false positive (e.g. the Bonferroni correction), or, less stringently, the false discovery rate, i.e. the expected proportion of false positives among the proteins we call significant (e.g. the Benjamini–Hochberg procedure, sketched below).
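As an illustration of the second strategy, here is a minimal sketch of the Benjamini–Hochberg procedure (the function name and example P-values are mine, for illustration); in practice you can get the same result from statsmodels’ multipletests with method='fdr_bh'.

```python
import numpy as np

def benjamini_hochberg(p_values, fdr=0.05):
    """Boolean mask of discoveries, controlling the false discovery rate."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    # BH: find the largest k (1-indexed) with p_(k) <= (k/m)*fdr and reject
    # H0 for the k hypotheses with the smallest P-values.
    passes = p[order] <= (np.arange(1, m + 1) / m) * fdr
    discoveries = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.nonzero(passes)[0].max()  # last sorted index passing its threshold
        discoveries[order[: k + 1]] = True
    return discoveries

# Example: 95 "null" P-values mixed with 5 genuinely small ones.
rng = np.random.default_rng(0)
p_vals = np.concatenate([rng.uniform(size=95),
                         [1e-4, 2e-4, 5e-4, 1e-3, 3e-3]])
print(f"{benjamini_hochberg(p_vals).sum()} discoveries at 5% FDR")
```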
In a future entry, I hope to discuss how to select a vertical threshold empirically, i.e. what is the smallest ratio/fold change we want to include in the results.