Today we will be speaking of proteomics data files and their contents. We will also talk a bit about proteomics file formats, and why they are a mess.
In a typical case, you would have been running peptide samples on a mass-spectrometer, using a liquid chromatography (LC) gradient to separate peptides by some physicochemical property so they don’t rush at the mass-spectrometer’s detectors all at the same time.
Actually, let’s start by looking inside an mzXML file, one of the two main open mass-spec data formats:
As we can see, the file is encoded as XML (hence the name mzXML). After a short header, it lists thousands of “scans”, each made up of annotations defining the type of scan, followed by a list of peaks stored as base64-encoded binary. Above, the peak information is shown in its encoded form, but if we were to decode it we would get a series of M/Z and intensity pairs, i.e. all that we need to define peaks. If we were to look at the end of this particular file, we would also find an optional index, corresponding to the “byte offsets of each scan in the instance document”.
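To make the decoding step concrete, here is a minimal Python sketch. It assumes the usual mzXML conventions: values stored in network byte order (big-endian), M/Z and intensity interleaved, at the 32- or 64-bit precision declared on the peaks element, and optionally zlib-compressed. The function name and the example values are made up for illustration.

```python
import base64
import struct
import zlib


def decode_peaks(encoded, precision=32, compressed=False):
    """Decode an mzXML-style peak string into a list of (mz, intensity) pairs.

    Assumes big-endian floats with M/Z and intensity interleaved.
    """
    raw = base64.b64decode(encoded)
    if compressed:
        raw = zlib.decompress(raw)
    code = "f" if precision == 32 else "d"   # 32-bit float or 64-bit double
    count = len(raw) // (precision // 8)     # number of stored values
    values = struct.unpack(f">{count}{code}", raw)
    # Even positions hold M/Z values, odd positions hold intensities.
    return list(zip(values[0::2], values[1::2]))


# Hypothetical peaks, encoded the same way so the sketch runs without a file.
example_pairs = [(400.5, 1000.0), (500.25, 250.0)]
flat = [value for pair in example_pairs for value in pair]
example_encoded = base64.b64encode(struct.pack(">4f", *flat)).decode("ascii")

decoded_pairs = decode_peaks(example_encoded)
print(decoded_pairs)
```

In a real file, the precision and compression settings should be read from the attributes of each peaks element rather than hard-coded.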
Not all of the potential scan annotations are shown here, and not all of those shown are relevant for us now. The important thing to know is that your results file lists a series of thousands of scans, each of which has a specific MS “level” (most of the time MS1 or MS2; in some methods MS3 is also present). Each scan is defined by an M/Z range and is made of a series of peaks.
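As a quick illustration of how the scan list can be inspected, the sketch below counts scans per MS level using Python’s standard XML parser. The embedded document is a hypothetical, heavily trimmed stand-in for a real mzXML file (peaks omitted, placeholder namespace); only the `scan` element and its `msLevel` attribute are taken from the format itself.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily trimmed stand-in for an mzXML document.
# The namespace is a placeholder; real files declare the mzXML schema namespace.
SAMPLE = """<mzXML xmlns="urn:example:mzXML">
  <msRun scanCount="3">
    <scan num="1" msLevel="1" peaksCount="0" lowMz="300" highMz="1500"/>
    <scan num="2" msLevel="2" peaksCount="0" lowMz="100" highMz="1200"/>
    <scan num="3" msLevel="2" peaksCount="0" lowMz="100" highMz="1400"/>
  </msRun>
</mzXML>"""


def count_ms_levels(xml_text):
    """Return a dict mapping MS level to the number of scans at that level."""
    root = ET.fromstring(xml_text)
    levels = Counter()
    # iter() walks the whole tree, so scans nested inside other scans are found too.
    for element in root.iter():
        # Drop the "{namespace}" prefix that ElementTree adds to tag names.
        if element.tag.rsplit("}", 1)[-1] == "scan":
            levels[int(element.get("msLevel"))] += 1
    return dict(levels)


level_counts = count_ms_levels(SAMPLE)
print(level_counts)
```

For real files, which can be several gigabytes, a streaming parser or a dedicated mass-spec library would be a better choice than loading the whole document at once.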
Note: one of the attributes is called “centroided”. MS data can be acquired in either profile mode or centroid mode. In profile mode, each peak is represented by many data points, because real peaks as detected are actually quite wide in the M/Z dimension. Since we only need each peak’s M/Z and intensity, and in order to reduce data size (more than 10-fold!), the data is “centroided” most of the time, i.e. each peak is represented only by its integrated intensity and its M/Z (estimated as the centroid of the real peak).
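The idea behind centroiding can be sketched in a few lines of Python. This is a simplified illustration, not what any instrument vendor actually runs: it assumes a single, already isolated profile peak, takes the intensity-weighted mean as the centroid M/Z, and integrates the intensity with the trapezoidal rule. The function name and data points are made up.

```python
def centroid_peak(mz_values, intensities):
    """Collapse one profile-mode peak into (centroid M/Z, integrated intensity)."""
    total = sum(intensities)
    # Intensity-weighted mean M/Z: the "centre of mass" of the peak.
    centroid_mz = sum(m * i for m, i in zip(mz_values, intensities)) / total
    # Trapezoidal integration of intensity along the M/Z axis.
    area = sum(
        (mz_values[k + 1] - mz_values[k]) * (intensities[k] + intensities[k + 1]) / 2
        for k in range(len(mz_values) - 1)
    )
    return centroid_mz, area


# Hypothetical symmetric profile peak sampled at five points.
profile_mz = [500.00, 500.01, 500.02, 500.03, 500.04]
profile_intensity = [0.0, 50.0, 100.0, 50.0, 0.0]

centroid_mz, area = centroid_peak(profile_mz, profile_intensity)
print(centroid_mz, area)
```

Five data points become a single M/Z and intensity pair, which is where the reduction in file size comes from.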
In order to explain the notion of MS levels, we need to discuss the instrument’s duty cycle. As mentioned above, the samples are separated and sprayed into the instrument by the LC over a long gradient. During this time, the instrument repeats a cycle defined in the method, which will in general conform to the following structure:
MS1 scans are also called full scans, precursor scans or survey scans. The instrument lets all peptides in the experiment’s wide isolation window hit the detector, so essentially everything that can be detected and is in the expected peptide M/Z range is detected here. At this stage, it is usually important to use the highest M/Z resolution available in order to obtain a very precise M/Z value for each precursor (so, if available, use an Orbitrap over a Linear Ion Trap; see glossary below).
MS2 scans are also known as MS/MS or fragment scans. In an MS2 scan, the ions detected are the products of the fragmentation of either an isolated precursor (DDA) or a complex mixture of precursors (DIA):
Since fragmentation requires time to accumulate precursors, N (the number of MS2 scans per cycle) is chosen to balance the number of peptide identifications against good MS1 coverage. The precursors to isolate and fragment are chosen based on the following principles:
[1] Thomson: the M/Z unit. Equal to 1 Da (dalton, also called unified atomic mass unit) per elementary charge; a peptide ion’s M/Z is thus its mass in Da divided by its number of charges.
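As a small worked example of the unit, the sketch below computes the M/Z at which a peptide would be observed, assuming it is ionised by gaining one proton per charge (the usual case in positive-mode electrospray). The function name and the 1000 Da peptide are made up; the proton mass is rounded.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da (rounded)


def mz_of(neutral_mass, charge):
    """M/Z of a peptide of the given neutral mass carrying `charge` protons."""
    return (neutral_mass + charge * PROTON_MASS) / charge


# Hypothetical peptide with a neutral mass of 1000 Da.
for charge in (1, 2, 3):
    print(charge, round(mz_of(1000.0, charge), 4))
```

The same peptide therefore shows up at a different M/Z for each charge state, which is why charge has to be known before a precursor mass can be worked out.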
MS3 scans are only involved in some setups. The example relevant to us right now is when the peptides being analysed have been labelled with isobaric tags (TMT or iTRAQ). After fragmentation, the labels fall off and can be quantified relative to each other. This labelling method is only compatible with a DDA setup, because quantifying each labelled channel relative to the others requires analysing a single precursor at a time[1].
It is possible to analyse TMT or iTRAQ samples using a simple MS/MS setup. However, an issue arises because the isolation window used to isolate precursors for MS2 cannot be too narrow, or else too much precursor will be lost because of border effects. This means that very frequently several precursors are co-isolated for fragmentation. While this often still allows for identification (see: Database Searching), the quantitative data will be of low quality, because the label signal from the peptide of interest will be contaminated with labels coming from the co-isolated peptides. This phenomenon is called Ratio Compression.
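A toy calculation shows why the ratios get compressed. The sketch below assumes a deliberately simple additive model: the reporter intensity observed in each channel is the sum of what the target peptide and the co-isolated peptide contribute. All names and numbers are made up for illustration.

```python
def observed_ratio(target, contaminant):
    """Channel A / channel B ratio when two peptides are co-isolated.

    `target` and `contaminant` are (channel_a, channel_b) reporter intensities.
    Assumes reporter signals from both peptides simply add up.
    """
    channel_a = target[0] + contaminant[0]
    channel_b = target[1] + contaminant[1]
    return channel_a / channel_b


# Hypothetical target peptide with a true 10:1 ratio between two channels,
# co-isolated with an unchanged (1:1) peptide.
target_reporters = (1000.0, 100.0)
contaminant_reporters = (300.0, 300.0)

true_ratio = target_reporters[0] / target_reporters[1]
compressed_ratio = observed_ratio(target_reporters, contaminant_reporters)
print(true_ratio, compressed_ratio)
```

In this toy case a true 10-fold change is measured as 3.25-fold: because most co-isolated peptides do not change much between channels, they pull every measured ratio towards 1.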
In order to address this issue, a method called MultiNotch MS3 relies on a Fusion instrument’s ability to perform synchronous precursor selection. The idea is that a precursor is isolated, fragmented at medium energy (high enough to fragment it but low enough that most isobaric labels will not break off), then several fragments (up to 15 at a time, though the recommended value is 5) are co-isolated and fragmented at higher energy this time to release isobaric labels. The fact that the labels are broken off from re-isolated MS2 fragments means that there is a second step of filtering that greatly reduces the issue.
The relationship between MS1, MS2 and MS3 scans is illustrated below:
[1] Since ratios are expected to vary for different precursors. That is in fact sort of the whole point.
So there you have it, folks: those large files that are produced by the MS every time you run a sample on it mainly contain these MS1, MS2 and sometimes MS3 spectra. We will discuss in future entries how this data is actually interpreted and turned into protein group- and peptide-level expression matrices. I would just like to conclude by saying that, sadly, every maker of MS instruments has their own in-house, often proprietary format for MS files. A useful resource for MS file formats is this Wikipedia page.
I especially like the bit that says that the .RAW formats of different makers are actually not interchangeable. This is pure genius.
Luckily, most of these formats can be converted to open formats, such as .mzXML or .mzML. MaxQuant, the software I use for most MS searches, works with Thermo .RAW files or with .mzXML. Most of the tools that can be used to analyse MS data are designed to work with .mzXML or .mzML.
We thought we should quickly explain here a few things about the parts of a mass-spectrometer that we mentioned in this post.
In general, any set of electrodes used to either guide, focus, confine, filter or isolate ions in a mass spectrometer is an ion “optic”. Here, we will need to discuss the following optics:
An Orbitrap is usually coupled with a C-Trap (Curved Linear Trap), which can quickly accumulate and bunch up packets of ions.
But maybe this will all be clearer if you see these parts in action in this beautiful promotional video by Thermo of one of their Fusion Instruments in glorious action: