What a tangled web we unweave – deciphering raw proteomics data

Armel Nicolas

Today we will be speaking of proteomics data files and their contents. We will also talk a bit about proteomics file formats, and why they are a mess.

In a typical case, you would have run peptide samples on a mass-spectrometer, using a liquid chromatography (LC) gradient to separate peptides by some physicochemical property so they don’t rush at the mass-spectrometer’s detectors all at the same time.

Actually, let’s start by looking inside an mzXML file, one of the two main open mass-spec data formats:

Proteomics mzXML file


As we can see, the file is encoded as XML (hence the name mzXML). After a short header, it lists thousands of “scans”, each made up of annotations defining the type of scan, followed by a list of peaks stored as base64-encoded binary. Above, the peak information is shown in its encoded form, but if we were to decode it we would get a series of M/Z and intensity pairs, i.e. all that we need to define peaks. At the end of this particular file, we would also find an optional index, corresponding to the “byte offsets of each scan in the instance document”.
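To make the decoding step concrete, here is a minimal Python sketch. It assumes the usual mzXML layout: interleaved M/Z and intensity values stored as big-endian ("network" byte order) floats, with the precision (32 or 64 bits) and any zlib compression declared in the attributes of the peaks element. The function names are ours, chosen for illustration, and the encoder exists only so the example can be run without a real file.

```python
import base64
import struct
import zlib


def decode_mzxml_peaks(b64_text, precision=32, compressed=False):
    """Decode the text of a <peaks> element into (m/z, intensity) pairs."""
    raw = base64.b64decode(b64_text)
    if compressed:
        raw = zlib.decompress(raw)
    code = "f" if precision == 32 else "d"
    n_values = len(raw) // struct.calcsize(">" + code)
    # ">" means big-endian, i.e. the "network" byte order declared in mzXML
    values = struct.unpack(f">{n_values}{code}", raw)
    # Values are interleaved: m/z, intensity, m/z, intensity, ...
    return list(zip(values[0::2], values[1::2]))


def encode_mzxml_peaks(pairs, precision=32):
    """Inverse operation, used here only to build a runnable example."""
    code = "f" if precision == 32 else "d"
    flat = [value for pair in pairs for value in pair]
    packed = struct.pack(f">{len(flat)}{code}", *flat)
    return base64.b64encode(packed).decode("ascii")


# Made-up peaks, not taken from a real file
encoded = encode_mzxml_peaks([(400.5, 1000.0), (800.25, 250.0)])
print(encoded)
print(decode_mzxml_peaks(encoded))
```

In practice you would read the precision and compression settings from the file itself rather than pass them by hand.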

Not all of the potential scan annotations are shown here, and not all of those shown are relevant for us now. The important thing to know is that your results file lists thousands of scans, each of which has a specific MS “level” (most of the time MS1 or MS2; in some methods MS3 is also present). Each scan is defined by an M/Z range and is made of a series of peaks.
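As a quick illustration of this structure, the sketch below counts scans per MS level using only the Python standard library. The XML fragment is hand-written for the example and heavily stripped down (a real file carries many more attributes and the encoded peaks); the attribute names follow the mzXML schema as we understand it, and the namespace is ignored so the code does not depend on a particular schema revision.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-made, illustrative fragment: NOT a complete or real mzXML file
EXAMPLE = """<mzXML xmlns="http://sashimi.sourceforge.net/schema_revision/mzXML_3.2">
  <msRun scanCount="3">
    <scan num="1" msLevel="1" peaksCount="0" lowMz="300" highMz="1650"/>
    <scan num="2" msLevel="2" peaksCount="0" lowMz="100" highMz="1200"/>
    <scan num="3" msLevel="2" peaksCount="0" lowMz="100" highMz="1400"/>
  </msRun>
</mzXML>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def count_scans_by_level(xml_text):
    """Return a dict mapping MS level to the number of scans at that level."""
    root = ET.fromstring(xml_text)
    levels = Counter()
    # iter() walks the whole tree, so nested <scan> elements are counted too
    for element in root.iter():
        if local_name(element.tag) == "scan":
            levels[int(element.get("msLevel"))] += 1
    return dict(levels)


print(count_scans_by_level(EXAMPLE))
```

Real files are often gigabytes in size, so for anything beyond a toy example a streaming parser (for instance `ET.iterparse`) or a dedicated proteomics library would be a better choice than loading the whole document.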

Note: one of the attributes is called “centroided”. MS data can be acquired in either profile mode or centroid mode. In profile mode, each peak is represented by many data points, because real peaks as detected are actually quite wide in the M/Z dimension. Since we only need each peak’s M/Z and intensity, and in order to reduce data size (more than 10-fold!), most of the time the data is “centroided”, i.e. each peak is represented only by its integrated intensity and its M/Z (estimated as the centroid of the real peak).
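The idea behind centroiding can be sketched in a few lines of Python. This is a deliberately simplified version that assumes the profile points belonging to one peak have already been picked out; it approximates the integrated intensity by a plain sum and takes the intensity-weighted mean as the centroid. Vendor and third-party centroiding algorithms are more elaborate, and the data points below are invented.

```python
def centroid_peak(profile_points):
    """Collapse one profile-mode peak into (centroid m/z, summed intensity).

    profile_points: list of (m/z, intensity) pairs belonging to a single peak.
    """
    total = sum(intensity for _, intensity in profile_points)
    if total == 0:
        raise ValueError("peak has no signal")
    # Intensity-weighted mean of the m/z values
    centroid_mz = sum(mz * intensity for mz, intensity in profile_points) / total
    return centroid_mz, total


# Invented, symmetric profile peak: five data points become a single pair
profile = [(500.00, 10), (500.01, 50), (500.02, 100), (500.03, 50), (500.04, 10)]
print(centroid_peak(profile))
```

Five points go in and one pair comes out, which is where the size reduction comes from.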


In order to explain the notion of MS levels, we need to discuss the instrument’s duty cycle. As mentioned above, the samples are separated and sprayed into the instrument by the LC over a long gradient. During this time, the instrument repeats a cycle defined in the method, which will in general conform to the following structure:

Typical MS duty cycle (~1 to 3s depending on method):

  • One MS1 level scan:

Also called full scan, precursor scan or survey scan. The instrument is letting all peptides in the experiment’s wide isolation window hit the detector, so essentially everything that can be detected and is in the expected peptide M/Z range is detected here. At this stage, usually it is important to use the highest M/Z resolution available in order to be able to obtain a very precise M/Z value for each precursor (so, if available, use an Orbitrap over a Linear Ion Trap; see glossary below).


  • Several MS2 level scans:

These are also known as MS/MS or fragment scans. Indeed, in an MS2 scan the ions detected are the products of the fragmentation of either an isolated precursor (DDA) or a complex mixture of precursors (DIA):

  • In Data Dependent Acquisition (DDA) mode, after each MS1 the instrument will immediately select a number N (specified in the method, usually between 10 and 15 per cycle) of precursors to fragment. Precursors will be accumulated sequentially, each time setting the Quadrupole to exclude all species outside of a narrow (1.4 to 2 Th[1]) isolation window. In a Fusion instrument, fragments are analysed in the Linear Ion Trap, as this can be done in parallel while the survey scan is happening (you do not need the same resolution for MS2 spectra).

Since fragmentation requires spending time accumulating precursors, N is chosen to balance the number of peptide identifications against good MS1 coverage. The precursors to isolate and fragment are chosen based on the following principles:

  • The most intense (~ abundant) precursors are fragmented for each duty cycle, unless…
  • … they have already been fragmented recently. Because peptides typically elute over several scans (median retention length is roughly 30s for a 2h gradient), each fragmented precursor M/Z is immediately added to a dynamic exclusion list, usually for 40s, to avoid constantly re-fragmenting the same most abundant species.
  • In Data Independent Acquisition (DIA) mode, the instrument will instead follow each MS1 scan by either a single MS2 where all current precursors are fragmented together, or (in order to reduce complexity) by a series of MS2 scans, each with a wide isolation window; together, these windows will cover the whole M/Z range of interest.
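The DDA selection rules described above (pick the N most intense precursors, skip anything fragmented recently) can be sketched as follows. This is a toy model, not how any instrument firmware actually does it: the function name, the M/Z matching tolerance and the example peaks are all assumptions made for illustration, and real methods also filter on charge state, intensity thresholds and so on.

```python
def select_precursors(ms1_peaks, now, exclusion, top_n=10, exclusion_s=40.0, tol=0.01):
    """Pick up to top_n of the most intense precursors that are not excluded.

    ms1_peaks: list of (m/z, intensity) pairs from the survey scan.
    now: current retention time in seconds.
    exclusion: dict mapping m/z to the time its exclusion expires (updated in place).
    tol: hypothetical m/z tolerance used to match against the exclusion list.
    """
    # Forget entries whose exclusion period is over
    for mz in [m for m, expiry in exclusion.items() if expiry <= now]:
        del exclusion[mz]

    selected = []
    # Walk through the peaks from most to least intense
    for mz, _intensity in sorted(ms1_peaks, key=lambda peak: peak[1], reverse=True):
        if len(selected) == top_n:
            break
        if any(abs(mz - excluded) <= tol for excluded in exclusion):
            continue  # fragmented recently, skip it
        selected.append(mz)
        exclusion[mz] = now + exclusion_s
    return selected


# Invented survey scan, reused for three consecutive cycles
exclusion_list = {}
peaks = [(500.0, 1e6), (600.0, 5e5), (700.0, 1e5)]
print(select_precursors(peaks, now=0.0, exclusion=exclusion_list, top_n=2))
print(select_precursors(peaks, now=2.0, exclusion=exclusion_list, top_n=2))
print(select_precursors(peaks, now=45.0, exclusion=exclusion_list, top_n=2))
```

In the second cycle the two most intense species are still excluded, so the weaker third one finally gets its turn. This is exactly what dynamic exclusion is for.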

[1] Thomson: the M/Z unit. Equal to 1 amu (unified atomic mass unit, also called Dalton) per unit of elementary charge.
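To give a feel for the unit, here is a small sketch computing the M/Z at which a peptide would be observed. It assumes positive-mode ionisation where each charge comes from one added proton; the proton mass is rounded, and the peptide mass is an arbitrary round number chosen for the example.

```python
PROTON_MASS = 1.007276  # Da, rounded


def mz_from_mass(neutral_mass, charge):
    """m/z (in Th) of a peptide of given neutral mass carrying `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


# The same hypothetical 1000 Da peptide at charge states 1+, 2+ and 3+
for z in (1, 2, 3):
    print(z, round(mz_from_mass(1000.0, z), 4))
```

The same peptide therefore shows up at very different M/Z values depending on its charge state.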

  • Optional: one MS3 level scan per MS2 scan

MS3 scans are only involved in some setups. The example relevant to us right now is when the peptides being analysed have been labelled with Isobaric Tags (TMT or iTRAQ). After fragmentation, the labels fall off and can be quantified relative to each other. This labelling method is only compatible with a DDA setup, as in order to quantify each labelled channel relative to the others, a single precursor has to be analysed at a time[1].

It is possible to analyse TMT or iTRAQ samples using a simple MS/MS setup. However, an issue arises because the isolation window used to isolate precursors for MS2 cannot be too narrow, or else too much precursor will be lost because of border effects. This means that very frequently several precursors are co-isolated for fragmentation. While this often still allows for identification (see: Database Searching), the quantitative data will be of low quality because the signal from the target peptide’s labels will be contaminated with labels from the co-isolated peptides. This phenomenon is called Ratio Compression.
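A bit of arithmetic shows why the ratios get "compressed". In the sketch below, reporter intensities from all co-isolated peptides simply add up in each channel. That additivity is a simplifying assumption, and all the numbers are invented.

```python
def reporter_ratio(*peptides):
    """Observed channel A / channel B ratio for co-isolated peptides.

    Each peptide is an (intensity_in_channel_a, intensity_in_channel_b) pair;
    reporter signals are assumed to add up channel by channel.
    """
    channel_a = sum(peptide[0] for peptide in peptides)
    channel_b = sum(peptide[1] for peptide in peptides)
    return channel_a / channel_b


target = (1000.0, 100.0)      # true ratio of 10 between the two conditions
contaminant = (300.0, 300.0)  # unchanged background peptide, ratio of 1

print(reporter_ratio(target))               # the target on its own
print(reporter_ratio(target, contaminant))  # the target plus the co-isolated peptide
```

With these numbers a true 10-fold change is reported as 3.25-fold: the co-isolated background peptide, which does not change between conditions, pulls the measured ratio towards 1.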

In order to address this issue, a method called MultiNotch MS3 relies on a Fusion instrument’s ability to perform synchronous precursor selection. The idea is that a precursor is isolated, fragmented at medium energy (high enough to fragment it but low enough that most isobaric labels will not break off), then several fragments (up to 15 at a time, though the recommended value is 5) are co-isolated and fragmented at higher energy this time to release isobaric labels. The fact that the labels are broken off from re-isolated MS2 fragments means that there is a second step of filtering that greatly reduces the issue.


The relationship between MS1, MS2 and MS3 scans is illustrated below:

[1] Since ratios are expected to vary for different precursors. That is in fact sort of the whole point.


The relationship between MS1, MS2 and MS3 scans



So there you have it, folks: those large files that are produced by the MS every time you run a sample on it mainly contain these MS1, MS2 and sometimes MS3 spectra. We will discuss in future entries how this data is actually interpreted and turned into protein group- and peptide-level expression matrices. I would just like to conclude by saying that, sadly, every maker of MS instruments has their own in-house, often proprietary format for MS files. A useful resource for MS file formats is this Wikipedia page.

I especially like the bit that says that the .RAW formats of different makers are actually not interchangeable. This is pure genius.

Luckily, most of these formats can be converted to open formats, such as .mzXML or .mzML. MaxQuant, the software I use for most MS searches, works with Thermo .RAW files or with .mzXML. Most of the tools that can be used to analyse MS data are designed to work with .mzXML or .mzML.

Appendix: small Glossary

We thought we should quickly explain here a few things about the parts of a mass-spectrometer that we mentioned in this post.

In general, any set of electrodes used to either guide, focus, confine, filter or isolate ions in a mass spectrometer is an ion “optic”. Here, we will need to discuss the following optics:

  • A Quadrupole in the restricted sense is an ion guide and M/Z filter. In general, many ion optics, such as Linear Ion Traps, are based on the quadrupole structure.
  • A “Trap” in general is any ion optic used to accumulate ions. The most common ones are Linear Traps. Not to be confused with a “Trap Column” used in some LC setups.
  • An Orbitrap is a type of high resolution mass analyser on Thermo Instruments. It is a very complex electrode with an outer shell and an inner spindle shaped electrode. Peptides shot into an Orbitrap orbit the spindle-shaped electrode and each specific M/Z will form a ring around it, which will oscillate along the axis of the electrode, generating an induced current. After Fourier transform, the period of the oscillations of each of the components of the induced current can be used to measure at very high resolution the M/Z of each of the species present in the analyser.

An Orbitrap is usually coupled with a C-Trap (Curved Linear Trap), which can quickly accumulate and bunch up packets of ions.

  • A Linear Ion Trap is both a fragmentation chamber (sometimes offering several dissociation modes, such as CID or ETD) and a different type of mass analyser, with lower resolution than an Orbitrap. The mass-analysing part works by varying the ion confining voltage, so that ions become destabilised at a voltage which is a function of their M/Z and escape the central confinement space to hit the detectors, where an electron multiplier converts them into current.
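To make the Orbitrap entry above slightly more concrete: the frequency of the axial oscillation is proportional to the square root of the charge-to-mass ratio, so M/Z is proportional to 1/f². The sketch below uses that relationship to convert a frequency into an M/Z value after calibrating on one known ion. The calibration approach, function names and frequencies are illustrative assumptions, not a description of how the instrument software works.

```python
def calibrate(known_mz, known_freq_hz):
    """Return the constant c such that m/z = c / f**2, from one known ion."""
    return known_mz * known_freq_hz ** 2


def mz_from_frequency(freq_hz, c):
    """Convert an axial oscillation frequency (Hz) into m/z."""
    return c / freq_hz ** 2


# Made-up calibrant: an ion of m/z 400 oscillating at 500 kHz
c = calibrate(400.0, 500_000.0)
# An ion oscillating at half that frequency has four times the m/z
print(mz_from_frequency(250_000.0, c))
```

Because frequencies can be measured very precisely from the Fourier-transformed signal, so can M/Z, which is where the high resolution comes from.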

But maybe this will all be clearer if you see these parts in action in this beautiful promotional video by Thermo of one of their Fusion Instruments in glorious action:



