De Novo Antibody Protein Sequencing with Mass Spectrometry
There are many situations in which one needs to sequence a protein, that is, to read out its amino acid sequence. However, sequencing proteins with traditional techniques is difficult and slow. Here we discuss some of the progress scientists have made towards high-throughput protein sequencing, particularly antibody sequencing, with mass spectrometry. We first examine the sequencing of shorter peptides (up to 50 amino acids), and then discuss ways to assemble the shorter peptides into longer proteins.
Antibody Sequencing with Mass Spectrometry
Antibody sequencing is an excellent application of de novo protein sequencing using mass spectrometry. It is particularly useful when the DNA sequence is not available: the antibody sequence can be determined directly from the antibody sample. Once the antibody’s amino acid sequence is known, one can back-translate it into a gene and start producing the same antibody in large quantities using recombinant bacteria.
Edman sequencing was invented half a century ago and is still occasionally used today to sequence short peptides. It is a cyclical process that repeatedly removes the first amino acid at the N-terminus of the peptide and detects the identity of the removed amino acid. However, errors accumulate with every cycle, which in practice prevents the process from continuing indefinitely. As a result, Edman sequencing can only be used to sequence peptides with fewer than 50 residues (and in practice, under 30). Besides the length constraint, the main drawbacks of this method are:
- It does not work for N-terminally blocked peptides, and
- Its throughput is very low, with a cycle time of 40 minutes per amino acid.
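The length limit and the throughput figure can be illustrated with a little arithmetic. The sketch below assumes a per-cycle "repetitive yield" of 95%, which is an illustrative value and not a figure from this article; the function names are likewise hypothetical. It shows how quickly the in-phase signal decays with each cycle and how long a run takes at 40 minutes per residue.

```python
def edman_signal_fraction(cycles: int, repetitive_yield: float) -> float:
    """Fraction of peptide copies still 'in phase' after a number of cycles.

    Assumes every cycle succeeds independently with the same probability
    (repetitive_yield), which is a simplification.
    """
    return repetitive_yield ** cycles


def edman_run_hours(residues: int, minutes_per_cycle: float = 40.0) -> float:
    """Total instrument time for one peptide, one cycle per residue."""
    return residues * minutes_per_cycle / 60.0


# The 0.95 yield is an assumed, illustrative value.
for n in (10, 30, 50):
    fraction = edman_signal_fraction(n, 0.95)
    print(f"{n} residues: {fraction:.3f} of signal left, "
          f"{edman_run_hours(n):.1f} h of run time")
```

Under these assumptions, well under a tenth of the original signal remains by cycle 50, and a single 30-residue peptide already occupies the instrument for 20 hours.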
De Novo Peptide Sequencing with Tandem Mass Spectrometry
Another compelling technology for sequencing short peptides (up to 50 residues) is tandem mass spectrometry (MS/MS). A mass spectrometer is an instrument that measures the mass-to-charge ratio of ions (charged molecules). To produce an MS/MS spectrum, many copies of the same peptide ion are fragmented into pieces, resulting in different fragment ions. The mass-to-charge ratios of these fragments are then measured and an MS/MS spectrum is produced. Since most amino acid residues have different mass values, different peptides produce fragment ions with different mass-to-charge ratios, and therefore different spectra. Thus, it is theoretically possible to figure out the amino acid sequence of the peptide from its MS/MS spectrum. This process is called de novo peptide sequencing.
Figure 1. An annotated tandem mass spectrum for a peptide. The colors and labels of the peaks were added once the amino acid sequence, EVHTAVTQPR, was known.
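To make the link between a sequence and its spectrum concrete, the following Python sketch computes theoretical fragment m/z values for the peptide EVHTAVTQPR. It is a simplified illustration under stated assumptions: only singly charged b- and y-ions are considered, no modifications are present, and the monoisotopic residue masses are standard values rounded to five decimals. The function name `fragment_mz` is ours and does not come from any particular software package.

```python
# Monoisotopic residue masses in daltons (standard values, rounded).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O
PROTON = 1.007276   # mass of one proton (charge carrier)


def fragment_mz(peptide: str):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i] is b(i+1): the first i+1 residues plus one proton.
    y_ions[i] is y(i+1): the last i+1 residues plus water and one proton.
    """
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1)
        prefix += m
        b_ions.append(prefix + PROTON)
    suffix = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1)
        suffix += m
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


b, y = fragment_mz("EVHTAVTQPR")
for i, (bi, yi) in enumerate(zip(b, y), start=1):
    print(f"b{i} = {bi:9.4f}    y{i} = {yi:9.4f}")
```

The difference between consecutive b-ions (or consecutive y-ions) equals one residue mass, which is what de novo sequencing tries to read off the spectrum. The table also shows why the word "most" matters above: L and I have identical masses, and K and Q differ by only about 0.036 Da.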
De novo sequencing with MS/MS avoids the two main drawbacks of Edman sequencing: it is high throughput, and it works whether or not the peptide's N-terminus is blocked. A typical LC-MS/MS setup can generate tens of thousands of MS/MS spectra per hour. This makes it feasible to use mass spectrometry to conduct a comprehensive study of the whole proteome. However, although theoretically possible, de novo sequencing with MS/MS is difficult in practice and has a high error rate.
In the past two decades, de novo sequencing as a research problem has attracted some of the most renowned bioinformaticians in the world, and dozens of software tools have been published. A review article on this topic can be found in reference [1]. Recently, the newly published Novor software [2] has made a breakthrough in peptide de novo sequencing. By machine learning on NIST’s peptide spectrum library, Novor trained a large decision tree that estimates the probability that a given amino acid at a given position in the spectrum is correct. With this scoring function, Novor can tell which amino acid sequence is the most likely sequence of the peptide for a given spectrum. With an advanced algorithm, Novor achieves a de novo sequencing speed of 300 spectra per second on a MacBook Pro laptop. This is more than 10 times faster than the second fastest program available, with significantly improved accuracy at the same time.
General Procedure for Protein Sequencing
Protein sequencing is generally achieved in a bottom-up manner by the following steps:
- Enzymatically digest the protein into shorter peptides.
- Sequence each shorter peptide.
- Assemble the peptide sequences back into the long protein sequence.
Figure 2 illustrates the procedure when step 2 (peptide sequencing) is carried out with mass spectrometry.
Figure 2. The general procedure of a bottom-up protein sequencing with LC-MS/MS.
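Step 1 of the procedure can be simulated in silico. The sketch below is a minimal illustration and not a description of any laboratory protocol: it assumes the commonly used simplified trypsin rule (cleave after K or R unless the next residue is P) and optionally keeps peptides that span missed cleavage sites. The function name `digest` and its parameters are hypothetical.

```python
def digest(protein: str, cleave_after: str = "KR",
           no_cleave_before: str = "P", missed_cleavages: int = 0):
    """Cut a protein string into peptides using a simple cleavage rule.

    Defaults model a simplified trypsin rule. With missed_cleavages=k,
    peptides spanning up to k uncut sites are also returned.
    """
    # Collect cut positions; position i means "cut before residue i".
    sites = [0]
    for i, aa in enumerate(protein[:-1]):
        if aa in cleave_after and protein[i + 1] not in no_cleave_before:
            sites.append(i + 1)
    sites.append(len(protein))

    peptides = []
    for start_idx in range(len(sites) - 1):
        for extra in range(missed_cleavages + 1):
            end_idx = start_idx + 1 + extra
            if end_idx >= len(sites):
                break
            peptides.append(protein[sites[start_idx]:sites[end_idx]])
    return peptides


# Toy sequence: the K followed by P is not cleaved.
print(digest("ACKDERFGKPHIK"))
print(digest("ACKDERFGKPHIK", missed_cleavages=1))
```

Changing `cleave_after` mimics a different enzyme, and allowing missed cleavages produces longer peptides that overlap their neighbours, which matters for the assembly in step 3.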
Challenges and Solutions for Protein Sequencing with Mass Spectrometry
The concept of this overall procedure for protein sequencing (and antibody sequencing) with mass spectrometry is not new and has been in use for quite some time (see e.g. [3]). However, this conceptually simple procedure poses many challenges in its implementation details, especially when one wants to use it routinely to sequence all the amino acids of a protein in a high-throughput manner.
First, two peptides from different parts of the protein, or from contaminants, may have a small overlap merely by chance. Thus, to facilitate the assembly in step 3, the peptides produced in step 1 should have long overlapping regions to eliminate random matches. This can be achieved by carefully selecting different enzymes that digest at different sites of the protein, as well as by controlling the digestion conditions to deliberately introduce missed cleavages. However, the non-standard enzymes produce spectra that are ill-suited to the standard analysis software used in proteomics. At Rapid Novor we handle this problem by customizing the Novor software with algorithm parameters specifically trained for each enzyme and digestion condition.
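The role of long overlaps can be seen in a toy assembler. The following sketch is a generic greedy overlap merge with a minimum overlap length. It is an assumption-laden illustration, not the assembly algorithm used in any product mentioned here; it also assumes error-free peptide strings, which real de novo results are not. The names `overlap_len`, `drop_contained`, and `greedy_assemble` are hypothetical.

```python
def overlap_len(left: str, right: str, min_overlap: int) -> int:
    """Longest suffix of `left` equal to a prefix of `right`.

    Returns 0 if the longest match is shorter than min_overlap.
    """
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def drop_contained(seqs):
    """Remove duplicates and sequences fully contained in another one."""
    unique = list(dict.fromkeys(seqs))
    return [s for s in unique if not any(s != t and s in t for t in unique)]


def greedy_assemble(peptides, min_overlap: int = 4):
    """Repeatedly merge the pair with the longest overlap."""
    contigs = drop_contained(peptides)
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap_len(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        rest = [c for idx, c in enumerate(contigs)
                if idx not in (best_i, best_j)]
        contigs = drop_contained(rest + [merged])


# Three overlapping peptides (toy data) with 5-residue overlaps.
peptides = ["EVQLVESGGG", "ESGGGLVQPGG", "VQPGGSLRLSCAAS"]
print(greedy_assemble(peptides, min_overlap=4))
print(greedy_assemble(peptides, min_overlap=6))
```

With `min_overlap=4` the three peptides merge into one contig; with `min_overlap=6` none of the 5-residue overlaps qualifies and the peptides stay separate. Raising the threshold suppresses chance matches, but it only works if digestion yields peptides whose overlaps are long enough.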
Second, the de novo sequence of each spectrum may contain sequencing errors. This is a fundamental limitation of MS/MS technology for de novo protein sequencing. If not dealt with carefully, errors in the peptides can propagate to the final protein sequencing result. Novor addresses this problem with an “amino acid confidence score” that measures the quality of each residue in the peptide sequencing result. Only highly confident residues are selected for deriving the final protein sequence. We have made great efforts to refine the amino acid confidence score by training on real-world data.
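The idea of using only confident residues can be sketched as a confidence-weighted vote. The example below is purely illustrative and is not Novor's scoring or consensus method. It assumes that each peptide has already been placed at a known start position and that every residue carries a score between 0 and 1; the data layout, the 0.8 threshold, and the function name `consensus` are all hypothetical.

```python
from collections import defaultdict


def consensus(placed_peptides, length: int, min_confidence: float = 0.8):
    """Derive a consensus sequence from positioned, scored peptides.

    placed_peptides: iterable of (start, residues, scores) tuples, where
    scores[i] is the confidence of residues[i]. Residues scoring below
    min_confidence are ignored. Positions with no confident residue are
    reported as 'X'.
    """
    votes = [defaultdict(float) for _ in range(length)]
    for start, residues, scores in placed_peptides:
        for offset, (aa, score) in enumerate(zip(residues, scores)):
            pos = start + offset
            if score >= min_confidence and 0 <= pos < length:
                votes[pos][aa] += score  # confidence-weighted vote
    return "".join(max(v, key=v.get) if v else "X" for v in votes)


# Toy data: the second peptide has a low-confidence error (I instead of L).
placed = [
    (0, "EVQL", [0.99, 0.95, 0.90, 0.50]),
    (2, "QIVE", [0.90, 0.40, 0.95, 0.99]),
    (3, "LVES", [0.97, 0.96, 0.90, 0.20]),
]
print(consensus(placed, length=7))
```

In this toy case the low-confidence "I" never gets a vote, so the error does not reach the consensus, while the final position stays undetermined because its only supporting residue scored too low.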
Additionally, the assembly process for de novo protein sequencing is algorithmically difficult, so most antibody analysis software tools avoid assembly altogether and rely on a database search or homology search alone. This type of search iteratively modifies a reference antibody sequence until it converges on the final answer. The approach works reasonably well when the difference between the reference and the target protein sequence is small, but it usually encounters difficulties, or produces an erroneous sequence, in the hypervariable regions. In particular, the dependence on a reference sequence creates a bias towards that sequence, which may be incorrect. One such example is illustrated in reference [4]. The dependence also prevents the discovery of antibody isoforms when multiple forms exist in the sample.
REmAb™ de novo Protein Sequencing
REmAb™ de novo protein sequencing provides a unique approach for antibody sequencing with mass spectrometry. We have developed an advanced, proprietary algorithm for protein de novo sequencing with an emphasis on sequencing the CDR regions of antibodies correctly. In our approach, the general structure of an antibody is used as a framework to position each de novo peptide sequence in the right place. However, the derivation of the final sequence is based on de novo sequencing and not on the reference sequence. This ensures the quality of the final result, especially for the highly variable CDR regions.
Conclusion – de novo Antibody Protein Sequencing
Although the principle of protein sequencing is easy to understand, achieving the correct result requires significant expertise in both mass spectrometry protocols and informatics data analysis. It is not so difficult to get up to 95% coverage of the protein sequence, especially when one has a similar protein to start with. However, the remaining 5% is usually highly variable, and therefore can rarely be determined by simple local alignment against known protein sequences. Yet these unsequenced hypervariable regions are usually the most interesting (for example, CDR3 of the antibody heavy chain). Software adapted from whole-proteome approaches can underperform in these areas. Good data analysis starts with good raw data generation. A laboratory that specializes in de novo protein sequencing, and specifically in de novo antibody sequencing with high-resolution mass spectrometry instrumentation, is therefore also crucial.
[1] B. Ma and R. Johnson. De Novo Sequencing and Homology Searching. Molecular & Cellular Proteomics 11. DOI 10.1074/mcp.O111.014902. 2012.
[2] B. Ma. Novor: Real-Time Peptide de Novo Sequencing Software. Journal of ASMS 26:1885-94. 2015.
[3] S. Hopper et al. Glutaredoxin from rabbit bone marrow. Purification, characterization, and amino acid sequence determined by tandem mass spectrometry. J. Biol. Chem., 264, 20438–20447. 1989.
[4] P. Taylor et al. In-Depth Characterization of Monoclonal Antibodies with a Single Experiment and Fully Automated Data Analysis. ASMS poster MP018. 2016.