Frederick Sanger, father of DNA sequencing, sequenced the first protein, insulin, before he began his efforts in deciphering nucleotide codes.

Nowadays, DNA sequencing is so popular that it is easy to forget that the first biological material to be sequenced was a protein: insulin, by Sanger [1, 2]. Sanger and another researcher, Edman, separately pioneered protein sequencing. Edman degradation, also known as Edman sequencing, is still used, though it suffers from many shortcomings compared with today’s protein sequencing by Mass Spectrometry (MS). If you would like to learn more about Edman degradation and how it compares to MS-based protein sequencing, please read our blog post De novo Antibody Protein Sequencing with Mass Spectrometry.

Thanks to lessons learned during his protein sequencing efforts, Sanger went on to develop his now widely known and used DNA sequencing method [1, 2]. However, while DNA sequences inform researchers about potential cellular players in a given pathway, protein sequencing provides information on the actual effectors in that pathway. Protein sequencing is the process of deriving the amino acid (aa) sequence of a particular protein, today typically using MS [3]; de novo protein sequencing is performed without relying on previously established databases. Today, protein sequencing continues to provide significant information, including data that might be unavailable through DNA sequencing, such as protein modifications critical for biological functions and processes.

Protein sequencing and antibodies

One of the most useful applications of protein sequencing is determining antibody sequences [4, 5]. Arguably, protein sequencing is the most suitable technique currently available for elucidating antibody sequences. In particular, de novo protein sequencing is well suited to antibody sequencing because it avoids the errors and bias that homology-based database searching can introduce. Antibody generation involves extensive rearrangement and recombination of antibody genes in B cells to produce a diverse set of antibody clones (polyclonal antibodies, pAbs). Because the resulting sequences are absent from reference databases, antibody sequence discovery at the protein level must rely on de novo sequencing approaches.

Protein sequencing becomes especially relevant as a tool for validating antibodies used in academic and industrial research. Standard commercial antibody production begins by challenging animals with a desired antigen (target foreign molecule) to mount an immune response. The production animal’s B cells undergo hypermutation and rearrangement events that result in pAbs. Traditional molecular biology techniques are then used to separate individual clones (monoclonal antibodies, mAbs) from the pAb mixture [6]. Variation arises from many factors, with the underlying issue being that challenging the same animal with the same antigen can result in different batches of a mAb, each with a different structure and activity. DNA sequencing of mAbs for sale is often bypassed because it involves lengthy bioinformatic analysis and requires access to the cell line. MS-based protein sequencing can confirm the monoclonal antibody sequence, which can then be used for recombinant antibody expression to ensure reproducibility. MS-based protein sequencing can additionally be performed when an antibody-producing cell line has been lost, a production animal has died, or DNA is not available. In combination with other biophysical and biochemical methodologies, MS can also be used to characterize an antibody’s structure, avidity and other properties. MS-based protein sequencing is therefore a valuable tool for many aspects of scientific and medical research and development.

Protein sequencing and identification of glycan sites

MS-based protein analysis can also provide additional data unavailable through DNA sequencing. Many post-translational modifications (PTMs) important for protein function are extremely difficult, and at times impossible, to deduce from DNA sequencing. Predicting PTM sites, such as glycan sites, usually requires extensive bioinformatic analysis, whereas MS can confirm them within a week. For instance, glycosylation can occur on the nitrogen atom of the asparagine (Asn, N) side chain when Asn lies near serine (Ser, S) or threonine (Thr, T), or on the oxygen atom of the hydroxyl side chain of Ser or Thr. Glycosylation PTMs can thus be described as N-linked or O-linked. Whereas an aa sequence motif exists for N-linked glycosylation (Asn-X-Ser/Thr, where X is any aa except proline), there is no DNA/RNA or even protein motif/consensus sequence available for O-linked glycosylation PTMs [7]. Glycan identification in antibodies is particularly important because glycosylation of the antigen-binding sites affects avidity and stability [8, 9, 10].
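The N-linked motif described above (Asn-X-Ser/Thr, where X is not proline) is simple enough to scan for directly. The following is a minimal Python sketch, not part of any published pipeline: the function name find_n_glyc_sequons and the example sequence are made up for illustration. It only reports candidate sites; whether a candidate actually carries a glycan still has to be confirmed experimentally, for example by MS.

```python
import re

# N-linked sequon: Asn, then any residue except Pro, then Ser or Thr.
# The zero-width lookahead lets overlapping sequons (e.g. in "NNSS") all be found.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_n_glyc_sequons(sequence):
    """Return (1-based Asn position, matched motif) for each candidate sequon.

    Hypothetical helper for illustration; it expects one-letter amino acid codes.
    """
    seq = sequence.upper()
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(seq)]


if __name__ == "__main__":
    example = "MKNGSAANPTLLNATQ"  # made-up sequence, not a real protein
    for position, motif in find_n_glyc_sequons(example):
        print(position, motif)
```

On the made-up sequence, the scan reports NGS at position 3 and NAT at position 13, and skips NPT at position 8 because of the proline. No comparable scan can be written for O-linked sites, since no consensus motif exists for them.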

Protein sequencing as a tool to strengthen research

Despite the breadth of genomic information available, there is little proteomic information to support it. Some genomes have yet to be DNA sequenced and some remain incompletely sequenced or improperly annotated [11]. Even in completely DNA sequenced genomes, such as the human genome, the expected protein products of these genes have yet to be fully confirmed, and this can be done most efficiently via MS [11, 12]. There are an estimated 20,000 to 25,000 genes in the human genome, but these genes produce more than the predicted canonical proteins due to alternative splicing and other modifications at the RNA and protein level [13]. The portion of the human proteome discovered to date barely scratches the surface when it comes to accounting for these additional proteins; some estimate that the total number of distinct protein species in our proteome reaches one billion [13]. The discoveries made through MS protein sequencing efforts will have great implications for many aspects of scientific and medical work.

Indeed, protein and DNA sequencing data can be used alongside one another to yield more robust evidence. At the American Society for Mass Spectrometry conference (ASMS 2019), our group reported data showing that protein sequencing can be just as accurate as DNA sequencing and can fill in DNA sequencing gaps [14]. However, protein sequencing, and specifically de novo protein sequencing, has yet to be routinely used for proteome discovery. The most commonly used approach is protein identification by mass spectrometry, which relies on homology searching and comparison against databases built from DNA sequencing information. Even so, such database-dependent identification studies offer a glimpse of protein sequencing’s great potential.

Most recently, a News Feature in Nature showcased the importance of MS-based protein identification in ancient human remains [15]. With the passage of time, DNA may degrade, endure damage (resulting in missing segments), and become scarce. Reconstructing the entire target genome through DNA sequencing then becomes a painstaking, near-impossible task. This is why samples should be as fresh as possible for DNA sequencing. Proteins, however, are often more stable and persist longer than DNA. Proteins often remain in teeth and bone, and even if some degradation has occurred, fragments of the protein, called peptides, will often still be available for MS-based identification [15].

Such was the case in the study by Chen et al. published in Nature in May 2019 [16]. This is the first study to identify an ancient hominin using proteins, which marks an exciting milestone in proteomics. In their paper, they used previously reported DNA data from other hominins together with their own MS-based analysis of peptides from an ancient hominin found on the Tibetan Plateau in Baishiya Karst Cave, Xiahe, Gansu, China. Thanks to MS-based analysis, they could conclude that the fossil belonged to a Denisovan, providing evidence that Denisovans likely lived across a larger geographical area than previously thought, well outside the Altai Mountains, where the well-known Siberian Denisova Cave lies [16]. This study sets a precedent for using protein sequencing to analyze fossils in which DNA is unavailable or scant.

María Rosales Gerpe, Ph.D.
Scientific Writer
Rapid Novor, Inc.

[1]P. Berg, “Fred Sanger: A memorial tribute,” PNAS, vol. 111, no. 3, pp. 883-884, 2014.
[2]A. O. W. Stretton, “The first sequence. Fred Sanger and insulin,” Genetics, vol. 162, no. 2, pp. 527-532, 2002.
[3]S. Nachimuthu and R. Ponnusamy, “Protein sequencing techniques,” in Concepts and Techniques in Genomics and Proteomics, Cambridge, Woodhead Publishing Series in Biomedicine, 2011, pp. 193-201.
[4]P. M. Ladwig, D. R. Barnidge and M. A. Willrich, “Mass Spectrometry Approaches for Identification and Quantitation of Therapeutic Monoclonal Antibodies in the Clinical Laboratory,” Clinical and Vaccine Immunology, vol. 24, no. 5, p. e00545-16, 2017.
[5]N. Iwamoto, T. Shimada, Y. Umino, C. Aoki, Y. Aoki, T. A. Sato, A. Hamada and H. Nakagama, “Selective detection of complementarity-determining regions of monoclonal antibody by limiting protease access to the substrate: nano-surface and molecular-orientation limited proteolysis,” Analyst, vol. 139, no. 3, pp. 576-580, 2014.
[6]A. Bradbury and A. Plückthun, “Reproducibility: Standardize antibodies used in research,” Nature, vol. 518, no. 7537, pp. 27-29, 2015.
[7]H. J. An, J. W. Froehlich and C. B. Lebrilla, “Determination of Glycosylation Sites and Site-Specific Heterogeneity in Glycoproteins,” Current Opinion in Chemical Biology, vol. 13, no. 4, pp. 421-426, 2009.
[8]F. S. van de Bovenkamp, L. Hafkenscheid, T. Rispens and Y. Rombouts, “The Emerging Importance of IgG Fab Glycosylation in Immunity,” Journal of Immunology, vol. 196, no. 4, pp. 1435-1441, 2016.
[9]K. Zheng, M. Yarmarkovich, C. Bantog, R. Bayer and T. W. Patapoff, “Influence of glycosylation pattern on the molecular properties of monoclonal antibodies,” mAbs, vol. 6, no. 3, pp. 649-658, 2014.
[10]F. S. van de Bovenkamp, N. I. L. Derksen, P. Ooijevaar-de Heer, K. A. van Schie, S. Kruithof, M. A. Berkowska, E. van der Schoot, H. IJspeert, M. van der Burg, A. Gils, L. Hafkenscheid, R. E. M. Toes, Y. Rombouts, R. Plomp, M. Wuhrer, M. van Ham, G. Vidarsson and T. Rispens, “Adaptive antibody diversification through N-linked glycosylation of the immunoglobulin variable region,” PNAS, vol. 115, no. 8, pp. 1901-1906, 2018.
[11]B. Ma and R. Johnson, “De Novo Sequencing and Homology Searching,” Molecular and Cellular Proteomics, vol. 11, no. 2, p. O111.014902, 2012.
[12]S. Tanner, Z. Shen, J. Ng, L. Florea, R. Guigó, S. P. Briggs and V. Bafna, “Improving gene annotation using peptide mass spectrometry,” Genome Research, vol. 17, pp. 231-239, 2007.
[13]E. A. Ponomarenko, E. V. Poverennaya, E. V. Ilgisonis, M. A. Pyatnitskiy, A. T. Kopylov, V. G. Zgoda, A. V. Lisitsa and A. I. Archakov, “The Size of the Human Proteome: The Width and Depth,” Int J Anal Chem, vol. 2016, p. 7436849, 2016.
[14]Z. McDonald, S. Chow, K. Gorospe, X. Xu, P. Taylor, Q. Liu, Z. Li, Z. Han, T. J. Pugh, S. Trudel and B. Ma, “A Large-Scale Comparison of MS-based Antibody De Novo Protein Sequencing and Targeted DNA Sequencing,” in American Society for Mass Spectrometry (ASMS), Atlanta, 2019.
[15]M. Warren, “Move over, DNA: ancient proteins are starting to reveal humanity’s history,” Nature: News Feature, vol. 570, pp. 433-436, 2019.
[16]F. Chen, F. Welker, C.-C. Shen, S. E. Bailey, I. Bergmann, S. Davis, H. Xia, H. Wang, R. Fischer, S. E. Freidline, T.-L. Yu, M. M. Skinner, S. Stelzer, G. Dong, Q. Fu, G. Dong, J. Wang, D. Zhang and J.-J. Hublin, “A late Middle Pleistocene Denisovan mandible from the Tibetan Plateau,” Nature, vol. 569, pp. 409-412, 2019.