Written by María Rosales Gerpe, PhD

August 2, 2019

A brief history of protein sequencing

Frederick Sanger, father of DNA sequencing, sequenced the first protein: insulin.

Nowadays, DNA sequencing is so popular that it is easy to forget that the first biological material to be sequenced was a protein: insulin, by Sanger [1, 2]. Sanger and another researcher, Edman, separately pioneered protein sequencing. Edman degradation, also known as Edman sequencing, is still used, though it suffers from many shortcomings compared with today’s protein sequencing by mass spectrometry (MS). If you would like to learn more about Edman degradation and how it compares to MS-based protein sequencing, please read our blog post De novo Antibody Protein Sequencing with Mass Spectrometry.

Thanks to lessons learned during his protein sequencing efforts, Sanger went on to develop his now widely used DNA sequencing method [1, 2]. However, while DNA sequences inform researchers about potential cellular players in a given pathway, protein sequencing reports on the actual effectors in that pathway. Protein sequencing is the process of deriving the amino acid (aa) sequence of a particular protein using MS [3]; de novo protein sequencing is typically performed without relying on previously established databases. Today, protein sequencing continues to provide significant information, including data that might be unavailable through DNA sequencing, such as protein modifications critical for biological functions and processes.
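To make the idea of reading a sequence without a database more concrete, here is a minimal Python sketch of one textbook principle: in an idealized fragment-ion ladder, the mass difference between neighbouring peaks corresponds to one amino acid residue. The function name, the tolerance value, and the example peak list are assumptions made for illustration only. Real spectra are noisy, contain gaps and mixed ion series, and this is not a description of any production sequencing algorithm.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASSES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "I": 113.08406,
    "L": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks, tolerance=0.01):
    """Translate gaps between consecutive singly charged peaks into residues.

    Residues that cannot be told apart by mass (I and L) are reported
    together as "I/L"; gaps that match no residue are reported as "?".
    """
    ordered = sorted(peaks)
    residues = []
    for lower, upper in zip(ordered, ordered[1:]):
        delta = upper - lower
        matches = [aa for aa, mass in RESIDUE_MASSES.items()
                   if abs(mass - delta) <= tolerance]
        residues.append("/".join(matches) if matches else "?")
    return residues


if __name__ == "__main__":
    # Hypothetical, noise-free ladder invented for this example.
    print(read_ladder([58.02874, 129.06585, 242.14991]))
```

Note that isoleucine and leucine have identical masses, so a simple mass ladder like this one cannot distinguish them.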

Protein sequencing and antibodies

One of the most useful applications of protein sequencing is determining antibody sequences [4, 5]. Arguably, protein sequencing is the most suitable technique currently available for elucidating antibody sequences. In particular, de novo protein sequencing is well suited to antibody sequencing because it avoids the errors and bias that homology-based database searching can introduce. The process of antibody generation involves extensive rearrangements and combinations of plasma cell genes to produce a diverse set of antibody clones (polyclonal antibodies, pAbs). Because the resulting sequences are not found in any reference database, antibody sequence discovery must be done through de novo sequencing approaches.

Protein sequencing is especially relevant as a tool for validating antibodies used in academic and industrial research. Standard commercial antibody production begins by challenging animals with a desired antigen (target foreign molecule) to mount an immune response. The production animal’s plasma cells undergo genetic hypermutation and rearrangement events that result in pAbs. Traditional molecular biology techniques are then used to separate individual clones (monoclonal antibodies, mAbs) from the pAb mixture [6]. Variation arises from many factors, with the underlying issue being that challenging the same animal with the same antigen will result in different batches of a mAb, each with different structure and activity. DNA sequencing of mAbs for sale is often bypassed, as it involves lengthy bioinformatic analysis and requires access to the cell line. MS-based protein sequencing can readily confirm the monoclonal antibody sequence, which can then be used for recombinant antibody expression to ensure reproducibility. MS-based protein sequencing can additionally be performed when an antibody-producing cell line is gone, a production animal has died, or DNA is not available. In combination with other biophysical and biochemical methodologies, MS can also be used to characterize an antibody’s structure, avidity and other properties. MS-based protein sequencing is therefore a valuable tool for many aspects of scientific and medical research and development.

Protein sequencing and identification of glycan sites

MS-based protein analysis can also be used to provide additional data unavailable through DNA sequencing. Many post-translational modifications (PTMs) important for protein function are extremely difficult, and at times impossible, to deduce from DNA sequencing. Most of the time, extensive bioinformatics is required to predict PTM sites such as glycan sites, while MS can confirm them in a week. For instance, glycosylation can occur at asparagine (Asn, N) residues near serine (Ser, S) or threonine (Thr, T), or at the oxygen atom of the hydroxyl sidechain of Ser or Thr. Glycosylation PTMs can thus be described as N-linked or O-linked. Whereas an aa sequence motif exists for N-linked glycosylation (Asn-X-Ser/Thr, where X is any aa but proline), there is no DNA/RNA or even protein motif/consensus sequence available for O-linked glycosylation PTMs [7]. Glycan identification in antibodies is particularly valuable because glycosylation of the antigen binding sites is important for avidity and stability [8, 9, 10].
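The N-linked motif described above (Asn-X-Ser/Thr, where X is any aa but proline) is simple enough to scan for in a one-letter aa sequence. The following Python sketch shows one way to do so; the function name and the example sequence are made up for illustration. A match only flags a candidate site: whether a glycan is actually attached still has to be confirmed experimentally, for example by MS, and no comparable scan exists for O-linked sites.

```python
import re

# N-linked sequon: Asn, then any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping motifs (e.g. in "NNSS") all be reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_n_glyco_sequons(sequence):
    """Return (1-based Asn position, motif) for each candidate N-linked site."""
    seq = sequence.upper()
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(seq)]


if __name__ == "__main__":
    demo = "ANGTNPSNNSS"  # made-up sequence for illustration
    for position, motif in find_n_glyco_sequons(demo):
        print(position, motif)
```

In the made-up sequence, the stretch Asn-Pro-Ser is skipped because proline occupies the X position, while the two overlapping motifs near the end are both reported.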

Protein sequencing advantages over DNA sequencing

Despite the breadth of genomic information available, there is little proteomic information to support it. Some genomes have yet to be DNA sequenced and some remain incompletely sequenced or improperly annotated [11]. Even in completely DNA sequenced genomes, such as the human genome, the expected protein products of these genes have yet to be confirmed, and this can be done most efficiently via MS [11, 12]. There are 20,000 to 25,000 genes in the human genome, but these genes produce more than the predicted canonical proteins due to alternative splicing and other modifications at the RNA and protein level [13]. The portion of the human proteome discovered to date barely scratches the surface when it comes to accounting for these additional proteins; some estimate that the total number of proteins in our proteome reaches one billion [13]. The discoveries made through MS protein sequencing efforts will have great implications for many aspects of scientific and medical work.

Protein sequencing as a tool to strengthen research

Indeed, protein and DNA sequencing data can be used alongside one another to yield more robust evidence. At the American Society for Mass Spectrometry conference (ASMS 2019), our group reported data showing that protein sequencing can be just as accurate as DNA sequencing and can fill in DNA sequencing gaps [14]. However, protein sequencing, and specifically de novo protein sequencing, has yet to be routinely used for proteome discovery. The most often used approach is protein identification by mass spectrometry, which relies on homology searching and comparison against databases built from DNA sequencing information. Even so, studies using this database-dependent approach still offer a glimpse of protein sequencing’s great potential.

Most recently, a News Feature in Nature showcased the importance of MS-based protein identification of ancient human remains [15]. Over time, DNA may degrade, endure damage (resulting in missing segments), and become scarce. Reconstituting the entire target genome through DNA sequencing then becomes a painstaking, near-impossible task. This is why samples must be as fresh as possible for DNA sequencing. Proteins, however, are often more stable and persist longer than DNA. Proteins often remain in teeth and bone, and even if some degradation has occurred, fragments of the proteins (peptides) will still be available for MS-based identification [15].

Such was the case in the study by Chen et al. published in Nature in May of this year [16]. This is the first study to identify an ancient hominin using proteins, which marks an exciting milestone in proteomics. In their paper, they combined previously reported DNA data from other hominins with their own MS-based analysis of peptides from an ancient hominin found on the Tibetan Plateau in Baishiya Karst Cave, Xiahe, Gansu, China. Thanks to MS-based analysis, they could conclude that the fossils belonged to a Denisovan, providing evidence that Denisovans likely resided in a larger geographical area than previously thought, well outside the Altai Mountains, where the well-known Siberian Denisova Cave lies [16]. This study sets a precedent that protein sequencing could be used in the analysis of fossils where DNA is unavailable or scant.

References

  1. B. Domon and R. Aebersold, “Mass Spectrometry and Protein Analysis,” Science, vol. 312, no. 5771, pp. 212-217, 2006.
  2. K. A. Resing and N. G. Ahn, “Proteomics strategies for protein identification,” FEBS Letters, vol. 579, no. 4, pp. 885-889, 2005.
  3. A. D. Catherman, O. S. Skinner and N. L. Kelleher, “Top Down Proteomics: Facts and Perspectives,” Biochemical and Biophysical Research Communications, vol. 445, no. 4, pp. 683-693, 2014.
  4. D. P. Donnelly, C. M. Rawlins, C. J. DeHart, L. Fornelli, L. F. Schachner, Z. Lin, J. L. Lippens, K. C. Aluri, R. Sarin, B. Chen, C. Lantz, W. Jung, K. R. Johnson, A. Koller, J. J. Wolff, I. D. G. Campuzano, J. R. Auclair, A. R. Ivanov, J. P. Whitelegge, L. Paša-Tolić, J. Chamot-Rooke, P. O. Danis, L. M. Smith, Y. O. Tsybin, J. A. Loo, Y. Ge, N. L. Kelleher and J. N. Agar, “Best practices and benchmarks for intact protein analysis for top-down mass spectrometry,” Nature Methods, vol. 16, no. 7, pp. 587-594, 2019.
  5. M. Kinter and N. E. Sherman, “Ionization Methods,” in Protein Sequencing and Identification Using Tandem Mass Spectrometry, D. M. Desiderio and N. M. M. Nibbering, Eds., Toronto, Wiley-Interscience, 2000, p. 35.
  6. M. Kinter and N. E. Sherman, “Ionization Methods,” in Protein Sequencing and Identification Using Tandem Mass Spectrometry, D. M. Desiderio and N. M. M. Nibbering, Eds., Toronto, Wiley-Interscience, 2000, p. 35.
  7. Y. Zhang, B. R. Fonslow, B. Shan, M.-C. Baek and J. R. Yates, III, “Protein Analysis by Shotgun/Bottom-up Proteomics,” Chemical Reviews, vol. 113, no. 4, pp. 2343-2394, 2013.
  8. C. Wu, J. C. Tran, L. Zamdborg, K. R. Durbin, M. Li, D. R. Ahlf, B. P. Early, P. M. Thomas, J. V. Sweedler and N. L. Kelleher, “A Protease for Middle Down Proteomics,” Nature Methods, vol. 9, no. 8, pp. 822-824, 2012.

Talk to Our Scientists.

We Have Sequenced 3000+ Antibodies and We Are Eager to Help You.

Through next generation protein sequencing, Rapid Novor enables reliable discovery and development of novel reagents, diagnostics, and therapeutics. Thanks to our Next Generation Protein Sequencing and antibody discovery services, researchers have furthered thousands of projects, patented antibody therapeutics, and developed the first recombinant polyclonal antibody diagnostics.
