The term proteomics will turn 30 in 2024. The concept of a more complex protein biochemistry can be traced back to the mid-1970s with the breakthrough of 2D gel analysis. Or even earlier: in 1958, Frederick Sanger, colloquially known as the father of genomics, accepted his first Nobel Prize for sequencing insulin, a protein and the first molecule to be sequenced.
Despite a protein molecule being the first to be sequenced, it was Sanger’s DNA sequencing—his second Nobel Prize—that went on to permanently alter the biological sciences, ushering scientists into the genomics era in the late 1970s.
The term proteomics was coined to signal that genomics alone cannot fully explain biological processes. While genomics has grown exponentially since the late 1970s, proteomics has been slow to gain momentum because of inaccessibility, technical challenges, and cost, all of which are big hurdles for the average biomedical lab looking to venture into the field.
Though challenges persist, new protein sequencing technologies are emerging with potential to meet these challenges, and perhaps open a door to mainstream proteomics.
Challenges in obtaining the protein sequence
Proteins are vital to our understanding of biological function and disease. But a protein’s structure is complex, the result of the odd mutation, frameshifts, alternative splicing, somatic recombination, post-transcriptional regulation, post-translational modifications (PTMs), proteolytic cleavages, and more—all of which are vital to a protein’s function.
DNA sequencing data alone cannot predict the final protein structure, or its expected quantity; however, direct access to the protein sequence can reveal whether the protein is expressed from the expected reading frame, whether it is the result of a frameshift, whether it needs to be modified to be biologically active, and much more.
The main methods currently used to identify proteins in proteomics are immunoassays, mass spectrometry (MS), and Edman degradation. Accessing the protein sequence through these methods has been difficult due to technical constraints and inaccessibility.
For instance, because proteins are not replicated—they are made ad hoc—it can be difficult to detect them, unlike genes, which can be amplified for detection and quantification. Some researchers highlight that proteins’ limited bioavailability may be a reason why 10 percent of the human proteome remains unelucidated, despite knowledge of the human genome.
Large purified quantities of a protein are typically required for MS and Edman degradation analysis. Another reason is that some proteins may be resistant to chemical or enzymatic manipulation; trypsin is often the enzyme of choice for MS analysis, but not all proteins are sensitive to its proteolysis.
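To make the trypsin point concrete, here is a minimal sketch of an in silico tryptic digest. It assumes the commonly cited rule that trypsin cleaves on the C-terminal side of lysine (K) or arginine (R) unless the next residue is proline (P). The function name and the example sequences are illustrative and do not come from any particular tool or protein.

```python
def tryptic_digest(sequence: str) -> list[str]:
    """Split a protein sequence into peptides using a simplified trypsin rule.

    Assumption: cleavage occurs after K or R, except when followed by P.
    Missed cleavages and other real-world effects are ignored.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if residue in "KR" and not is_last and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Whatever remains after the last cleavage site is the final peptide.
        peptides.append(sequence[start:])
    return peptides


# Made-up sequence with two usable sites and one site blocked by proline.
print(tryptic_digest("AAKGGRPLLKMM"))
# Made-up sequence with no K or R: trypsin leaves it as a single piece.
print(tryptic_digest("GGSGGS"))
```

A sequence without accessible K or R residues comes back as one uncut piece, which is the simplest case of a protein that trypsin cannot break into peptides suitable for MS.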
Many more enzymes are now used in MS-based protein sequencing, but this results in complex datasets that require sophisticated algorithms for analysis. Unelucidated proteins and proteoforms could also be poor binders, avoiding detection.
Just as some proteins resist enzymatic digestion, others evade antibodies: antibodies may not bind proteins in modified states, such as activation through phosphorylation. Eighty-five percent of proteins are not suitable for targeting by small molecule binders or antibodies, earning them the moniker of the “undruggable proteome.”
Furthermore, antibodies’ specificity and reproducibility are not always consistent, so extensive validation is required. Non-specific signaling and incorrect target binding are not uncommon in immunoassays, even though immunoassays top the list of accessible methods for protein identification: researchers can readily learn and perform them at little cost compared to MS.
Edman degradation, which involves the sequential cleavage of a protein from its exposed N-terminus, is relatively accessible to non-proteomics researchers but is limited by its low throughput and its chemical incompatibility with certain PTMs that block the N-terminus.
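The stepwise logic of Edman degradation, and why a blocked N-terminus stops it, can be sketched as a toy simulation. The function name, the `n_terminus_blocked` flag, and the default cycle limit are all assumptions made for illustration; real read lengths depend on the instrument and sample.

```python
def edman_read(peptide: str, n_terminus_blocked: bool = False,
               max_cycles: int = 30) -> str:
    """Toy model of Edman degradation: one N-terminal residue per cycle.

    Assumptions: every cycle succeeds perfectly, and max_cycles is an
    arbitrary illustrative limit rather than a measured value.
    """
    if n_terminus_blocked:
        # A modification on the N-terminus prevents the first cycle,
        # so no residues can be read at all.
        return ""
    called = []
    remaining = peptide
    for _ in range(max_cycles):
        if not remaining:
            break
        called.append(remaining[0])   # identify the exposed N-terminal residue
        remaining = remaining[1:]     # cleave it, exposing the next residue
    return "".join(called)


print(edman_read("GIVEQ"))
print(edman_read("GIVEQ", n_terminus_blocked=True))
```

Because each cycle reads a single residue from a single sample, throughput scales poorly, which is the limitation described above.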
MS, which measures the mass-to-charge ratio of ionized peptides, is the least accessible to the average biomedical researcher because it must be carried out by highly trained personnel. In addition, mass spectrometers tend to occupy significant space in labs, in contrast to DNA sequencing instruments that can fit snugly at the corner of a bench. MS is also expensive to conduct routinely.
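As a rough illustration of what "mass-to-charge ratio" means for a peptide, the sketch below computes m/z from standard monoisotopic residue masses, assuming the peptide is ionized by adding one proton per unit of charge. The function names are illustrative, and the calculation ignores PTMs, isotopes, and other adducts.

```python
# Standard monoisotopic residue masses in daltons (amino acid minus water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # added once per peptide for the free termini
PROTON = 1.007276   # mass of the charge-carrying proton


def peptide_mass(sequence: str) -> float:
    """Monoisotopic mass of an unmodified, neutral peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mass_to_charge(sequence: str, charge: int) -> float:
    """m/z of a peptide carrying `charge` extra protons (charge >= 1)."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (peptide_mass(sequence) + charge * PROTON) / charge


# The same made-up peptide appears at different m/z values per charge state.
for z in (1, 2, 3):
    print(z, round(mass_to_charge("GIVEQ", z), 4))
```

Note that leucine and isoleucine share the same residue mass in the table, so a mass measurement alone cannot tell them apart.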
New technologies and innovations making an impact in proteomics
In recent years, scientists have sought inspiration from DNA sequencing by seeking ways to boost the signal of individual peptides and enable single-molecule sequencing without relying on prior knowledge of the genetic code.
Examples include the DNA nanoscope, or DNA proximity recording, of Schaus et al. (2017), which localizes and identifies specific amino acids through amplification of proximal DNA-barcoded probes with complementary primers, and nanopores, which are membrane pores that permit only single-file flow of molecules. But, like DNA proximity recording, “nanopore-based protein sensing is still in its infancy,” reads a 2021 Nature Methods review.
Recent innovative alternatives to Edman sequencing, namely fluorosequencing and N-terminus amino acid binding (NAAB), may be more suitable for mainstream adoption of protein identification, allowing diverse research groups to contribute to proteomics. Of the two, fluorosequencing’s complex chemistry requirements may incur problems such as chemical destruction of dyes, which have not been reported with NAAB single-molecule protein sequencing technology.
The latter relies on NAAB proteins, which bind to N-terminal amino acids and can be fine-tuned in specificity and affinity for different amino acids through directed evolution (yeast or phage display), a technique that is already accessible in many molecular biology labs around the world and is much more cost-effective.
NAAB-based single-molecule protein sequencing involves binding dye-labeled NAAB probes, each specific to a single amino acid, to proteins or peptides immobilized by their C-terminus. Binding takes place before or while an enzyme cleaves the peptide stepwise, and each cleavage exposes a new N-terminus; the cycle repeats until the sequence of the peptide is identified.
In 2022, a paper published in Science showed that immobilizing peptides on a semiconductor chip for NAAB-based single-molecule protein sequencing could be successfully used to generate unique “fluorescence properties and pulsing kinetics” signatures for each amino acid. The distinct signals could then be used to train software to identify amino acids as fluorescence and kinetics were recorded per well in the semiconductor chip.
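One simple way to picture "training software" on such signatures is a nearest-centroid classifier over two features per observation, for example a fluorescence measurement and a mean pulse duration. This is only a sketch of the general idea and not the method used in the paper; the feature choice, the function names, and every number below are arbitrary placeholders, not measured data.

```python
from math import dist
from statistics import mean


def fit_centroids(training):
    """Average the feature vectors of each labeled amino acid.

    training: dict mapping a label to a list of (feature_1, feature_2) tuples.
    """
    return {
        label: tuple(mean(axis) for axis in zip(*points))
        for label, points in training.items()
    }


def classify(centroids, observation):
    """Return the label whose centroid is closest to the observation."""
    return min(centroids, key=lambda label: dist(centroids[label], observation))


# Placeholder training signatures in arbitrary units (not real measurements).
training = {
    "F": [(1.0, 0.20), (1.1, 0.25)],
    "Y": [(1.0, 0.80), (0.9, 0.75)],
    "L": [(2.0, 0.50), (2.1, 0.55)],
}
centroids = fit_centroids(training)
print(classify(centroids, (1.05, 0.22)))
print(classify(centroids, (2.00, 0.50)))
```

In this toy setup two residues can share a similar first feature and still be separated by the second, which mirrors the idea of combining fluorescence with kinetics. With real data the features would need to be scaled comparably before distances are meaningful.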
The same study also used directed evolution to discover additional NAABs, which the authors dubbed “recognizers,” from ClpS family proteins and the UBR family of ubiquitin ligases, showing that specificity and sensitivity can be optimized for single-molecule detection. This type of sensitivity is not yet found in mass spectrometry-based protein sequencing.
It’s no coincidence that the first molecule to be sequenced was a protein, as proteins are vital to our understanding of biological function and disease. But for many years, the average research lab has mostly used immunoassays to study them, resorting to mass spectrometry when affordable.
Edman sequencers came with limitations such as low throughput and other technical issues. Recent innovations such as single-molecule protein sequencing through NAAB proteins may pave the way to mainstream adoption of protein sequencing by offering high-throughput capabilities and higher specificity not possible through Edman sequencing or mass spectrometry.