On the journey from gene to protein, a nascent RNA molecule can be cut and joined, or spliced, in different ways before it is translated into a protein. This process, known as alternative splicing, allows a single gene to encode many different proteins. Alternative splicing occurs in many biological processes, such as when stem cells mature into tissue-specific cells. In the context of disease, however, alternative splicing can be dysregulated. Therefore, it is important to examine the transcriptome – that is, all the RNA molecules that can be derived from genes – to understand the root cause of a condition.
However, historically it has been difficult to “read” RNA molecules in their entirety because they are often thousands of bases long. Instead, researchers have relied on so-called short-read RNA sequencing, which breaks RNA molecules into much shorter pieces, anywhere from 200 to 600 bases depending on platform and protocol, and then sequences those pieces. Computer programs are then used to reconstruct the complete sequences of RNA molecules. Short-read RNA sequencing can provide highly accurate sequencing data, with a low per-base error rate of approximately 0.1% (meaning that one base is incorrectly determined for every 1,000 bases sequenced). However, it is limited in the information it can provide due to the short length of sequencing reads. In many ways, short-read RNA sequencing is like breaking a large picture into several puzzle pieces of the same shape and size, and then trying to put the picture back together again.
Recently, “long-read” platforms have become available that can sequence RNA molecules over 10,000 bases in length end-to-end. These platforms do not require the RNA molecules to be broken down before they can be sequenced, but they have a much higher error rate per base, typically between 5% and 20%. This well-known limitation has severely hampered the widespread adoption of long-read RNA sequencing. In particular, the high error rate has made it difficult to determine the validity of novel, previously unknown RNA molecules discovered in a given condition or disease.
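To put these error rates in perspective, the short sketch below works through the arithmetic. It is only an illustration: it assumes errors occur independently and at a uniform rate along a read, which real sequencing platforms do not strictly obey, and the function names are made up for this example.

```python
def expected_errors(read_length: int, error_rate: float) -> float:
    """Expected number of miscalled bases in one read (uniform-rate assumption)."""
    return read_length * error_rate


def prob_window_error_free(window: int, error_rate: float) -> float:
    """Probability that `window` consecutive bases are all called correctly,
    assuming independent errors (a simplification)."""
    return (1.0 - error_rate) ** window


# Read lengths and error rates taken from the figures quoted in the article.
scenarios = [
    ("short read, 600 bases at 0.1%", 600, 0.001),
    ("long read, 10,000 bases at 5%", 10_000, 0.05),
    ("long read, 10,000 bases at 20%", 10_000, 0.20),
]

for label, length, rate in scenarios:
    errors = expected_errors(length, rate)
    clean = prob_window_error_free(10, rate)
    print(f"{label}: about {errors:.1f} expected errors; "
          f"P(a 10-base window is error-free) = {clean:.2f}")
```

Under these assumptions, a single long read can carry hundreds to thousands of miscalled bases, and any short stretch of it, such as the bases flanking a splice junction, has a substantial chance of containing an error.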
To get around this problem, researchers at the Children’s Hospital of Philadelphia (CHOP) have developed a new computational tool that can more accurately discover and quantify RNA molecules from this error-prone, long-read RNA sequencing data. The tool, called ESPRESSO (Error Statistics PRomoted Evaluator of Splice Site Options), was published today in Science Advances.
“Long-read RNA sequencing is a powerful technology that will allow us to discover RNA variations in rare genetic diseases and other conditions such as cancer,” said Yi Xing, PhD, director of CHOP’s Center for Computational Medicine and Genomics and senior author of the study. “We are likely at an inflection point in how we discover and analyze RNA molecules. The transition from short-read to long-read RNA sequencing represents an exciting technological transformation, and computational tools that reliably interpret long-read RNA sequencing data are urgently needed.”
ESPRESSO can accurately discover and quantify different RNA molecules from the same gene – known as RNA isoforms – using only error-prone long-read RNA sequencing data. To do this, the computational tool compares all the long RNA sequencing reads of a given gene with its corresponding genomic DNA, and then uses the error patterns of individual long reads to confidently identify the splice junctions – places where the nascent RNA molecule was cut and joined together – as well as their corresponding full-length RNA isoforms. By finding areas of perfect matches between long RNA sequencing reads and genomic DNA, and by pooling information across all long RNA sequencing reads of a gene, the tool is able to identify highly reliable splice junctions and RNA isoforms of a gene, including those that were not previously documented in existing databases.
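The general idea of combining perfect matches with evidence pooled across reads can be illustrated with a toy sketch. This is not ESPRESSO's actual algorithm, which is not detailed here; it is a deliberately simplified example in which a candidate splice junction counts as reliable only if enough reads match the genome exactly in a small window on both sides of it. The data structure, function names, window size `k`, and threshold `min_reads` are all hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set, Tuple


class JunctionObservation(NamedTuple):
    """One read's evidence for one candidate junction (hypothetical structure)."""
    donor_end: int       # genome index just past the last exonic base before the intron
    acceptor_start: int  # genome index of the first exonic base after the intron
    left_flank: str      # read bases aligned immediately upstream of the junction
    right_flank: str     # read bases aligned immediately downstream of the junction


def has_perfect_flanks(genome: str, obs: JunctionObservation, k: int) -> bool:
    """True if the read matches the genome exactly for k bases on both sides."""
    if obs.donor_end < k or len(obs.left_flank) < k or len(obs.right_flank) < k:
        return False
    left_ref = genome[obs.donor_end - k:obs.donor_end]
    right_ref = genome[obs.acceptor_start:obs.acceptor_start + k]
    if len(right_ref) < k:
        return False
    return obs.left_flank[-k:] == left_ref and obs.right_flank[:k] == right_ref


def reliable_junctions(genome: str,
                       observations: Iterable[JunctionObservation],
                       k: int = 4,
                       min_reads: int = 2) -> Set[Tuple[int, int]]:
    """Pool reads per junction; keep junctions with enough perfectly matching reads."""
    support = Counter()
    for obs in observations:
        if has_perfect_flanks(genome, obs, k):
            support[(obs.donor_end, obs.acceptor_start)] += 1
    return {junction for junction, n in support.items() if n >= min_reads}


# Tiny made-up gene: two exons separated by one intron.
exon1, intron, exon2 = "AAACCCGG", "GTTTTTTTAG", "ACGTACGT"
genome = exon1 + intron + exon2
donor_end = len(exon1)
acceptor_start = len(exon1) + len(intron)

observations = [
    JunctionObservation(donor_end, acceptor_start, "CCGG", "ACGT"),      # clean read
    JunctionObservation(donor_end, acceptor_start, "CCGG", "ACGT"),      # clean read
    JunctionObservation(donor_end, acceptor_start, "CCGA", "ACGT"),      # error near junction
    JunctionObservation(donor_end, acceptor_start + 1, "CCGG", "ACGT"),  # misplaced junction
]

print(reliable_junctions(genome, observations))
```

In this toy example, the read with an error next to the junction and the read whose junction is shifted by one base contribute no support, while the two clean reads together confirm the true junction.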
The researchers evaluated the performance of ESPRESSO using simulated data and data from real biological samples. They found that ESPRESSO outperforms several currently available tools, both in terms of RNA isoform discovery and quantification. The researchers also generated and analyzed more than 1 billion long RNA sequencing reads covering 30 human tissue types and three human cell lines, providing a useful resource for studying human transcriptome variation at the resolution of full-length RNA isoforms.
“ESPRESSO addresses a long-standing problem of long-read RNA sequencing and may open up new discovery opportunities,” said Dr. Xing. “We anticipate that ESPRESSO will be a useful tool for researchers to explore the RNA repertoire of cells in various biomedical and clinical settings.”
This work was supported in part by the Immuno-Oncology Translational Network (IOTN) of the National Cancer Institute’s Cancer Moonshot Initiative (U01CA233074), other National Institutes of Health funding (R01GM088342, R01GM121827, and R56HG012310), and a National Institutes of Health training grant in computational genomics (T32HG000046).