
In the ever-evolving world of computational biology, one of the greatest challenges has always been deriving knowledge and insight from the massive amounts of DNA (double-stranded) and RNA (typically single-stranded) data through which scientists must sift. For example, for those who remember, it took almost 13 years (1990–2003) to sequence and assemble the first near-complete draft of the human genome.
New advances in biotechnology, along with concurrent computational advances, have dramatically shortened the timeline for such analyses, with a process known as “long-read sequencing” proving very effective. The approach can be likened to reading voluminous amounts of genomic data in sentences and paragraphs instead of individual words.
Now, with funding from the National Institutes of Health (NIH), researchers from the University of Maryland and the University of North Carolina are taking methods for processing long-read RNA sequencing data to the next level. They’re developing an open-source software pipeline that they say can significantly improve the accuracy and accessibility of analyses based on these sequencing technologies.
A key component of this pipeline is a software package they have developed known as oarfish, which offers advanced techniques to minimize errors and deliver more accurate insights into gene activity. This includes analyzing complex transcriptomes (which reveal how genes turn “on” and “off” in different cells and tissues), work that is crucial for investigating disease mechanisms, developing new drugs and identifying biomarkers.
Oarfish is intended to provide an accurate, efficient and easy-to-use interface for analysis of long-read RNA sequencing data, says Rob Patro, an associate professor of computer science and a principal investigator of the NIH grant.
“Our goal, with this new support from NIH, is to significantly scale up the prior work we have done on oarfish, allowing those interested in working with long-read sequencing to develop smoother, more versatile, and more accurate workflows,” he says.
Assisting Patro on the project at UMD are Zahra Zare Jousheghani, a sixth-year doctoral student in computational biology, and Noor Pratap Singh, a sixth-year doctoral student in computer science. All three are active in the Center for Bioinformatics and Computational Biology, which is part of the University of Maryland Institute for Advanced Computer Studies, where Patro has a joint appointment and where business staff are managing the $1.4 million UMD portion of the NIH grant.
The UMD team is collaborating with Michael Love, an associate professor of biostatistics and genetics at the University of North Carolina, to expand the range of analyses that can be carried out with oarfish and to integrate oarfish into computationally reproducible research pipelines that will be used in both academia and the private sector. This collaboration will ensure oarfish works seamlessly with other tools used to study gene activity and variation, Patro says. It will also enhance and extend tximeta, an automated metadata tracking tool from a previous NIH award that supports improving the reproducibility and transparency of research processes in this area.
Looking ahead, the researchers will focus on advancing oarfish’s ability to handle complex datasets from multiple samples, and to correct for complex types of errors and biases that can occur during sequencing, while also ensuring it can be easily deployed on cloud platforms like AnVIL.
This will allow researchers to analyze even large-scale data without the need for specialized computing resources, Patro explains. To make oarfish even more user-friendly, the UMD and UNC team is developing step-by-step workflows, tutorials, and user guides, and they also plan to host both virtual and in-person workshops to ensure that researchers can fully benefit from the oarfish technology.
A strong focus on providing user support, and on making the technology accessible, well documented and easy to use, seemed to make an impression on the grant proposal reviewers at NIH. They not only recommended funding the project, but in the process scored the proposal in the top 1%.
“I was truly amazed when I saw the ranking,” Patro says. “It’s a rare achievement, and knowing our work is having such a significant impact in the field is incredibly rewarding.”
—Story by Melissa Brachfeld, UMIACS communications group