Researchers in the Center for Bioinformatics and Computational Biology (CBCB) have four papers accepted to the Conference on Research in Computational Molecular Biology (RECOMB 2023), held this year from April 16–19 in Istanbul.
The papers—which introduce new methods and tools to improve genome sequencing so that scientists can better study evolutionary trees, tumors and cancer—were coauthored by Erin Molloy, an assistant professor of computer science; Rob Patro, an associate professor of computer science; and their graduate students.
“RECOMB is long established as one of the very best international conferences in computational biology and bioinformatics,” says Michael Cummings, a professor of biology and director of CBCB. “The fact that collectively Rob and Erin, together with their students and collaborators, have four papers accepted, is demonstrative of the quality of their research. We are very pleased with their success.”
As part of CBCB, Cummings, Molloy (pictured center), and Patro have dual appointments in the University of Maryland Institute for Advanced Computer Studies.
One of Molloy’s papers, “TREE-QMC: Improving quartet graph construction for scalable and accurate species tree estimation from gene trees,” introduces a new method for estimating evolutionary trees.
Reconstructing the evolutionary history of life on Earth is a scientific grand challenge. Parts of this history have proven difficult to resolve because of highly heterogeneous and error-prone input data, Molloy says.
In their paper, Molloy and Yunheng Han, a fourth-year computer science doctoral student, revisit a graph-based approach for building evolutionary trees. Han and Molloy introduce efficient algorithms for graph construction and normalization, making their new method faster and more robust to error and heterogeneity than state-of-the-art methods.
The second paper being presented, “Spectrum preserving tilings enable sparse and modular reference indexing,” focuses on finding ways to efficiently index genomic sequences.
Many DNA and RNA sequence analyses query short sequences of fixed length, termed k-mers, against a collection of known reference sequences, such as human or bacterial genomes. To do so, analyses and algorithms use an index to rapidly find which references contain each k-mer and where in those references it occurs.
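To make the idea of a k-mer index concrete, here is a minimal Python sketch of the simplest possible version: a plain dictionary from each k-mer to the references and offsets where it occurs. This is an illustration only, not the sparse, modular index described in the paper; the function names and the two toy reference sequences are assumptions made up for the example.

```python
from collections import defaultdict


def build_kmer_index(references, k):
    """Map each k-mer to a list of (reference name, 0-based offset) pairs.

    `references` is a dict of name -> sequence. This naive layout stores
    every occurrence explicitly, so it is fast to query but not compact.
    """
    index = defaultdict(list)
    for name, seq in references.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def lookup(index, kmer):
    """Return every (reference, offset) where `kmer` occurs, or an empty list."""
    return index.get(kmer, [])


# Hypothetical toy references, far smaller than real genomes.
references = {"ref1": "ACGTACGT", "ref2": "TTACGTA"}
index = build_kmer_index(references, k=4)
hits = lookup(index, "ACGT")
print(hits)
```

A table like this grows with the total length of the references, which is why real indexes over thousands of genomes need more compact designs.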
In their paper, the researchers show how a class of these indexes can be broken down into two modular pieces, and how these modular indexes can be made both fast and small.
Jason Fan, the paper’s lead author and a fifth-year computer science doctoral student, says that their algorithm allows an index to tune space versus speed trade-offs by sampling positions of certain sub-sequences shared across references. It also has the potential to save scientists some money.
“If one were to modularly compose a state-of-the-art hash function with our compression algorithm, an index built over 30,000 bacterial genomes could be $300 cheaper per month to host on Amazon Web Services,” he explains.
Fan’s coauthors include Patro (pictured left); Jamshed Khan, a fourth-year computer science doctoral student; and Giulio Ermanno Pibiri, an assistant professor of computer science at Ca’ Foscari University of Venice.
Yuelin Liu, a fourth-year computer science doctoral student and visiting fellow at the National Cancer Institute (NCI), is the lead author of a paper that introduces a new tool to reconstruct tumor lineage trees so that scientists can better understand how tumors evolve, potentially leading to more effective and targeted treatments.
In addition to her coauthors at NCI, Liu’s UMD coauthors include Molloy; Xuan Cindy Li, a doctoral student of biological science who is also a visiting fellow at NCI; David R. Crawford, another biological science doctoral student and research fellow at NCI; Stephen Mount, an associate professor of cell biology and molecular genetics; and Eytan Ruppin, formerly a professor of computer science and director of CBCB who is now chief of NCI’s Cancer Data Science Laboratory.
The evolution of a tumor can be modeled as a phylogenetic tree, and the goal of single-cell tumor lineage reconstruction is to trace that history using single-cell sequencing data from the lesions of a cancer patient, says Liu. However, constructing tumor lineage trees from single-cell mutation or copy number data often suffers from data sparsity.
In their paper, the researchers introduce Sgootr, a tool to jointly infer a tumor lineage tree and identify lineage-informative CpG sites from single-cell methylation sequencing data.
Methylation is a biological process by which methyl groups are added to the DNA molecule and change the activity of a DNA segment without changing the sequence. CpG methylation is a widely studied type of epigenetic modification.
Using their tool Sgootr, the team successfully reconstructed tumor lineage trees from both real and simulated single-cell methylation datasets, showing that CpG methylation harbors rich signals to evaluate tumors.
The final paper, “TreeTerminus–Creating transcript trees using inferential replicate counts,” was accepted to RECOMB-Seq, a satellite conference that took place April 14–15. It was coauthored by Noor Pratap Singh, a fourth-year computer science doctoral student; Michael I. Love, an associate professor of biostatistics and genetics at the University of North Carolina at Chapel Hill; and Patro.
Genes and transcripts are the two most common base units of analysis for an RNA-sequencing dataset. While transcripts provide the finest resolution of analysis, it can be hard to estimate the abundance of some of them accurately. Gene-level abundance estimates, on the other hand, are more accurate, but they risk losing the underlying information about which transcripts are responsible for the observed effects.
“Our method tries to get the best of both worlds, by arranging transcripts in a tree-like structure, where we get more confident about the abundance estimates as we climb up the tree,” Singh says. “We then propose a method to find the appropriate nodes from the tree and use those nodes as the base unit of downstream analysis. The nodes represent aggregated transcript sets and are quite often at a level below genes with a more certain abundance estimate.”
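The intuition behind grouping transcripts can be shown with a toy Python sketch. It assumes two hypothetical transcripts whose estimated counts swing in opposite directions across five inferential replicates, because reads are ambiguous between them, and it uses the coefficient of variation as a stand-in measure of uncertainty. Both the numbers and the choice of measure are assumptions for illustration; this is not the criterion TreeTerminus itself uses.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Spread relative to the mean; a simple proxy for estimate uncertainty."""
    return pstdev(values) / mean(values)


# Hypothetical counts for two transcripts across five inferential replicates.
# Reads shared between them shift back and forth, so each is individually noisy.
transcript_a = [12, 28, 20, 41, 3]
transcript_b = [40, 21, 30, 10, 48]

# Aggregating the pair replicate by replicate gives a much steadier total.
group = [a + b for a, b in zip(transcript_a, transcript_b)]

cv_a = coefficient_of_variation(transcript_a)
cv_b = coefficient_of_variation(transcript_b)
cv_group = coefficient_of_variation(group)

print(f"transcript A: {cv_a:.2f}")
print(f"transcript B: {cv_b:.2f}")
print(f"A + B group:  {cv_group:.2f}")
```

In this made-up case the combined group varies far less across replicates than either transcript alone, which mirrors the idea of gaining confidence while moving up the tree.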
The paper by Singh, Love and Patro was recognized with a Best Paper/Talk award at the satellite conference.