
Ramani Duraiswami, a professor of computer science at the University of Maryland, has been selected to lead one of three research topics at a workshop organized by Johns Hopkins University’s Center for Language and Speech Processing (CLSP).
The event, part of a series that has been held annually for more than 30 years, will take place in Brno, Czechia, from June 23 to August 1. It will be co-hosted by the Brno University of Technology and Phonexia, a speech recognition software company.
Selected through a competitive bidding process, Duraiswami’s project focuses on bridging the gap between the potential and current capabilities of Large Audio-Language Models (LALMs), which process speech, sound and music inputs alongside language.
His team’s prior work in the field includes building their own LALM and developing the first-ever open-source benchmark tailored for multimodal audio understanding. Using the benchmark, they tested LALMs from companies like Google and OpenAI and revealed significant limitations, with even state-of-the-art models achieving only 53% accuracy on complex audio reasoning tasks.
This deficiency arises because research on audio-based AI has lagged behind research on modalities such as language and vision, primarily because of the lack of large training datasets and of benchmarks for assessing advanced audio processing capabilities.
This workshop provides an opportunity to address these challenges by fostering collaboration between students and researchers from various disciplines. Participants from several universities and industries in the U.S., Europe and Asia will spend six weeks working together to produce improved learning architectures, training methodologies, and benchmarks.
“Our ultimate goal is to develop a model that can analyze nuanced details in audio,” says Duraiswami, who has an appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). “For example, if it listens to a meeting, it should detect emotion in the voices, understand who's taking turns, and distinguish who’s aggressive.”
Duraiswami will be conducting the workshop alongside Santosh Kesiraju, a Ph.D. researcher from Brno University of Technology; Alicia Lozano Diez, an assistant professor at Universidad Autonoma de Madrid; and Leibny Paola Garcia, an assistant research scientist from Johns Hopkins University. The team also includes four UMD graduate students co-advised by Duraiswami and Dinesh Manocha, a Distinguished University Professor from UMD’s computer science department: Sreyan Ghosh, Sonal Kumar, Lasha Koroshinadze, and Sakshi, all of whom played a major role in Duraiswami’s prior research on LALMs.
Duraiswami and Manocha are both part of the University of Maryland Center for Machine Learning.
The CLSP workshop is preceded by a two-week summer school in Human Language Technology, which will feature two lecturers and a hands-on lab exercise each day. Throughout the six-week program, the participants will also benefit from guest lectures, seminars, and research updates from all three research teams.
Over their 30-year history, the CLSP workshops have had a significant impact on research and industrial applications through the papers, software, and data that they produce.
Duraiswami expects the workshop to result in significant improvement in LALM performance while fostering new research collaborations. The outcomes, including open-source implementations and public datasets and benchmarks, are expected to benefit the broader research community.
—Story by Aleena Haroon, UMIACS communications group