A University of Maryland expert in machine learning is presenting two papers at a major upcoming conference on natural language processing. Both address how to quickly train AI systems to sift through medical and emergency data.
Jordan Boyd-Graber, an associate professor of computer science with appointments in the University of Maryland Institute for Advanced Computer Studies, the iSchool, and the Language Science Center, will share the research at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). The event takes place November 16–20 and will be held entirely online due to the ongoing COVID-19 pandemic.
The first paper, “Interactive Refinement of Cross-Lingual Word Embeddings,” presents Classifying Interactively with Multilingual Embeddings (CLIME), an interactive system that enhances cross-lingual word embeddings (CLWE). Monolingual word embeddings are pervasive in natural language processing, but CLWE can be used to represent meaning and transfer knowledge across languages.
Boyd-Graber and his team (which includes Michelle Yuan, a fourth-year doctoral student in computer science; Mozhi Zhang, a fifth-year computer science doctoral student; Benjamin Van Durme, an associate professor of computer science at Johns Hopkins University; and Leah Findlater, an associate professor in human centered design and engineering at the University of Washington) note that language technologies sometimes need to be deployed quickly in low-resource languages, that is, languages that lack the reference data used to train the statistical natural language processing tools commonly available in high-resource languages.
For example, during the 2010 earthquake in Haiti, researchers used machine learning models to analyze social media and text messages to gain situational awareness in Haitian Creole. The researchers say CLIME can help in these scenarios: users see which task-related words the system thinks are similar, and then correct the model. In the case of an earthquake, for instance, you want the Haitian Creole word for “search” to be close to “rescue” but not to “webpage,” which helps first responders sort through SMS messages asking for help.
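The idea of pulling “search” toward “rescue” and away from “webpage” can be illustrated with a small Python sketch. This is not the CLIME system or its update rule, which the article does not describe; it is a toy illustration under stated assumptions. The two-dimensional vectors are made up, the key `search_ht` is a hypothetical stand-in for the Haitian Creole word, and `refine` is an invented helper that simply nudges one vector in response to a user's correction.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def neighbors(word, emb):
    """Rank all other words by similarity to `word`, most similar first."""
    return sorted((w for w in emb if w != word),
                  key=lambda w: cosine(emb[word], emb[w]), reverse=True)

def refine(emb, word, closer, farther, step=0.5):
    """Return a copy of `emb` with `word` nudged toward `closer` and away
    from `farther`. Hypothetical update, chosen only for illustration."""
    vec = [x + step * (c - f)
           for x, c, f in zip(emb[word], emb[closer], emb[farther])]
    updated = dict(emb)
    updated[word] = vec
    return updated

# Made-up embeddings: the stand-in word starts out nearest to "webpage".
emb = {
    "search_ht": [0.2, 0.9],
    "rescue": [0.9, 0.3],
    "webpage": [0.1, 1.0],
}

before = neighbors("search_ht", emb)

# The user marks "rescue" as relevant and "webpage" as irrelevant.
refined = refine(emb, "search_ht", closer="rescue", farther="webpage")
after = neighbors("search_ht", refined)

print("before feedback:", before)
print("after feedback: ", after)
```

In this toy setup, a single correction is enough to make “rescue” the nearest neighbor of the stand-in word. A real system would work with high-dimensional embeddings and many rounds of user feedback.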
The second paper, “Cold-start Active Learning through Self-Supervised Language Modeling,” seeks to make labeling data less complicated and less time-consuming. Boyd-Graber collaborated on that paper with Yuan and Hsuan-Tien Lin, an assistant professor of computer science and information engineering at National Taiwan University and chief data science consultant at Appier.
Machine learning models rely on large amounts of labeled data to learn new tasks, but annotations are not always readily available. For example, obtaining labels for medical text is challenging because of privacy issues or a shortage of expertise. And while machine learning methods called neural language models are the foundation of most modern tools for understanding language, these methods are sometimes confused by surprising data, such as text from rapidly changing medical situations.
In this paper, the researchers introduce ALPS, an active learning algorithm that determines which data to label in order to improve text classification: the examples that are most confusing to the neural language model. ALPS reduces labeling time and costs by identifying the information the language model needs.
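The general principle of labeling the most confusing examples first can be sketched in a few lines of Python. This is a generic uncertainty-sampling illustration, not the ALPS algorithm, whose details the article does not give. The example texts and probabilities are made up, and `select_for_labeling` is a hypothetical helper that uses the entropy of a model's predicted probabilities as a stand-in measure of confusion.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution; higher = less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pool, k):
    """Return the k texts whose predicted distributions are most uncertain.

    `pool` maps each unlabeled text to the model's predicted class
    probabilities for it (assumed to be given)."""
    return sorted(pool, key=lambda text: entropy(pool[text]), reverse=True)[:k]

# Made-up predictions for four unlabeled notes (two classes each).
pool = {
    "note A": [0.98, 0.02],  # model is confident
    "note B": [0.55, 0.45],  # model is unsure
    "note C": [0.80, 0.20],
    "note D": [0.50, 0.50],  # model is maximally unsure
}

chosen = select_for_labeling(pool, 2)
print(chosen)  # the two notes to send to a human annotator first
```

Sending only the selected notes to annotators is what saves labeling time: the confident predictions are left alone and human effort goes where the model is least sure.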
“We're presenting new algorithms and frameworks to effectively collect annotations for natural language processing models,” says Yuan, lead author on the papers and a member of the Computational Linguistics and Information Processing (CLIP) Laboratory. “I am excited to present these papers to spark interest in interactive machine learning.”