A team of researchers from the University of Maryland has received a best paper award for their work to make visual question answering (VQA) systems more effective for people with visual impairments.
“What’s Different between Visual Question Answering for Machine ‘Understanding’ Versus for Accessibility?” aims to improve VQA systems by nudging them toward a more human-centric model where the goal is to answer questions that are likely to be useful to potential blind and visually impaired users, as opposed to questions written to test how well a machine “understands” images.
The paper was authored by Yang (Trista) Cao, a fifth-year computer science doctoral student; Kyle Seelman, a third-year computer science doctoral student; Kyungjun Lee, a seventh-year computer science doctoral student; and Hal Daumé III, a professor of computer science with appointments in the University of Maryland Institute for Advanced Computer Studies and the Language Science Center.
Cao and Seelman are members of the Computational Linguistics and Information Processing (CLIP) Laboratory while Lee is a member of the Human-Computer Interaction Lab (HCIL). Daumé is active in both CLIP and HCIL.
Their paper will be recognized at the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (AACL-IJCNLP), which will be held online from November 20 to 23.
In VQA, a machine must answer a question about an associated image. For example, it might be shown a picture of a man playing basketball and asked which sport is depicted.
This is easy for sighted humans to do, but more challenging for a computer because it requires the integration of both computer vision and natural language processing algorithms. To solve this task, a robust machine learning model requires a more general understanding of images—that is, the system must be able to answer completely different questions about an image and address various sections of it.
Accessibility researchers have explored whether VQA systems could be deployed in a real-world setting where visually impaired users learn about their environment by capturing their visual surroundings and then asking questions. However, most of the existing benchmarking datasets for VQA focus on machine-learning-based “understanding” and it remains unclear how progress on those datasets translates to improving accessibility.
The research team aims to answer this question by running a variety of VQA models on both machine understanding datasets and accessibility datasets, then examining the discrepancies between the two.
They selected seven VQA models and two datasets to analyze: one built to measure machine understanding, and one collected from visually impaired people to improve accessibility.
The researchers found that advances in model architecture on machine understanding VQA also improve performance on the accessibility task, but that the performance gap between the two remains significant and is growing. This widening gap in accuracy indicates that it is challenging to adapt model architectures developed for machine understanding to assist visually impaired people.
They also show that there may be a significant overfitting effect, where substantial model improvements on machine “understanding” VQA translate only into modest improvements in accessibility VQA.
“This suggests that if the research community continues to only hill-climb on challenge datasets like VQA-v2, we run the risk of ceasing to make any progress on a pressing human-centered application of this technology, and, in the worst case, could degrade performance,” the researchers state in the paper.
Looking into the errors, the team found that the models struggle most with questions that require text recognition skills and with ambiguous questions. They suggest that future work pay more attention to these questions in both data collection and model design.
Additionally, they say automatic evaluation of VQA systems is reaching its limit, and that it would make sense to soon start including blind and low-vision users in evaluating these systems, in situations that resemble real-world use.
—Story by Melissa Brachfeld