Crossing the Vision-Language Boundary

An image showing correspondences between a visual image of a church on a street and the individual spoken words that describe it
Learning to find relationships between visual and spoken inputs
James Glass, Antonio Torralba

Language is our primary means of communication, and speech is one of its most convenient and efficient means of conveyance. Because we can communicate by voice in hands-busy, eyes-busy environments, spoken interaction in the vehicle environment will become increasingly prevalent as cars become more intelligent and continuously perceive (e.g., "listening" and "seeing") their environment. More generally, as humans and cognitive machines interact, it will be important for both parties to be able to speak with each other about things that they perceive in their local environment. This project explores deep learning methods to create models that can automatically learn correspondences between things that a machine can perceive and how people talk about them in ordinary spoken language. The current project has successfully developed models that can learn semantic relationships between objects in images and their corresponding spoken form. The results of this research should have wide-scale applications to future cognitive vehicles and other cognitive machines that operate in the same physical environment as humans.

Our recent research [Harwath et al., NIPS 2016; Harwath et al., ACL 2017; Leidal et al., ASRU 2017] investigates deep learning methods for learning semantic concepts across both audio and visual modalities. Contextually correlated streams of sensor data from multiple modalities (in this case, a visual image accompanied by a spoken audio caption describing that image) are used to train networks capable of discovering patterns in otherwise unlabeled training data. For example, these networks are able to pick out instances of the spoken word "water" from within continuous speech signals and associate them with images containing bodies of water. The networks learn these associations directly from the data, without the use of conventional speech recognition, text transcriptions, or any expert linguistic knowledge whatsoever.
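
As a rough illustration of how paired images and spoken captions can supervise each other without labels, the sketch below computes a margin-based ranking loss for one matched pair and two mismatched "impostor" examples. The paragraph above does not specify the training objective, so the loss form, the dot-product similarity, the margin value, and all function and parameter names here are assumptions made for illustration only.

```python
def dot(u, v):
    """Dot-product similarity between two embedding vectors (assumed measure)."""
    return sum(a * b for a, b in zip(u, v))


def margin_ranking_loss(image_emb, audio_emb, impostor_image, impostor_audio,
                        margin=1.0):
    """Hypothetical ranking loss for one matched image/caption pair.

    The matched pair should score at least `margin` higher than either
    mismatched pairing; any shortfall is counted as loss.
    """
    matched = dot(image_emb, audio_emb)
    # Penalty when the image is too similar to a caption of another image.
    loss_audio = max(0.0, dot(image_emb, impostor_audio) - matched + margin)
    # Penalty when the caption is too similar to an unrelated image.
    loss_image = max(0.0, dot(impostor_image, audio_emb) - matched + margin)
    return loss_audio + loss_image


# Toy 2-D embeddings: a well-separated pair and a poorly separated one.
well_separated = margin_ranking_loss([1.0, 0.0], [1.0, 0.0],
                                     [0.0, 1.0], [0.0, 1.0])
poorly_separated = margin_ranking_loss([1.0, 0.0], [0.5, 0.5],
                                       [0.0, 1.0], [0.0, 1.0])
print(well_separated, poorly_separated)
```

In this toy setting the well-separated pair incurs no loss, while the poorly separated pair is penalized. The only supervision used is the knowledge of which caption was recorded for which image.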

We have been using the MIT Places image corpus, augmented with spoken descriptions of the images that we collected via crowdsourcing, to learn a multi-modal embedding space using a neural network architecture. We have explored several different models and have been quantifying their performance on image search and annotation retrieval tasks: given an image or a spoken image description, the task is to find its matching counterpart in a set of 1000 candidates. Currently, we achieve results similar to those of other models that have been trained on text captions, even though our models are provided only with unannotated audio signals. Thus, they could potentially operate on any spoken language in the world without needing any additional annotation.
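
The retrieval evaluation described above can be sketched as follows, assuming that images and spoken captions have already been mapped into a shared embedding space. The text does not name the metric or the similarity measure, so recall at K, cosine similarity, and the names `recall_at_k`, `audio_emb`, and `image_emb` are assumptions for illustration; the toy data uses 3 candidates rather than 1000.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two embedding vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose true match ranks in the top k candidates.

    queries[i] and candidates[i] are assumed to form the matching pair.
    """
    hits = 0
    for i, query in enumerate(queries):
        scores = [cosine(query, cand) for cand in candidates]
        # Rank of the true match = number of candidates scoring strictly higher.
        rank = sum(1 for s in scores if s > scores[i])
        if rank < k:
            hits += 1
    return hits / len(queries)


# Toy example: spoken-caption embeddings used as queries over image embeddings.
image_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_emb = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]

print(recall_at_k(audio_emb, image_emb, k=1))
print(recall_at_k(audio_emb, image_emb, k=2))
```

Swapping the two arguments evaluates the opposite direction, that is, retrieving the spoken description for a given image. In this toy data the third caption is closer to the second image than to its own, so it counts as a miss at K = 1 but as a hit at K = 2.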

In ongoing work, we plan to extend these models to video with an accompanying audio channel, and to explore datasets that focus more closely on semantic concepts relevant to driving or interacting with robots.



  • D. Harwath and J. Glass, “Learning Word-Like Units from Joint Audio-Visual Analysis,” Proc. Association for Computational Linguistics, 2017, pp. 506–517 [Online]. Available: