Crossing the Vision-Language Boundary for Contextual Human-Vehicle Interaction

Figure: correspondences between a visual image of a church on a street and the individual spoken words; learning to find relationships between visual and spoken inputs.
James Glass, Antonio Torralba

Language is our primary means of communication, and speech is one of its most convenient and efficient modes of conveyance.  Because voice allows us to communicate in hands-busy, eyes-busy environments, spoken interaction in the vehicle will become increasingly prevalent as cars become more intelligent and continuously perceive (e.g., ``listen'' to and ``see'') their environment.

As they travel together, the vehicle and its human occupants share a common context of the world outside.  In addition to having sensors to perceive this world, a cognitive car will need to be able to converse with humans about this shared environment through spoken language.  To achieve this capability, the vehicle must be able to convert information between the perceptual and linguistic worlds, a form of cross-modal understanding.  Such an ability will open up a whole new realm of interaction opportunities, as the vehicle and the occupants can communicate about safety issues (``watch out for the debris in the road''), navigation (``there is a gas station right after the McDonalds on your right''), or general search (``what is the name of that tall building on the horizon?'').  Communication will need to be bidirectional; the vehicle must be able to convey perceptual information in spoken form to the occupants, and relate spoken input from the occupants to its visual perception of the local environment.

The research in this project will focus on a problem we call language-guided object detection, which we define as the ability to take unconstrained spoken language input and identify an arbitrary object in a general, real-world visual scene.  Such an ability would enable a vehicle occupant to direct the attention of the vehicle to a particular object in the local environment.  Conversely, the ability to generate an appropriate linguistic description of an arbitrary object in a visual scene will be useful for communicating with vehicle occupants.  A related question is whether joint information from the spoken and visual signals can help improve understanding.  The research aims to develop deep learning models that establish correspondences between visual and linguistic spaces using real-world visual and speech data.  The results of this research should have wide-scale applications to future cognitive vehicles and other cognitive machines that operate in the same physical environment as humans.

In our work to date we have focused on collecting data that can be used for experimentation.  To this end we have created crowdsourcing tasks to record spoken descriptions of images.  We have collected 40K recordings of read descriptions from the Flickr8k corpus, and have an ongoing data collection effort devoted to collecting spontaneous spoken descriptions of the MIT Places corpus.

We have performed two different kinds of experiments with these data.  With the Flickr8k recordings, we have experimented with adapting a speech recognizer's language model using the word predictions produced by analyzing an image with a neural caption-generation model.  By decoding with this adapted language model, and additionally rescoring recognizer hypotheses with the image-based language model, we are able to reduce the relative word error rate on the Flickr8k test set by 20-25%.
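The adaptation-and-rescoring idea can be illustrated with a minimal sketch.  All vocabularies, probabilities, weights, and hypotheses below are toy stand-ins (the project's actual recognizer and caption model are not shown): a base unigram language model is interpolated with image-predicted word probabilities, and an n-best list is then rescored with the adapted model.

```python
# Hypothetical sketch: interpolating a base unigram LM with word
# probabilities predicted from an image, then rescoring n-best
# recognizer hypotheses.  All numbers here are illustrative.
import math

def adapt_lm(base_lm, image_lm, lam=0.5):
    """Linearly interpolate a base unigram LM with image-predicted word
    probabilities (both dicts mapping word -> probability)."""
    vocab = set(base_lm) | set(image_lm)
    floor = 1e-9  # tiny floor for unseen words
    return {w: (1 - lam) * base_lm.get(w, floor) + lam * image_lm.get(w, floor)
            for w in vocab}

def rescore(hypotheses, adapted_lm, lm_weight=1.0):
    """Each hypothesis is (acoustic_log_score, [words]); add the adapted
    LM log-probability and return hypotheses sorted best-first."""
    def total(h):
        acoustic, words = h
        lm = sum(math.log(adapted_lm.get(w, 1e-9)) for w in words)
        return acoustic + lm_weight * lm
    return sorted(hypotheses, key=total, reverse=True)

# Toy example: the image model strongly favors "dog", which flips the
# ranking of two acoustically near-tied hypotheses.
base = {"dog": 0.01, "fog": 0.01, "runs": 0.02}
image = {"dog": 0.30, "runs": 0.10}  # predicted from the picture
adapted = adapt_lm(base, image, lam=0.5)
nbest = [(-5.0, ["fog", "runs"]), (-5.1, ["dog", "runs"])]
best = rescore(nbest, adapted)[0]
```

In this toy case the image evidence overrides the slightly better acoustic score, so the ``dog'' hypothesis wins after rescoring.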

We have been using the Places audio-image recording pairs to learn a multi-modal embedding space with a neural network architecture.  We have explored several different models, and have quantified their performance on image search and annotation retrieval tasks, where, given an image or a spoken image description, the task is to find its matching pair out of a set of 1000 candidates.  We currently achieve results comparable to models trained on text captions, even though our models are provided only with unannotated audio signals.
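The retrieval evaluation itself is straightforward to sketch.  The embeddings below are random stand-ins (the real audio and image encoders are not shown): each audio query is ranked against all candidate images by cosine similarity, and we report the fraction of queries whose true match appears in the top k.

```python
# Hypothetical sketch: recall@k evaluation of a multi-modal embedding
# space.  Rows of `imgs` and `aud` are treated as matched pairs; the
# embeddings here are synthetic, not from the project's encoders.
import numpy as np

def recall_at_k(images, audio, k=10):
    """images, audio: (N, d) L2-normalized embeddings where row i of
    each matrix is a matched pair.  Returns the fraction of audio
    queries whose true image is ranked in the top k by cosine
    similarity."""
    sims = audio @ images.T                  # (N, N) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]  # k best images per query
    hits = [i in topk[i] for i in range(len(audio))]
    return float(np.mean(hits))

rng = np.random.default_rng(0)
# Correlated toy pairs: audio embedding = image embedding + noise.
imgs = rng.normal(size=(1000, 64))
aud = imgs + 0.5 * rng.normal(size=(1000, 64))
imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)
aud /= np.linalg.norm(aud, axis=1, keepdims=True)
r10 = recall_at_k(imgs, aud, k=10)
```

Because each toy audio vector is a noisy copy of its paired image vector, recall@10 over the 1000 candidates is high; with unrelated embeddings it would hover near the chance level of k/N.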

In ongoing work, we plan to explore methods for establishing correspondences between individual objects in an image and their spoken ``word'' counterparts in the associated recorded description.
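One plausible way to frame this correspondence problem, sketched here with synthetic features (shapes, encoders, and the ``matchmap'' name are our illustrative assumptions, not the project's finished method), is to compare a spatial grid of image embeddings against per-frame audio embeddings, producing a similarity volume in which each cell scores one image location against one moment in the speech.

```python
# Hypothetical sketch: a similarity "matchmap" between spatial image
# features and per-frame audio features.  All features are random
# stand-ins for encoder outputs.
import numpy as np

def matchmap(image_feats, audio_feats):
    """image_feats: (H, W, d) grid of embeddings; audio_feats: (T, d),
    one embedding per audio frame.  Returns an (H, W, T) volume where
    cell (h, w, t) is the dot-product similarity between image
    location (h, w) and audio frame t."""
    return np.einsum('hwd,td->hwt', image_feats, audio_feats)

rng = np.random.default_rng(1)
img = rng.normal(size=(14, 14, 128))   # e.g., a conv feature map
aud = rng.normal(size=(50, 128))       # e.g., 50 audio frames
mm = matchmap(img, aud)

# For each spoken frame, the best-matching image location:
locs = [np.unravel_index(np.argmax(mm[:, :, t]), mm.shape[:2])
        for t in range(mm.shape[2])]
```

Peaks in such a volume could then be read as candidate object-word alignments: a run of audio frames that all point at the same image region suggests a spoken word grounded at that location.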