Using Language and Vision to Read Minds

Photo credit: Boris Katz and Nick Roy
Nicholas Roy, Boris Katz

Safe and controllable autonomous robots must understand and reason about the behavior of the people in their environment and be able to interact with them. For example, an autonomous car must understand the intent of other drivers and the subtle cues that signal their mental state. While much attention has been paid to basic enabling technologies like navigation, sensing, and motion planning, the social interaction between robots and humans has been largely overlooked. As robots grow more capable, this problem grows with them, since more capable robots are harder to control manually and harder to understand.

We are addressing this problem by combining language understanding, perception, and planning into a single coherent approach. Language provides the interface that humans use to request actions, explain why an action was taken, inquire about the future, and more. Combining language with perception makes it possible to focus a machine's attention on a particular part of the environment, to use the environment to disambiguate language, and to acquire new linguistic concepts. Planning provides the final piece of the puzzle. It not only allows machines to follow commands, but also allows them to reason about the intent of other agents in the environment by assuming that they too are running similar but inaccessible planners. These planners are observed only indirectly, through the actions and statements of those agents. In this way, a joint language, vision, and planning approach enables machines to understand the physical and social world around them and to communicate that understanding in natural language.
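The idea of inferring intent by assuming that other agents run planners can be illustrated with a small sketch. This is not the project's model; it is a generic Bayesian inverse-planning toy under stated assumptions: a one-dimensional corridor, two hypothetical goals (`door`, `window`), and an agent assumed to choose actions by a softmax over distance to its goal with an assumed rationality parameter `BETA`.

```python
import math

# Hypothetical goals on a 1-D corridor (names and positions are illustrative).
GOALS = {"door": 0, "window": 10}
ACTIONS = {"left": -1, "stay": 0, "right": 1}
BETA = 2.0  # assumed rationality: higher models the agent as closer to optimal


def action_likelihood(position, action, goal_position):
    """P(action | position, goal) under a softmax over negative distance to goal."""
    scores = {
        name: -BETA * abs(position + step - goal_position)
        for name, step in ACTIONS.items()
    }
    normalizer = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[action]) / normalizer


def infer_goal(start, observed_actions):
    """Posterior over goals after watching a sequence of actions (uniform prior)."""
    log_post = {goal: 0.0 for goal in GOALS}
    position = start
    for action in observed_actions:
        for goal, goal_position in GOALS.items():
            log_post[goal] += math.log(
                action_likelihood(position, action, goal_position)
            )
        position += ACTIONS[action]
    # Subtract the maximum before exponentiating for numerical stability.
    max_log = max(log_post.values())
    weights = {g: math.exp(lp - max_log) for g, lp in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# The observer never sees the agent's planner, only its actions.
posterior = infer_goal(start=5, observed_actions=["right", "right", "right"])
```

After three steps to the right from the middle of the corridor, nearly all of the posterior mass falls on `window`. The observer reaches that conclusion only from actions, which mirrors the indirect observation described above.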

In the first phase of this project, we addressed the problem of enabling a robot to reason jointly about what actions it should take in the future and what it knows about the workspace from visual observations and from a human's language utterances. Present language understanding models cannot acquire knowledge about past events or retain facts stated by a human that may become relevant later. We introduced a novel probabilistic model that reasons with past context and acquires knowledge over time. It does so incrementally, combining event recognition, semantic parsing, state keeping, and reasoning about the robot's future actions. We demonstrated the model on a Baxter Research Robot.
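To make the notion of accrued context concrete, here is a deliberately simplified sketch. It is not the probabilistic model described above; it only shows, deterministically, how facts from recognized visual events and from parsed utterances might be kept over time and consulted when a later command refers to the past. The class `ContextStore`, the relation names, and the object identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ContextStore:
    """Accrues facts over time from two assumed sources: language and vision."""

    # Each fact is (time, source, subject, relation, value).
    facts: list = field(default_factory=list)

    def add(self, time, source, subject, relation, value):
        self.facts.append((time, source, subject, relation, value))

    def latest(self, relation, value):
        """Return the most recently recorded subject for a relation/value pair."""
        matches = [f for f in self.facts if f[3] == relation and f[4] == value]
        return max(matches, key=lambda f: f[0])[2] if matches else None


store = ContextStore()
# Fact from a recognized visual event: the human put down block_1.
store.add(1, "vision", "block_1", "put_down_by", "human")
# Fact from a parsed utterance such as "the box on the left is mine".
store.add(2, "language", "block_2", "owner", "human")
# A later visual event.
store.add(3, "vision", "block_3", "put_down_by", "human")

# A command like "pick up the block I just put down" needs past context.
target = store.latest("put_down_by", "human")
```

Here `target` resolves to the object from the most recent matching event. A probabilistic treatment would instead maintain a distribution over such facts, since both event recognition and parsing are uncertain.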

Our ongoing work focuses on enhancing the model with deductive reasoning, with the goal of interpreting instructions that require logical inference over both rules and facts acquired by the robot. Our model also allows the robot to answer queries based on knowledge acquired from the workspace. Further, when there are multiple grounding hypotheses, the robot can ask the human for clarification and use the response to disambiguate an action. We are also exploring the ability to acquire new, previously unknown concepts related to object types by jointly learning a model of attributes described in language and visual attributes derived from detections. We are also working on integrating our model with Toyota's Human Support Robot in a simulation environment.
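The clarification behavior can be sketched as follows. This is an illustrative toy, not the project's implementation: it assumes the robot holds a normalized score for each candidate grounding, asks a question when the top two candidates are closer than an assumed `margin`, and filters the candidates by an attribute named in the human's answer. All object names, attributes, and thresholds are hypothetical.

```python
def normalize(scores):
    """Scale non-negative scores so that they sum to one."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


def needs_clarification(beliefs, margin=0.2):
    """Ask the human when the two best hypotheses are closer than `margin`."""
    ranked = sorted(beliefs.values(), reverse=True)
    return len(ranked) > 1 and ranked[0] - ranked[1] < margin


def apply_answer(beliefs, attributes, answer_attribute):
    """Keep hypotheses consistent with the human's answer, then renormalize."""
    kept = {
        obj: p for obj, p in beliefs.items() if answer_attribute in attributes[obj]
    }
    # If the answer rules out everything, fall back to the prior beliefs.
    return normalize(kept) if kept else beliefs


# Hypothetical grounding scores for the command "pick up the block".
beliefs = normalize({"block_red": 0.45, "block_blue": 0.40, "cup": 0.15})
attributes = {
    "block_red": {"block", "red"},
    "block_blue": {"block", "blue"},
    "cup": {"cup", "white"},
}

asked = needs_clarification(beliefs)
if asked:
    # Assume the human answers the question "which one?" with "the red one".
    beliefs = apply_answer(beliefs, attributes, "red")

chosen = max(beliefs, key=beliefs.get)
```

With these numbers the two blocks are nearly tied, so the robot asks, and the answer leaves `block_red` as the only consistent grounding.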

Video links for demonstration on the Baxter Research Robot


  • Temporal Grounding Graphs for Language Understanding with Accrued Visual-Linguistic Context, Rohan Paul, Andrei Barbu, Sue Felshin, Boris Katz and Nicholas Roy, International Joint Conference on Artificial Intelligence (IJCAI) 2017
  • Grounding with Visual-Linguistic Context, Rohan Paul, Andrei Barbu, Sue Felshin, Boris Katz and Nicholas Roy, Language Grounding in Robotics Workshop at the Annual Meeting of the Association for Computational Linguistics (ACL) 2017

Research Team:

  • Postdocs: Rohan Paul, Subhro Roy. Research Scientist: Andrei Barbu. Senior Research Scientist: Sue Felshin.