Using Language and Vision to Read Minds

Nicholas Roy, Boris Katz

Safe and controllable autonomous robots must understand and reason about the behavior of the people in their environment and be able to interact with them. For example, an autonomous car must understand the intent of other drivers and the subtle cues that signal their mental state. While much attention has been paid to basic enabling technologies like navigation, sensing, and motion planning, the social interaction between robots and humans has been largely overlooked. As robots grow more capable, this problem grows in magnitude: more capable robots are harder to control and understand manually.

We are addressing this problem by combining language understanding, perception, and planning into a single coherent approach. Language provides the interface that humans use to request actions, explain why an action was taken, inquire about the future, and more. Combining language with perception lets a machine focus its attention on a particular part of the environment, use the environment to disambiguate language and to acquire new linguistic concepts, and much more. Planning provides the final piece of the puzzle. It not only allows machines to follow commands, but also allows them to reason about the intent of other agents in the environment by assuming that those agents, too, are running similar but inaccessible planners, observed only indirectly through their actions and statements. In this way a joint language, vision, and planning approach enables machines to understand the physical and social world around them and to communicate that understanding in natural language.
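The idea of inferring intent by assuming other agents run their own planners can be made concrete as Bayesian inverse planning: maintain a belief over candidate goals and update it as actions are observed, scoring each action by how well it matches what a planner pursuing that goal would do. The sketch below is a minimal, illustrative instance of this technique; the grid world, the greedy stand-in planner, and all names are assumptions for exposition, not the project's actual model.

```python
# Hypothetical candidate goal locations in a small grid world.
GOALS = {"A": (0, 4), "B": (4, 0)}

def planner_action(pos, goal):
    """Stand-in planner: greedily step one cell toward the goal."""
    dx, dy = goal[0] - pos[0], goal[1] - pos[1]
    if abs(dx) >= abs(dy) and dx != 0:
        return (1 if dx > 0 else -1, 0)
    return (0, 1 if dy > 0 else -1)

def likelihood(action, pos, goal, match=0.8, mismatch=0.2):
    """P(action | goal): high when the action matches the planner's choice."""
    return match if action == planner_action(pos, goal) else mismatch

def infer_goal(trajectory):
    """Bayesian update over goals given (position, action) observations."""
    belief = {g: 1.0 / len(GOALS) for g in GOALS}
    for pos, action in trajectory:
        for g, target in GOALS.items():
            belief[g] *= likelihood(action, pos, target)
        total = sum(belief.values())
        belief = {g: p / total for g, p in belief.items()}
    return belief

# An agent starting at (2, 2) keeps stepping toward (0, 4),
# so the belief concentrates on goal "A".
trajectory = [((2, 2), (-1, 0)), ((1, 2), (0, 1)), ((1, 3), (-1, 0))]
belief = infer_goal(trajectory)
```

The key design choice is that the observer never sees the other agent's planner directly; it only scores observed actions against what its own model of that planner would have done, exactly as the paragraph above describes.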

In the first phase of this project, we addressed the problem of enabling a robot to jointly reason about what actions it should take in the future, using what it knows about the workspace from visual observations as well as from a human's language utterances. Present language understanding models cannot acquire knowledge about past events or retain facts from a human that may be relevant later. We introduced a novel probabilistic model that allows us to reason with past context and acquire knowledge over time, via an incremental approach that combines event recognition, semantic parsing, state keeping, and reasoning about the robot's future actions. We demonstrate the model on a Baxter Research Robot.
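The incremental loop described above (recognize an event, parse an utterance, update a persistent state, then consult it when acting) can be sketched as a simple state-keeping object. Everything here is a hedged illustration: the predicate tuples, the `fragile` fact, and the pick-planning rule are assumptions chosen to show the flow of information, not the paper's actual probabilistic model.

```python
class WorldState:
    """Accumulates facts from observed events and human utterances."""

    def __init__(self):
        self.facts = set()  # tuples like ("in", "vase", "left_bin")

    def observe_event(self, event):
        """Update state from a recognized event, e.g. ('put', obj, dest)."""
        verb, obj, dest = event
        if verb == "put":
            # The object moved: retract any stale location facts first.
            self.facts = {f for f in self.facts
                          if not (f[0] == "in" and f[1] == obj)}
            self.facts.add(("in", obj, dest))

    def hear(self, fact):
        """Fold in a parsed human utterance, e.g. ('fragile', 'vase')."""
        self.facts.add(fact)

    def plan_pick(self, obj):
        """Choose a pick action for obj, using the remembered context."""
        location = next((f[2] for f in self.facts
                         if f[0] == "in" and f[1] == obj), None)
        careful = ("fragile", obj) in self.facts
        return {"object": obj, "from": location, "careful": careful}

state = WorldState()
state.observe_event(("put", "vase", "left_bin"))  # seen earlier in the scene
state.hear(("fragile", "vase"))                   # stated by a human
action = state.plan_pick("vase")                  # uses both sources of context
```

The point of the sketch is the dependency structure: the final action depends on a visual event and a linguistic fact acquired at different times, which is exactly what a model without state keeping would lose.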


Video links:

https://youtu.be/VYZz3dJzu0s

https://youtu.be/ZbOmxfox2r0