Analysis by Synthesis Revisited: Visual Scene Understanding by Integrating Probabilistic Programs and Deep Learning

Joshua Tenenbaum

Everyday tasks like driving a car, preparing a meal, or watching a lecture depend on the brain’s ability to compute mappings from raw sensory inputs to representations of the world. Despite the high dimensionality of the sensor data, humans can easily compute structured representations such as objects moving in three-dimensional space and agents with multi-scale plans and goals. How are these mappings computed? How are they computed so well and so quickly, in real time? And how can we capture these abilities in machines?

Representation learning provides a task-oriented approach to obtaining abstract features from raw data, making such methods suitable for tasks such as object recognition. However, agents acting in 3D environments need to represent the world beyond statistical vectorized feature representations. The internal representations need to be lifted to an explicit 3D space with physical interactions and should be able to extrapolate to radically new viewpoints and other non-linear transformations. In this proposal, we aim to close this representational gap by integrating powerful deep learning techniques (recognition networks) with rich probabilistic generative models (inverse graphics) of generic object classes. This paradigm can be termed ‘analysis-by-synthesis’, where models or theories about the world are learned in order to interpret perceptual observations.
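As a rough illustration of the analysis-by-synthesis loop described above, the following minimal sketch pairs a fast bottom-up guess with top-down refinement against a generative model. It is a toy under stated assumptions, not the project's method: the "scene" is a 1D image containing one Gaussian blob, the latent is the blob position, a brightest-pixel lookup stands in for a trained recognition network, and Metropolis sampling with a flat prior stands in for inference in a probabilistic program. All names (`render`, `recognize`, `infer`) and parameter values are hypothetical.

```python
import math
import random

WIDTH = 32  # number of pixels in the toy 1D "image" (illustrative choice)


def render(position, width=WIDTH, blob_sigma=2.0):
    """Toy generative model (synthesis): a Gaussian blob centred at `position`."""
    return [math.exp(-0.5 * ((x - position) / blob_sigma) ** 2) for x in range(width)]


def log_likelihood(observed, position, noise_sigma=0.1):
    """Log-likelihood of the image given the latent, up to a constant,
    assuming independent Gaussian pixel noise."""
    predicted = render(position, width=len(observed))
    squared_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return -squared_error / (2.0 * noise_sigma ** 2)


def recognize(observed):
    """Stand-in for a recognition network: index of the brightest pixel."""
    return float(max(range(len(observed)), key=lambda x: observed[x]))


def infer(observed, steps=500, step_size=0.5, seed=0):
    """Metropolis sampling over the latent position (flat prior),
    initialised at the recognition guess. Returns the best sample seen."""
    rng = random.Random(seed)
    current = recognize(observed)
    current_ll = log_likelihood(observed, current)
    best, best_ll = current, current_ll
    for _ in range(steps):
        # Symmetric Gaussian random-walk proposal
        proposal = current + rng.gauss(0.0, step_size)
        proposal_ll = log_likelihood(observed, proposal)
        # Accept with probability min(1, likelihood ratio)
        accept = (proposal_ll >= current_ll
                  or rng.random() < math.exp(proposal_ll - current_ll))
        if accept:
            current, current_ll = proposal, proposal_ll
            if current_ll > best_ll:
                best, best_ll = current, current_ll
    return best


if __name__ == "__main__":
    true_position = 13.4
    noise = random.Random(1)
    image = [pixel + noise.gauss(0.0, 0.05) for pixel in render(true_position)]
    print("recognition guess:", recognize(image))
    print("refined estimate:", round(infer(image), 2))
```

In this sketch the recognition step only supplies a starting point, so the sampler spends its steps refining a plausible hypothesis rather than searching from scratch. That division of labour is the intuition behind combining recognition networks with generative models, though a real system would use a learned network and a far richer renderer.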

The availability of ImageNet and other large datasets facilitated rapid innovation in the field of deep learning. Similarly, the Arcade Learning Environment (ALE), built on Atari 2600 games, led to considerable progress in deep reinforcement learning. To train our models, we need not only the computational tools to scale up our approach but also rich 3D learning environments. We plan to develop and release a 3D game engine built upon the Unreal Engine. It will be a centralized test bed for evaluating our models on a variety of tasks. These include inferring 3D generative object models from 2D viewpoints, observing a sequence of frames and then predicting ‘what happens next’, and learning to play a 3D maze game using the newly acquired knowledge of 3D representations. As computer graphics come closer to producing photorealistic scenes, models trained in such environments will stand a solid chance of being applied to the real world with only minimal fine-tuning at ‘test’ time.

Publications:

Y. Li, T. Lin, K. Yi, D. M. Bear, D. L. K. Yamins, J. Wu, J. B. Tenenbaum, and A. Torralba, “Visual Grounding of Learned Physical Models,” in ICML 2020, 2020 [Online]. Available: https://icml.cc/Conferences/2020/ScheduleMultitrack?event=6550
A. Kloss, M. Bauza, J. Wu, J. B. Tenenbaum, A. Rodriguez, and J. Bohg, “Accurate Vision-based Manipulation through Contact Reasoning,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 2020, pp. 6738–6744, doi: 10.1109/ICRA40945.2020.9197409 [Online]. Available: https://doi.org/10.1109/ICRA40945.2020.9197409
I. Yildirim, M. Belledonne, W. Freiwald, and J. Tenenbaum, “Efficient inverse graphics in biological face processing,” Science Advances, vol. 6, no. 10, Mar. 2020, doi: 10.1126/sciadv.aax5979. [Online]. Available: https://doi.org/10.1126/sciadv.aax5979
A. Barbu, D. Mayo, J. Alverio, W. Luo, C. Wang, D. Gutfreund, J. Tenenbaum, and B. Katz, “ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models,” in NeurIPS 2019, 2019 [Online]. Available: https://papers.nips.cc/paper/9142-objectnet-a-large-scale-bias-controlled-dataset-for-pushing-the-limits-of-object-recognition-models
J.-Y. Zhu, Z. Zhang, C. Zhang, J. Wu, A. Torralba, J. B. Tenenbaum, and W. T. Freeman, “Visual Object Networks: Natural Image Generation with Disentangled 3D Representation,” in NIPS 2018, 2018 [Online]. Available: http://papers.nips.cc/paper/7297-visual-object-networks-image-generation-with-disentangled-3d-representations.pdf
X. Zhang, Z. Zhang, C. Zhang, J. B. Tenenbaum, W. T. Freeman, and J. Wu, “Learning to Reconstruct Shapes from Unseen Classes,” in NIPS 2018, 2018 [Online]. Available: https://papers.nips.cc/paper/7494-learning-to-reconstruct-shapes-from-unseen-classes.pdf
Y. Du, Z. Liu, H. Basevi, A. Leonardis, W. T. Freeman, J. B. Tenenbaum, and J. Wu, “Learning to Exploit Stability for 3D Scene Parsing,” in NIPS 2018, 2018 [Online]. Available: https://papers.nips.cc/paper/7444-learning-to-exploit-stability-for-3d-scene-parsing.pdf
K. Yi, C. Gan, P. Kohli, A. Torralba, J. B. Tenenbaum, and J. Wu, “Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding,” in NIPS 2018, 2018 [Online]. Available: https://papers.nips.cc/paper/7381-neural-symbolic-vqa-disentangling-reasoning-from-vision-and-language-understanding.pdf
S. Yao, T. M. H. Hsu, J.-Y. Zhu, J. Wu, A. Torralba, W. T. Freeman, and J. B. Tenenbaum, “3D-Aware Scene Manipulation via Inverse Graphics,” in NIPS 2018, 2018 [Online]. Available: https://papers.nips.cc/paper/7459-3d-aware-scene-manipulation-via-inverse-graphics.pdf
A. Ajay, J. Wu, N. Fazeli, M. Bauza, L. P. Kaelbling, J. B. Tenenbaum, and A. Rodriguez, “Augmenting Physical Simulators with Stochastic Neural Networks: Case Study of Planar Pushing and Bouncing,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, 2018, pp. 3066–3073, doi: 10.1109/IROS.2018.8593995 [Online]. Available: https://doi.org/10.1109/IROS.2018.8593995
S. Wang, J. Wu, X. Sun, W. Yuan, W. T. Freeman, J. B. Tenenbaum, and E. H. Adelson, “3D Shape Perception from Monocular Vision, Touch, and Shape Priors,” in IROS 2018, 2018, doi: 10.1109/IROS.2018.8593430 [Online]. Available: https://doi.org/10.1109/IROS.2018.8593430
X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman, “Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling,” in IEEE CVPR 2018, 2018 [Online]. Available: https://doi.org/10.1109/CVPR.2018.00314
J. Wu, C. Zhang, X. Zhang, Z. Zhang, W. T. Freeman, and J. B. Tenenbaum, “Learning Shape Priors for Single-View 3D Completion And Reconstruction,” in Computer Vision – ECCV 2018, Cham, 2018, vol. 11215, pp. 673–691 [Online]. Available: https://doi.org/10.1007/978-3-030-01252-6_40. [Accessed: 16-Sep-2019]
Z. Zhang, Q. Li, Z. Huang, J. Wu, J. Tenenbaum, and B. Freeman, “Shape and Material from Sound,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 1278–1288 [Online]. Available: http://papers.nips.cc/paper/6727-shape-and-material-from-sound.pdf
J. Wu, Y. Wang, T. Xue, X. Sun, B. Freeman, and J. Tenenbaum, “MarrNet: 3D Shape Reconstruction via 2.5D Sketches,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 540–550 [Online]. Available: http://papers.nips.cc/paper/6657-marrnet-3d-shape-reconstruction-via-25d-sketches.pdf
J. Wu, E. Lu, P. Kohli, B. Freeman, and J. Tenenbaum, “Learning to See Physics via Visual De-animation,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 152–163 [Online]. Available: http://papers.nips.cc/paper/6620-learning-to-see-physics-via-visual-de-animation.pdf
M. Janner, J. Wu, T. D. Kulkarni, I. Yildirim, and J. Tenenbaum, “Self-Supervised Intrinsic Image Decomposition,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 5938–5948 [Online]. Available: http://papers.nips.cc/paper/7175-self-supervised-intrinsic-image-decomposition.pdf
J. Wu, J. B. Tenenbaum, and P. Kohli, “Neural Scene De-Rendering,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, Honolulu, Hawaii, 2017 [Online]. Available: https://doi.org/10.1109/CVPR.2017.744
A. A. Soltani, H. Huang, J. Wu, T. D. Kulkarni, and J. B. Tenenbaum, “Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes with Deep Generative Networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, Honolulu, Hawaii, 2017 [Online]. Available: https://doi.org/10.1109/CVPR.2017.269
J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. Tenenbaum, “Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling,” in Advances in Neural Information Processing Systems 29, Barcelona, Spain, 2016, pp. 82–90 [Online]. Available: http://papers.nips.cc/paper/6096-learning-a-probabilistic-latent-space-…
J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman, “Single Image 3D Interpreter Network,” in Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI, Cham, 2016, pp. 365–382 [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46466-4_22

News:

Computer systems predict objects’ responses to physical forces

Videos:

Toyota - CSAIL Joint Research Center
