Scaling 3D Scene Perception via Probabilistic Programming

Author(s)
Gothoskar, Nishad
Download
Thesis PDF (124.9 MB)
Advisor
Mansinghka, Vikash K.
Tenenbaum, Joshua B.
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Understanding and interpreting the 3D structure of the world is a central challenge in artificial intelligence. Our physical world is 3D, yet our AI systems typically “see” it through pixels and images. To build truly intelligent AI systems, we must go beyond pixels and construct 3D vision systems that produce meaningful, useful 3D representations of the world. This is the problem of 3D scene perception: how do we transform raw visual input into 3D representations of the world? 3D scene perception has numerous applications, from robotics to augmented reality. Despite advances over the last decade, 3D perception remains a major bottleneck in real-world robotics applications. The challenge stems from the immense variability of real-world conditions (e.g., lighting, color, viewpoint, camera properties, and object appearance), the incompleteness of visual data due to limited resolution, noise, and occlusion, and the approximations in our models of visual data. Developing more robust and generalizable 3D perception systems would be an important step toward more general-purpose robotics. In this thesis, we explore a probabilistic architecture for 3D perception based on structured generative models and probabilistic programs. We begin with 3DP3, the first iteration of our approach, which infers 3D scene graphs from real-world depth image data; 3DP3 demonstrates that our method works on real-world benchmarks and can correct commonsense errors made by deep learning systems. Building on this foundation, we develop Bayes3D, which scales these ideas up using a GPU-accelerated image likelihood and generative model together with a parallel coarse-to-fine inference algorithm. Next, we explore two approaches for incorporating RGB image data into generative 3D graphics programs, expanding their applicability. We then introduce DurableVS, which extends inverse-graphics techniques to model scenes involving a robot and multiple cameras, enabling precise robot control. Finally, we present Gen3D, which integrates the key ideas of this thesis into a real-time 3D perception system that uses multi-resolution probabilistic models of 3D matter to enable real-time tracking competitive with vision transformers and 3D Gaussian splatting, state-of-the-art methods in computer vision and computer graphics.
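The recipe the abstract sketches (a generative model that renders candidate scenes, a likelihood that scores renders against an observed depth image, and a coarse-to-fine search over scene hypotheses) can be illustrated in a few lines. The toy model below is a minimal, hypothetical sketch in plain NumPy, not the thesis's actual code: the renderer, the function names (render_depth, log_likelihood, coarse_to_fine), and all noise parameters are invented for illustration.

```python
# Hypothetical sketch of analysis-by-synthesis depth perception: render a
# candidate scene, score it against the observed depth image under a
# per-pixel noise model, and refine the estimate over successively finer
# grids. All names and parameters are illustrative, not from the thesis.
import numpy as np

H, W = 32, 32

def render_depth(pose_xy, size=6, depth=1.0, background=5.0):
    """Toy 'renderer': a flat square object at pose_xy against a far plane."""
    img = np.full((H, W), background)
    x, y = int(pose_xy[0]), int(pose_xy[1])
    img[max(0, y):max(0, y + size), max(0, x):max(0, x + size)] = depth
    return img

def log_likelihood(observed, rendered, sigma=0.1, p_outlier=0.05, max_depth=10.0):
    """Per-pixel mixture: a Gaussian around the rendered depth plus a uniform
    outlier component that absorbs sensor noise and occlusion."""
    gauss = np.exp(-0.5 * ((observed - rendered) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    mix = (1 - p_outlier) * gauss + p_outlier / max_depth
    return float(np.log(mix).sum())

def coarse_to_fine(observed, levels=((8, 4), (2, 4), (0.5, 4))):
    """Grid-search the object pose at decreasing step sizes, recentering each
    level on the previous level's best estimate (a serial stand-in for the
    GPU-parallel candidate scoring the abstract mentions)."""
    best = np.array([W / 2, H / 2])
    for step, radius in levels:
        offsets = np.arange(-radius, radius + 1) * step
        candidates = [best + np.array([dx, dy]) for dx in offsets for dy in offsets]
        best = max(candidates, key=lambda p: log_likelihood(observed, render_depth(p)))
    return best

# Simulate a noisy observation of an object at (20, 10) and recover its pose.
rng = np.random.default_rng(0)
observed = render_depth((20, 10)) + rng.normal(0.0, 0.05, size=(H, W))
print(coarse_to_fine(observed))  # expected: close to [20, 10]
```

The outlier component in the likelihood is what makes this style of scoring tolerant of occlusion and sensor dropout, and because each candidate pose is scored independently, the inner loop parallelizes naturally, which is how (per the abstract) Bayes3D accelerates the same pattern on a GPU over far richer scene representations.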
Date issued
2025-05
URI
https://hdl.handle.net/1721.1/164030
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
