Scaling 3D Scene Perception via Probabilistic Programming

Author(s)
Gothoskar, Nishad
Download
Thesis PDF (124.9 MB)
Advisor
Mansinghka, Vikash K.
Tenenbaum, Joshua B.
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Understanding and interpreting the 3D structure of the world is a central challenge in artificial intelligence. Our physical world is 3D, yet our AI systems typically “see” it through pixels and images. To build truly intelligent AI systems, we must go beyond pixels and construct 3D vision systems that produce meaningful, useful 3D representations of the world. This is the problem of 3D scene perception: how do we transform raw visual input into 3D representations of the world? 3D scene perception has numerous applications, from robotics to augmented reality. Despite advances over the last decade, 3D perception remains a major bottleneck in real-world robotics applications. The challenge stems from the immense variability of real-world conditions (e.g., lighting, color, viewpoint, camera properties, and object appearance), the incompleteness of visual data due to limited resolution, noise, and occlusion, and the approximations in our models of visual data. Developing more robust and generalizable 3D perception systems would be an important step toward more general-purpose robotics. In this thesis, we explore a probabilistic architecture for 3D perception based on structured generative models and probabilistic programs. We begin with 3DP3, the first iteration of our approach, which infers 3D scene graphs from real-world depth image data; 3DP3 demonstrates that our method works on real-world benchmarks and can correct commonsense errors made by deep learning systems. Building on this foundation, we develop Bayes3D, which scales these ideas up using a GPU-accelerated image likelihood and generative model together with a parallel coarse-to-fine inference algorithm. Next, we explore two approaches for incorporating RGB image data into generative 3D graphics programs, expanding their applicability. We then introduce DurableVS, which extends inverse-graphics techniques to model scenes involving a robot and multiple cameras, enabling precise robot control. Finally, we present Gen3D, which integrates the key ideas of this thesis into a real-time 3D perception system that uses multi-resolution probabilistic models of 3D matter to enable real-time tracking competitive with vision transformers and 3D Gaussian splatting, state-of-the-art methods in computer vision and computer graphics.
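The recipe the abstract sketches (a generative model that renders candidate scenes, a likelihood that scores renders against an observed depth image, and a coarse-to-fine search over scene hypotheses) can be illustrated in a few lines. The toy model below is a minimal, hypothetical sketch in plain NumPy, not the thesis's actual code: the renderer, the function names (render_depth, log_likelihood, coarse_to_fine), and all noise parameters are invented for illustration.

```python
# Hypothetical sketch of analysis-by-synthesis depth perception: render a
# candidate scene, score it against the observed depth image under a
# per-pixel noise model, and refine the estimate over successively finer
# grids. All names and parameters are illustrative, not from the thesis.
import numpy as np

H, W = 32, 32

def render_depth(pose_xy, size=6, depth=1.0, background=5.0):
    """Toy 'renderer': a flat square object at pose_xy against a far plane."""
    img = np.full((H, W), background)
    x, y = int(pose_xy[0]), int(pose_xy[1])
    img[max(0, y):max(0, y + size), max(0, x):max(0, x + size)] = depth
    return img

def log_likelihood(observed, rendered, sigma=0.1, p_outlier=0.05, max_depth=10.0):
    """Per-pixel mixture: a Gaussian around the rendered depth plus a uniform
    outlier component that absorbs sensor noise and occlusion."""
    gauss = np.exp(-0.5 * ((observed - rendered) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    mix = (1 - p_outlier) * gauss + p_outlier / max_depth
    return float(np.log(mix).sum())

def coarse_to_fine(observed, levels=((8, 4), (2, 4), (0.5, 4))):
    """Grid-search the object pose at decreasing step sizes, recentering each
    level on the previous level's best estimate (a serial stand-in for the
    GPU-parallel candidate scoring the abstract mentions)."""
    best = np.array([W / 2, H / 2])
    for step, radius in levels:
        offsets = np.arange(-radius, radius + 1) * step
        candidates = [best + np.array([dx, dy]) for dx in offsets for dy in offsets]
        best = max(candidates, key=lambda p: log_likelihood(observed, render_depth(p)))
    return best

# Simulate a noisy observation of an object at (20, 10) and recover its pose.
rng = np.random.default_rng(0)
observed = render_depth((20, 10)) + rng.normal(0.0, 0.05, size=(H, W))
print(coarse_to_fine(observed))  # expected: close to [20, 10]
```

The outlier component in the likelihood is what makes this style of scoring tolerant of occlusion and sensor dropout, and because each candidate pose is scored independently, the inner loop parallelizes naturally, which is how (per the abstract) Bayes3D accelerates the same pattern on a GPU over far richer scene representations.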
Date issued
2025-05
URI
https://hdl.handle.net/1721.1/164030
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
