Lifting 2D Vision Models into Structured Scene Representations
Author(s)
Tang, George
Advisor
Torralba, Antonio
Abstract
Intelligent agents can leverage structured scene representations that capture object compositionality, affordances, and semantics as a world emulator. However, 3D scene data is limited, rendering supervised and self-supervised methods ineffective. Recent 2D foundation models exhibit remarkable performance and generalization. Concurrently, several works have demonstrated lifting the feature maps these models produce into a 3D feature representation. This thesis further explores how lifting can be effectively employed to construct structured scene representations with pixel-level fidelity.
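For intuition, a minimal sketch of the generic lifting idea follows; it is not the thesis implementation, and all function names, inputs, and shapes are assumptions. It averages 2D feature-map values into per-point 3D features by projecting each 3D point into every posed view.

```python
# Illustrative lifting sketch (assumed interfaces, not the thesis method).
import numpy as np

def lift_features(points, feat_maps, intrinsics, world_to_cam):
    """points: (N, 3) world-space points.
    feat_maps: list of (H, W, C) per-view feature maps from a 2D model.
    intrinsics: list of (3, 3) camera matrices K.
    world_to_cam: list of (4, 4) extrinsic matrices.
    Returns (N, C) per-point features averaged over the views that see them."""
    n, c = points.shape[0], feat_maps[0].shape[-1]
    feat_sum = np.zeros((n, c))
    feat_cnt = np.zeros((n, 1))
    pts_h = np.concatenate([points, np.ones((n, 1))], axis=1)  # homogeneous coords

    for fmap, K, T in zip(feat_maps, intrinsics, world_to_cam):
        cam = (T @ pts_h.T).T[:, :3]                 # points in the camera frame
        in_front = cam[:, 2] > 1e-6                  # keep points in front of the camera
        uv = (K @ cam.T).T
        uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)  # perspective divide
        h, w = fmap.shape[:2]
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        feat_sum[valid] += fmap[v[valid], u[valid]]  # gather 2D features at projections
        feat_cnt[valid] += 1

    return feat_sum / np.clip(feat_cnt, 1, None)
```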
Learned scene representations such as NeRF and Gaussian Splatting do not support functionality beyond novel view rendering. The world is compositional: a scene can be described in terms of objects. Correspondingly, we present a lifting solution for efficient open-set 3D instance segmentation of learned scene representations. Compared to previous approaches, our solution is more than an order of magnitude faster and can handle scenes with orders of magnitude more instances.
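The open-set query step can be illustrated with a short, hedged sketch: once each lifted 3D instance carries an aggregated feature vector from a 2D foundation model, a text query selects matching instances by cosine similarity. The inputs `instance_feats`, `text_embed`, and the threshold are assumptions for illustration only.

```python
# Hedged sketch of an open-vocabulary instance query (not the thesis pipeline).
import numpy as np

def query_instances(instance_feats, text_embed, threshold=0.25):
    """instance_feats: (M, C) one aggregated feature vector per 3D instance.
    text_embed: (C,) embedding of the text query (e.g., from a CLIP-style model).
    Returns indices of instances whose cosine similarity exceeds the threshold."""
    inst = instance_feats / np.linalg.norm(instance_feats, axis=1, keepdims=True)
    txt = text_embed / np.linalg.norm(text_embed)
    sims = inst @ txt                       # cosine similarity per instance
    return np.nonzero(sims > threshold)[0]  # indices of matching instances
```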
Toward identifying affordances, we tackle the problem of zero-shot mesh part segmentation. Learning-based mesh segmentation generalizes poorly due to the lack of diverse mesh segmentation datasets, while traditional shape analysis methods are overfitted to existing benchmarks. We present a lifting solution for mesh part segmentation that overcomes these limitations, performing comparably to top shape analysis methods on traditional benchmarks while generalizing much better on a novel mesh dataset curated from an image-to-3D model.
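One way to picture lifting for meshes, shown only as an illustrative sketch rather than the thesis method, is to render the mesh from several views, segment each rendering with a zero-shot 2D model, and vote per-view part labels back onto the visible faces. The face-index and part-label images are hypothetical inputs assumed to come from an external renderer and segmenter.

```python
# Illustrative face-voting sketch for mesh part lifting (assumed inputs).
import numpy as np

def lift_mesh_parts(num_faces, face_id_images, part_label_images, num_parts):
    """face_id_images: list of (H, W) int arrays of visible face indices, -1 = background.
    part_label_images: list of (H, W) int arrays of per-pixel 2D part labels.
    Returns (num_faces,) array assigning each face its most-voted part label."""
    votes = np.zeros((num_faces, num_parts), dtype=np.int64)
    for face_ids, parts in zip(face_id_images, part_label_images):
        visible = face_ids >= 0
        # np.add.at accumulates repeated (face, part) pairs correctly
        np.add.at(votes, (face_ids[visible], parts[visible]), 1)
    return votes.argmax(axis=1)
```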
Beyond feature fields, lifting can be used for a variety of applications, including scene understanding and editing. However, current lifting formulations are inefficient and often introduce unintended modifications. To address these deficiencies, we generalize lifting to semantic lifting, which incorporates per-view masks indicating relevant areas. These masks are determined by querying the corresponding per-view feature maps derived from feature fields. Storing per-view feature maps is impractical, however, and the underlying scene representations can be expensive to store and query. To enable lightweight, on-demand retrieval of pixel-aligned relevance masks, we introduce a Vector Quantized Feature Field. We demonstrate the effectiveness of semantic lifting with our method on complex indoor and outdoor scenes from the LERF dataset.
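A minimal sketch, under assumed interfaces rather than the thesis implementation, of why vector quantization makes per-view relevance masks cheap: instead of storing a full (H, W, C) feature map per view, one can store a small shared (K, C) codebook plus a (H, W) index image, then recover a pixel-aligned relevance mask by scoring only the K codes against the query embedding.

```python
# Sketch of relevance-mask retrieval from a vector-quantized feature map
# (assumed codebook/index-image representation, for illustration only).
import numpy as np

def relevance_mask(codebook, index_image, query_embed, threshold=0.25):
    """codebook: (K, C) quantized feature vectors shared across views.
    index_image: (H, W) int array of per-pixel codebook indices for one view.
    query_embed: (C,) embedding of the semantic query.
    Returns (H, W) boolean mask of pixels relevant to the query."""
    codes = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    q = query_embed / np.linalg.norm(query_embed)
    code_scores = codes @ q                       # score each of the K codes once
    return code_scores[index_image] > threshold   # broadcast code scores to pixels
```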
Date issued
2025-02
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology