Lifting 2D Vision Models into Structured Scene Representations
Author(s)
Tang, George
Advisor
Torralba, Antonio
Abstract
Intelligent agents can leverage structured scene representations that capture object compositionality, affordances, and semantics as a world emulator. However, 3D scene data is limited, rendering supervised and self-supervised methods ineffective. Recent 2D foundation models exhibit remarkable performance and generalization. Concurrently, several works have demonstrated lifting the feature maps these models produce into a 3D feature representation. This thesis further explores how lifting can be effectively employed to construct structured scene representations with pixel-level fidelity.
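For intuition, a minimal sketch of the generic lifting idea follows; it is not the thesis implementation, and all function names, inputs, and shapes are assumptions. It averages 2D feature-map values into per-point 3D features by projecting each 3D point into every posed view.

```python
# Illustrative lifting sketch (assumed interfaces, not the thesis method).
import numpy as np

def lift_features(points, feat_maps, intrinsics, world_to_cam):
    """points: (N, 3) world-space points.
    feat_maps: list of (H, W, C) per-view feature maps from a 2D model.
    intrinsics: list of (3, 3) camera matrices K.
    world_to_cam: list of (4, 4) extrinsic matrices.
    Returns (N, C) per-point features averaged over the views that see them."""
    n, c = points.shape[0], feat_maps[0].shape[-1]
    feat_sum = np.zeros((n, c))
    feat_cnt = np.zeros((n, 1))
    pts_h = np.concatenate([points, np.ones((n, 1))], axis=1)  # homogeneous coords

    for fmap, K, T in zip(feat_maps, intrinsics, world_to_cam):
        cam = (T @ pts_h.T).T[:, :3]                 # points in the camera frame
        in_front = cam[:, 2] > 1e-6                  # keep points in front of the camera
        uv = (K @ cam.T).T
        uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)  # perspective divide
        h, w = fmap.shape[:2]
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        feat_sum[valid] += fmap[v[valid], u[valid]]  # gather 2D features at projections
        feat_cnt[valid] += 1

    return feat_sum / np.clip(feat_cnt, 1, None)
```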
Learned scene representations such as NeRF and Gaussian Splatting do not support functionality beyond novel view rendering. The world is compositional: a scene can be described in terms of objects. Correspondingly, we present a lifting solution for efficient open-set 3D instance segmentation of learned scene representations. Compared to previous approaches, our solution is more than an order of magnitude faster and can handle scenes with orders of magnitude more instances.
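The open-set query step can be illustrated with a short, hedged sketch: once each lifted 3D instance carries an aggregated feature vector from a 2D foundation model, a text query selects matching instances by cosine similarity. The inputs `instance_feats`, `text_embed`, and the threshold are assumptions for illustration only.

```python
# Hedged sketch of an open-vocabulary instance query (not the thesis pipeline).
import numpy as np

def query_instances(instance_feats, text_embed, threshold=0.25):
    """instance_feats: (M, C) one aggregated feature vector per 3D instance.
    text_embed: (C,) embedding of the text query (e.g., from a CLIP-style model).
    Returns indices of instances whose cosine similarity exceeds the threshold."""
    inst = instance_feats / np.linalg.norm(instance_feats, axis=1, keepdims=True)
    txt = text_embed / np.linalg.norm(text_embed)
    sims = inst @ txt                       # cosine similarity per instance
    return np.nonzero(sims > threshold)[0]  # indices of matching instances
```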
Toward identifying affordances, we tackle the problem of zero-shot mesh part segmentation. Learning-based mesh segmentation generalizes poorly due to the lack of diverse mesh segmentation datasets, while traditional shape analysis methods are overfitted to existing benchmarks. We present a lifting solution for mesh part segmentation that overcomes these limitations, performing comparably to top shape analysis methods on traditional benchmarks while generalizing much better on a novel mesh dataset curated from an image-to-3D model.
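One way to picture lifting for meshes, shown only as an illustrative sketch rather than the thesis method, is to render the mesh from several views, segment each rendering with a zero-shot 2D model, and vote per-view part labels back onto the visible faces. The face-index and part-label images are hypothetical inputs assumed to come from an external renderer and segmenter.

```python
# Illustrative face-voting sketch for mesh part lifting (assumed inputs).
import numpy as np

def lift_mesh_parts(num_faces, face_id_images, part_label_images, num_parts):
    """face_id_images: list of (H, W) int arrays of visible face indices, -1 = background.
    part_label_images: list of (H, W) int arrays of per-pixel 2D part labels.
    Returns (num_faces,) array assigning each face its most-voted part label."""
    votes = np.zeros((num_faces, num_parts), dtype=np.int64)
    for face_ids, parts in zip(face_id_images, part_label_images):
        visible = face_ids >= 0
        # np.add.at accumulates repeated (face, part) pairs correctly
        np.add.at(votes, (face_ids[visible], parts[visible]), 1)
    return votes.argmax(axis=1)
```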
Beyond feature fields, lifting can be used for a variety of applications, including scene understanding and editing. However, current lifting formulations are inefficient and often introduce unintended modifications. To address these deficiencies, we generalize lifting to semantic lifting, which incorporates per-view masks indicating relevant areas. These masks are determined by querying the corresponding per-view feature maps derived from feature fields. Storing per-view feature maps is impractical, however, and the underlying scene representations can be expensive to store and query. To enable lightweight, on-demand retrieval of pixel-aligned relevance masks, we introduce a Vector Quantized Feature Field. We demonstrate the effectiveness of semantic lifting with our method on complex indoor and outdoor scenes from the LERF dataset.
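A minimal sketch, under assumed interfaces rather than the thesis implementation, of why vector quantization makes per-view relevance masks cheap: instead of storing a full (H, W, C) feature map per view, one can store a small shared (K, C) codebook plus a (H, W) index image, then recover a pixel-aligned relevance mask by scoring only the K codes against the query embedding.

```python
# Sketch of relevance-mask retrieval from a vector-quantized feature map
# (assumed codebook/index-image representation, for illustration only).
import numpy as np

def relevance_mask(codebook, index_image, query_embed, threshold=0.25):
    """codebook: (K, C) quantized feature vectors shared across views.
    index_image: (H, W) int array of per-pixel codebook indices for one view.
    query_embed: (C,) embedding of the semantic query.
    Returns (H, W) boolean mask of pixels relevant to the query."""
    codes = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    q = query_embed / np.linalg.norm(query_embed)
    code_scores = codes @ q                       # score each of the K codes once
    return code_scores[index_image] > threshold   # broadcast code scores to pixels
```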
Date issued
2025-02
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology