Visually Accurate Database-Enabled Reconstructions of Scenes (VADERS)

Gosalia, Mehek

Author(s)

Gosalia, Mehek

DownloadThesis PDF (31.19Mb)

Advisor

Lieberman, Zachary

Terms of use

In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/

Metadata

Show full item record

Abstract

This work introduces a novel pipeline for scene reconstruction that jointly prioritizes semantic accuracy and visual fidelity, addressing a gap in current approaches. Prior pipelines often emphasize either semantic analysis or photorealistic rendering, but rarely both. This method combines scene analysis, segmentation, and retexturing to yield reconstructions that preserve structural semantics, while convincingly reflecting the visual qualities of the original image. The motivation lies in the limitations of existing systems. Existing databaseassisted approaches depend on proprietary datasets that restrict stylistic diversity or using in-the-wild assets. This constrains expressiveness and often produces results that are visually misaligned. Conversely, pipelines optimized for visual realism neglect semantic correctness, generating outputs that may appear plausible but lack categorical or structural grounding. Our framework addresses this by first enforcing semantic accuracy via selecting database assets, then editing those assets to be stylistically faithful to the reference, producing reconstructions that are both interpretable and expressive. We begin with database-assisted scene analysis, using an open-source asset database containing chairs, lamps, sofas, tables, and benches. Input images are depth-mapped, segmented, and parsed into object masks, which are matched to database assets based on semantic labels and visual correspondence. Each asset is broken into semantic segments and rescaled per-component using vision-language model predictions to match the reference object better. Finally the asset is retextured based on the image mask of the reference object in the input image. Evaluation on six diverse scenes—both photographs and artworks—shows the pipeline produces semantically grounded, visually accurate reconstructions under non-research conditions. Future work will focus on expanding the asset database, reducing reliance on proprietary texturing, and releasing an open-source implementation to broaden accessibility.

Date issued

2025-09

URI

https://hdl.handle.net/1721.1/164649

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses