Visually Accurate Database-Enabled Reconstructions of Scenes (VADERS)

Author(s)
Gosalia, Mehek
Download
Thesis PDF (31.19 MB)
Advisor
Lieberman, Zachary
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
This work introduces a novel pipeline for scene reconstruction that jointly prioritizes semantic accuracy and visual fidelity, addressing a gap in current approaches. Prior pipelines often emphasize either semantic analysis or photorealistic rendering, but rarely both. This method combines scene analysis, segmentation, and retexturing to yield reconstructions that preserve structural semantics while convincingly reflecting the visual qualities of the original image. The motivation lies in the limitations of existing systems. Existing database-assisted approaches depend either on proprietary datasets that restrict stylistic diversity or on in-the-wild assets; this constrains expressiveness and often produces results that are visually misaligned. Conversely, pipelines optimized for visual realism neglect semantic correctness, generating outputs that may appear plausible but lack categorical or structural grounding. Our framework addresses this by first enforcing semantic accuracy through the selection of database assets, then editing those assets to be stylistically faithful to the reference, producing reconstructions that are both interpretable and expressive. We begin with database-assisted scene analysis, using an open-source asset database containing chairs, lamps, sofas, tables, and benches. Input images are depth-mapped, segmented, and parsed into object masks, which are matched to database assets based on semantic labels and visual correspondence. Each asset is broken into semantic segments and rescaled per component using vision-language model predictions to better match the reference object. Finally, the asset is retextured based on the image mask of the reference object in the input image. Evaluation on six diverse scenes, both photographs and artworks, shows the pipeline produces semantically grounded, visually accurate reconstructions under non-research conditions. Future work will focus on expanding the asset database, reducing reliance on proprietary texturing, and releasing an open-source implementation to broaden accessibility.
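
The abstract describes a multi-stage data flow: segment the input image into object masks, match each mask to a database asset by semantic label and visual correspondence, rescale the asset's semantic components using vision-language model predictions, then retexture from the reference mask. Since the implementation is not yet released, the following Python sketch only illustrates that flow under stated assumptions: every name in it (ObjectMask, Asset, match_asset, predict_component_scale, retexture) is a hypothetical placeholder, and cosine similarity over feature embeddings stands in for whatever visual-correspondence measure the thesis actually uses.

    # Illustrative sketch of the pipeline stages named in the abstract.
    # All identifiers are hypothetical; this is not the thesis's actual API.
    from dataclasses import dataclass

    import numpy as np

    # Object categories the abstract says the asset database covers.
    SUPPORTED_LABELS = {"chair", "lamp", "sofa", "table", "bench"}


    @dataclass
    class ObjectMask:
        label: str             # semantic class from segmentation
        mask: np.ndarray       # boolean pixel mask in the input image
        embedding: np.ndarray  # visual feature vector for appearance matching


    @dataclass
    class Asset:
        name: str
        label: str
        embedding: np.ndarray
        components: tuple[str, ...]  # semantic segments, e.g. ("seat", "legs")


    def match_asset(obj: ObjectMask, database: list[Asset]) -> Asset:
        """Stage 1: choose the asset sharing the object's semantic label with
        the highest visual correspondence (cosine similarity as a stand-in)."""
        candidates = [a for a in database if a.label == obj.label]
        if not candidates:
            raise LookupError(f"no database asset for label {obj.label!r}")

        def similarity(a: Asset) -> float:
            denom = np.linalg.norm(a.embedding) * np.linalg.norm(obj.embedding)
            return float(a.embedding @ obj.embedding / denom)

        return max(candidates, key=similarity)


    def predict_component_scale(obj: ObjectMask, component: str) -> float:
        """Stage 2 placeholder: the abstract attributes per-component scales
        to vision-language model predictions; a constant stands in here."""
        return 1.0


    def retexture(asset: Asset, scales: dict[str, float], mask: np.ndarray) -> dict:
        """Stage 3 placeholder: retexture the asset from the reference
        object's image mask; the dict stands in for a textured 3D asset."""
        return {"asset": asset.name, "scales": scales,
                "texture_pixels": int(mask.sum())}


    def reconstruct(objects: list[ObjectMask], database: list[Asset]) -> list[dict]:
        """Run match -> per-component rescale -> retexture for each object."""
        scene = []
        for obj in objects:
            asset = match_asset(obj, database)
            scales = {c: predict_component_scale(obj, c) for c in asset.components}
            scene.append(retexture(asset, scales, obj.mask))
        return scene

In this reading, a caller would build ObjectMask records from the depth-mapped, segmented input image and Asset records from the open-source database, then call reconstruct(objects, database) to obtain one rescaled, retextured asset per detected object.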
Date issued
2025-09
URI
https://hdl.handle.net/1721.1/164649
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
