Vision-Language Models for Engineering Design: From Technical Documentation Benchmarking to CAD Generation
Author(s)
Doris, Annie Clare
Advisor
Ahmed, Faez
Abstract
Engineering product development is slowed by two bottlenecks: interpreting technical requirements and producing accurate, editable computer-aided design (CAD) models. This thesis evaluates and advances vision-language models (VLMs) – large-scale foundation models that process both text and images – to support engineers in these time-consuming tasks.

While benchmarks exist for evaluating VLM performance in areas such as medical imaging, optical character recognition, and robotics, benchmarks for engineering design tasks remain scarce. To remedy this, we develop DesignQA, a benchmark for rigorously quantifying VLMs' abilities to understand and apply engineering requirements in technical documentation. Built around real-world engineering challenges, DesignQA uniquely combines multimodal data – textual design requirements, CAD images, and engineering drawings – derived from the Formula SAE student competition. The benchmark features automatic evaluation metrics and is divided into three segments – Rule Comprehension, Rule Compliance, and Rule Extraction – based on the tasks engineers perform when designing according to requirements. We evaluate state-of-the-art models (at the time of writing), including GPT-4o, GPT-4, Claude-Opus, Gemini-1.0, and LLaVA-1.5, against the benchmark. Our study uncovers gaps in VLMs' abilities to interpret complex engineering documentation, including an inability to reliably retrieve relevant rules from the Formula SAE documentation and difficulty analyzing engineering drawings. These findings underscore the need for VLMs that can better handle the multifaceted questions characteristic of design according to technical documentation.

After establishing an engineering-design-specific benchmark, we investigate whether additional training can improve VLM performance on engineering tasks. In particular, we address CAD generation from images, a problem motivated by scenarios such as sketch-to-CAD workflows, recovery of lost files, and cases where only an image is available due to privacy concerns. While recent developments in AI-driven CAD generation show promise, existing models are limited by incomplete representations of CAD operations, an inability to generalize to real-world images, and low output accuracy. We develop CAD-Coder, an open-source VLM fine-tuned to generate CadQuery code directly from images, trained on GenCAD-Code, a dataset of 163,671 image–code pairs. On a 100-sample test subset, CAD-Coder outperforms strong VLM baselines (e.g., GPT-4.5, Qwen2.5-VL-72B), achieving a 100% valid-syntax rate and the highest 3D-solid similarity. It also shows early signs of generalization, producing CAD code from real photographs and executing operations (e.g., filleting) not seen during fine-tuning.

The performance and adaptability of CAD-Coder highlight the potential of VLMs fine-tuned on design-specific tasks to streamline workflows for engineers. We conclude with directions for design-specific VLMs, including synthetic-data pipelines to improve dataset coverage and reinforcement-learning strategies that exploit objective geometric rewards. Together, DesignQA and CAD-Coder indicate a practical path toward VLM assistants that accelerate requirement-aware engineering design and image-to-CAD workflows. All code, data, and trained models are released publicly to support reproducibility and future research.
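To illustrate the target representation, the snippet below is a minimal, hypothetical sketch of the kind of CadQuery script a model like CAD-Coder produces; the part, dimensions, and comments are illustrative assumptions, not an example drawn from the thesis or the GenCAD-Code dataset.

    import cadquery as cq

    # Hypothetical generated script: a plate with a centered
    # through-hole and filleted vertical edges. Dimensions are
    # illustrative only.
    result = (
        cq.Workplane("XY")
        .box(40.0, 40.0, 8.0)      # base plate, 40 x 40 x 8 mm
        .faces(">Z").workplane()   # select the top face and work on it
        .hole(10.0)                # drill a centered 10 mm through-hole
        .edges("|Z").fillet(2.0)   # fillet the four vertical edges
    )

Because the output is an editable parametric script rather than a mesh, an engineer can adjust dimensions or operations directly, which is the editability property that motivates code-based CAD generation in this work.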
Date issued
2025-09
Department
Massachusetts Institute of Technology. Department of Mechanical Engineering
Publisher
Massachusetts Institute of Technology