DSpace@MIT

An Empirical Evaluation of LLMs for the Assessment of Subjective Qualities

Author(s)
Ranade, Esha
Download
Thesis PDF (3.056 MB)
Advisor
Kagal, Lalana
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks and are increasingly being used for language generation. Significant advancements in this field have unlocked capabilities that enable their adoption in sophisticated roles, including acting as evaluators or "judges" of text for various attributes such as factuality, relevance, fluency, and reasoning quality. However, their ability to understand and assess subjective attributes, such as the level of formality in a piece of writing, and to produce content matching these attributes remains unclear and underexplored. This research develops a methodology to study how LLMs evaluate subjective attributes. It has three primary contributions: (i) a reproducible user study to generate human-annotated labels for different attributes, (ii) an analysis of the extent to which different LLMs provide subjective labels aligned with human annotators, and (iii) an analysis of the extent to which LLMs generate content aligned with specified intended subjective labels, relative to humans. The user study and the analyses have been conducted both with and without a reference scale. The scale itself, the survey design, and the evaluation questions have all undergone multiple rounds of iteration informed by study tester feedback to improve clarity, consistency, and reliability for the final study. Comparisons between human-generated ratings and LLM-generated ratings for both human-generated content and LLM-generated content reveal the extent to which LLMs align with human judgment, providing insights into their capabilities and limitations. While humans typically perform better in both roles, LLMs attain reliably high levels of success in producing and judging text, though they tend to err on the more formal side. Both groups' performance increases significantly with the aid of a formalized reference scale. Across the suite of models tested, OpenAI's GPT family leads overall performance, with Anthropic's Claude and Meta's LLaMA series showing notable strengths in specific formality ranges. Although this work focuses on the formality attribute of text, the methodology developed can be used to evaluate other subjective qualities of text, such as conciseness, usefulness, or persuasiveness. Ultimately, these findings may guide future efforts to fine-tune LLMs to produce text that more precisely matches desired stylistic or ethical standards.
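
To make contribution (ii) concrete, the following is a minimal sketch of how alignment between LLM-assigned and human-annotated formality labels could be quantified. The 1-5 scale, the toy data, and the metric choices (exact agreement, within-one agreement, mean absolute error, signed bias) are illustrative assumptions, not the thesis's actual protocol.

# Hypothetical sketch (assumed 1-5 formality scale, not the thesis's exact method):
# measure how closely a model's formality ratings track human-annotated labels.
from statistics import mean

def alignment_metrics(human_labels, llm_labels):
    """Compare equal-length lists of integer formality ratings
    (1 = very informal, 5 = very formal; scale assumed for illustration)."""
    if len(human_labels) != len(llm_labels):
        raise ValueError("rating lists must be the same length")
    diffs = [abs(h - m) for h, m in zip(human_labels, llm_labels)]
    return {
        "exact_agreement": mean(d == 0 for d in diffs),   # identical rating
        "within_one": mean(d <= 1 for d in diffs),        # off by at most one point
        "mean_abs_error": mean(diffs),                     # average rating distance
        "mean_signed_bias": mean(m - h for h, m in zip(human_labels, llm_labels)),
    }

if __name__ == "__main__":
    # Toy data: human consensus labels vs. one model's labels for six texts.
    human = [2, 4, 3, 5, 1, 4]
    model = [3, 4, 3, 5, 2, 5]
    print(alignment_metrics(human, model))  # positive bias => model rates text as more formal

In practice, the model ratings would come from prompting each LLM to rate the same texts shown to the human annotators, and the with-scale and without-scale conditions described in the abstract would presumably correspond to including or omitting the reference scale's definitions in that prompt; a positive signed bias would reflect the tendency, noted above, to err on the more formal side.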
Date issued
2025-09
URI
https://hdl.handle.net/1721.1/164652
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
