
dc.contributor.advisor	Kagal, Lalana
dc.contributor.author	Ranade, Esha
dc.date.accessioned	2026-01-29T15:06:03Z
dc.date.available	2026-01-29T15:06:03Z
dc.date.issued	2025-09
dc.date.submitted	2025-09-15T14:56:26.204Z
dc.identifier.uri	https://hdl.handle.net/1721.1/164652
dc.description.abstract	Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks and are increasingly being used for language generation. Significant advancements in this field have unlocked capabilities that enable their adoption in sophisticated roles, including acting as evaluators or "judges" of text for various attributes such as factuality, relevance, fluency, and reasoning quality. However, their ability to assess subjective attributes, such as the level of formality in a piece of writing, and to produce content matching these subjective attributes remains unclear and underexplored. This research develops a methodology to study how LLMs evaluate subjective attributes. It has three primary contributions: (i) a reproducible user study to generate human-annotated labels for different attributes, (ii) an analysis of the extent to which different LLMs provide subjective labels aligned with human annotators, and (iii) an analysis of the extent to which LLMs generate content aligned with specified intended subjective labels, relative to humans. The user study and the analyses have been conducted both with and without a reference scale. The scale itself, the survey design, and the evaluation questions have all undergone multiple rounds of iteration informed by study tester feedback to improve clarity, consistency, and reliability for the final study. Comparisons between human-generated ratings and LLM-generated ratings for both human-generated content and LLM-generated content reveal the extent to which LLMs align with human judgment, providing insights into their capabilities and limitations. While humans typically perform better in these roles, LLMs attain reliably high levels of success in producing and judging text, though they tend to err on the more-formal side. Both groups’ performance increases significantly with the aid of a formalized reference scale. Across the suite of models tested, OpenAI’s GPT family leads overall performance, with Anthropic’s Claude and Meta’s LLaMA series showing notable strengths in specific formality ranges. Although this work focuses on the formality attribute of text, the methodology developed can be used to evaluate other subjective qualities of text, such as conciseness, usefulness, or persuasiveness. Ultimately, these findings may guide future efforts to fine-tune LLMs to produce text that more precisely matches desired stylistic or ethical standards.
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://rightsstatements.org/page/InC-EDU/1.0/
dc.title	An Empirical Evaluation of LLMs for the Assessment of Subjective Qualities
dc.type	Thesis
dc.description.degree	M.Eng.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Master
thesis.degree.name	Master of Engineering in Electrical Engineering and Computer Science

