Evaluating Differences in GPT-4 Treatment by Gender in Healthcare Applications
Author(s)
Pan, Eileen
Advisor
Ghassemi, Marzyeh
Abstract
LLMs already permeate medical settings, supporting patient messaging, medical scribing, and chatbots. While prior work has examined bias in medical LLMs, few studies focus on realistic use cases or analyze the source of the bias. To assess whether medical LLMs exhibit differential performance by gender, we audit their responses and investigate whether the disparities stem from implicit or explicit gender cues. We conduct a large-scale human evaluation of GPT-4 responses to medical questions, including counterfactual gender pairs for each question. Our findings reveal differential treatment based on the original patient gender. Specifically, responses for women more often recommend supportive resources, while those for men advise emergency care. Additionally, LLMs tend to downplay medical urgency for female patients and escalate it for male patients. Given rising interest in “LLM-as-a-judge” approaches, we also evaluate whether LLMs can serve as a proxy for human annotators in identifying disparities. We find that LLM-generated annotations diverge from human assessments in heterogeneous ways, particularly regarding error detection and relative urgency.
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology