
dc.contributor.author: Zhi-Xuan, Tan
dc.contributor.author: Carroll, Micah
dc.contributor.author: Franklin, Matija
dc.contributor.author: Ashton, Hal
dc.date.accessioned: 2024-11-12T17:25:01Z
dc.date.available: 2024-11-12T17:25:01Z
dc.date.issued: 2024-11-09
dc.identifier.uri: https://hdl.handle.net/1721.1/157530
dc.description.abstract: The dominant practice of AI alignment assumes (1) that preferences are an adequate representation of human values, (2) that human rationality can be understood in terms of maximizing the satisfaction of preferences, and (3) that AI systems should be aligned with the preferences of one or more humans to ensure that they behave safely and in accordance with our values. Whether implicitly followed or explicitly endorsed, these commitments constitute what we term a preferentist approach to AI alignment. In this paper, we characterize and challenge the preferentist approach, describing conceptual and technical alternatives that are ripe for further research. We first survey the limits of rational choice theory as a descriptive model, explaining how preferences fail to capture the thick semantic content of human values, and how utility representations neglect the possible incommensurability of those values. We then critique the normativity of expected utility theory (EUT) for humans and AI, drawing upon arguments showing how rational agents need not comply with EUT, while highlighting how EUT is silent on which preferences are normatively acceptable. Finally, we argue that these limitations motivate a reframing of the targets of AI alignment: Instead of alignment with the preferences of a human user, developer, or humanity-writ-large, AI systems should be aligned with normative standards appropriate to their social roles, such as the role of a general-purpose assistant. Furthermore, these standards should be negotiated and agreed upon by all relevant stakeholders. On this alternative conception of alignment, a multiplicity of AI systems will be able to serve diverse ends, aligned with normative standards that promote mutual benefit and limit harm despite our plural and divergent values. [en_US]
dc.publisher: Springer Netherlands [en_US]
dc.relation.isversionof: https://doi.org/10.1007/s11098-024-02249-w [en_US]
dc.rights: Creative Commons Attribution [en_US]
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/ [en_US]
dc.source: Springer Netherlands [en_US]
dc.title: Beyond Preferences in AI Alignment [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Zhi-Xuan, T., Carroll, M., Franklin, M. et al. Beyond Preferences in AI Alignment. Philos Stud (2024). [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science [en_US]
dc.relation.journal: Philosophical Studies [en_US]
dc.identifier.mitlicense: PUBLISHER_CC
dc.eprint.version: Final published version [en_US]
dc.type.uri: http://purl.org/eprint/type/JournalArticle [en_US]
eprint.status: http://purl.org/eprint/status/PeerReviewed [en_US]
dc.date.updated: 2024-11-10T08:14:26Z
dc.language.rfc3066: en
dc.rights.holder: The Author(s)
dspace.embargo.terms: N
dspace.date.submission: 2024-11-10T08:14:26Z
mit.license: PUBLISHER_CC
mit.metadata.status: Authority Work and Publication Information Needed [en_US]

