Title: Aligning AI With Shared Human Values

Abstract: We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgments, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgments. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
Comments: ICLR 2021; the ETHICS dataset is available at this https URL
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2008.02275 [cs.CY]
  (or arXiv:2008.02275v5 [cs.CY] for this version)

Submission history

From: Dan Hendrycks
[v1] Wed, 5 Aug 2020 17:59:16 GMT (1667kb,D)
[v2] Mon, 21 Sep 2020 06:02:59 GMT (2071kb,D)
[v3] Tue, 12 Jan 2021 18:57:47 GMT (2986kb,D)
[v4] Thu, 4 Mar 2021 21:47:22 GMT (2986kb,D)
[v5] Sat, 24 Jul 2021 04:40:33 GMT (2779kb,D)
