Research Resources: Up-to-date Survey of LLM Value Alignment
Explore key resources on value alignment of large language models (LLMs), including papers, benchmarks, and open-source projects. We aim to engage more researchers and facilitate an easy entry into this critical research area.
Taxonomy of Alignment Algorithms
RL-based Alignment
- Deep reinforcement learning from human preferences. Christiano et al. Neurips 2017. [Paper]
- Learning to summarize with human feedback. Stiennon et al. Neurips 2020. [Paper][Project][Data]
- Webgpt: Browser-assisted question-answering with human feedback. Nakano et al. arXiv 2021. [Paper][Data]
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Improving alignment of dialogue agents via targeted human judgements. Glaese et al. arXiv 2022. [Paper][Data]
- Glm: General language model pretraining with autoregressive blank infilling. Du et al. ACL 2022. [Paper][Project]
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Checkpoint]
- Constitutional ai: Harmlessness from ai feedback. Bai et al. arXiv 2022. [Paper][Data]
- Aligning Large Language Models through Synthetic Feedback. Kim et al. arXiv 2023. [Paper]
- Reinforced self-training (rest) for language modeling. Gulcehre et al. arXiv 2023. [Paper]
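Most entries in this RL-based family follow the RLHF recipe: fit a reward model on pairwise human preference labels, then fine-tune the policy against it with PPO. A minimal sketch of the standard pairwise (Bradley-Terry) reward-modeling loss, with scalar rewards standing in for a learned network's outputs (the function name and scalar inputs are illustrative simplifications):

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # Minimized when the reward model scores the human-preferred
    # response higher than the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# A wider reward gap in favor of the chosen response gives a lower loss.
assert reward_model_loss(2.0, 0.0) < reward_model_loss(0.1, 0.0)
```

In practice the two rewards come from a learned network scoring full responses; the fitted model then supplies the scalar reward signal for the PPO stage.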
SFT-based Alignment
MLE-based
- Lima: Less is more for alignment. Zhou et al. Neurips 2023. [Paper]
- Self-instruct: Aligning language model with self generated instructions. Wang et al. arXiv 2022. [Paper][Data]
- Principle-driven self-alignment of language models from scratch with minimal human supervision. Sun et al. arXiv 2023. [Paper][Project][Data]
- Chain of hindsight aligns language models with feedback. Liu et al. arXiv 2023. [Paper][Project]
- Second thoughts are best: Learning to re-align with human values from text edits. Liu et al. Neurips 2022. [Paper]
- Training Socially Aligned Language Models in Simulated Human Society. Liu et al. arXiv 2023. [Paper][Project]
- Red-teaming large language models using chain of utterances for safety-alignment. Bhardwaj et al. arXiv 2023. [Paper][Project][Data][Checkpoint]
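The MLE-based methods above reduce alignment to supervised fine-tuning: maximize the likelihood of curated demonstration tokens. A minimal sketch of that objective, taking per-token probabilities directly as inputs (a simplification; real implementations compute them from model logits over a vocabulary):

```python
import math

def sft_nll(token_probs):
    # Supervised fine-tuning objective: mean negative log-likelihood
    # of the demonstration tokens under the model.
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A perfectly fit demonstration has zero loss.
assert sft_nll([1.0, 1.0]) == 0.0
```

The approaches in this subsection differ mainly in where the demonstrations come from (human curation, self-generated instructions, simulated societies), not in the loss itself.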
Ranking-based
- Rrhf: Rank responses to align language models with human feedback without tears. Yuan et al. arXiv 2023. [Paper][Project]
- Direct preference optimization: Your language model is secretly a reward model. Rafailov et al. arXiv 2023. [Paper]
- Preference ranking optimization for human alignment. Song et al. arXiv 2023. [Paper][Project]
- Slic-hf: Sequence likelihood calibration with human feedback. Zhao et al. arXiv 2023. [Paper]
- A general theoretical paradigm to understand learning from human preferences. Azar et al. arXiv 2023. [Paper]
- Contrastive preference learning: Learning from human feedback without rl. Hejna et al. arXiv 2023. [Paper][Project]
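Ranking-based methods skip the explicit reward model and optimize on preference pairs directly. A minimal sketch of the DPO loss from Rafailov et al., taking summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model (argument names are illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO: -log sigmoid(beta * implicit reward margin), where the
    # implicit reward of a response is beta * (log pi - log ref).
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With no margin over the reference model, the loss is log(2).
assert abs(dpo_loss(-1.0, -2.0, -1.0, -2.0) - math.log(2.0)) < 1e-9
```

The other entries in this subsection (RRHF, PRO, SLiC-HF, IPO, CPL) vary the ranking or calibration objective but share this reward-model-free structure.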
In-Context Alignment
- The capacity for moral self-correction in large language models. Ganguli et al. arXiv 2023. [Paper]
- Critic: Large language models can self-correct with tool-interactive critiquing. Gou et al. arXiv 2023. [Paper][Project]
- Rain: Your language models can align themselves without finetuning. Li et al. ICLR 2024. [Paper][Project]
- An explanation of in-context learning as implicit bayesian inference. Xie et al. ICLR 2021. [Paper][Project][Data]
- In-context alignment: Chat with vanilla language models before fine-tuning. Han et al. arXiv 2023. [Paper][Project]
- The unlocking spell on base llms: Rethinking alignment via in-context learning. Lin et al. arXiv 2023. [Paper][Project]
- Align on the fly: Adapting chatbot behavior to established norms. Xu et al. arXiv 2023. [Paper][Project]
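In-context alignment steers a model at inference time instead of updating its weights: value principles and aligned demonstrations are prepended to the query so a vanilla model imitates them. A minimal prompt-construction sketch (the format is one illustrative choice, not the exact template of any paper above):

```python
def build_aligned_prompt(query, principles, demos):
    # Prepend behavioral principles and aligned (question, answer)
    # demonstrations, then leave the assistant turn open for the model.
    lines = ["Follow these principles when answering:"]
    lines += [f"- {p}" for p in principles]
    for q, a in demos:
        lines += [f"User: {q}", f"Assistant: {a}"]
    lines += [f"User: {query}", "Assistant:"]
    return "\n".join(lines)
```

Methods in this subsection differ in where the principles and demonstrations come from (static constitutions, retrieved norms, self-evaluation loops) and in how the model's own outputs are rewound or re-ranked.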
Personalized Alignment
- Recommendation as instruction following: A large language model empowered recommendation approach. Zhang et al. arXiv 2023. [Paper][Project]
- Tallrec: An effective and efficient tuning framework to align large language model with recommendation. Bao et al. RecSys 2023. [Paper][Project]
- Zero-shot next-item recommendation using large pretrained language models. Zhang et al. arXiv 2023. [Paper][Project]
- Palr: Personalization aware llms for recommendation. Chen et al. arXiv 2023. [Paper]
- Llm-empowered chatbots for psychiatrist and patient simulation: Application and evaluation. Chen et al. arXiv 2023. [Paper]
- Misc: A mixed strategy-aware model integrating comet for emotional support conversation. Tu et al. ACL 2023. [Paper][Project]
- Augesc: Large-scale data augmentation for emotional support conversation with pre-trained language models. Zheng et al. ACL 2023. [Paper][Project]
- Control globally, understand locally: A global-to-local hierarchical graph network for emotional support conversation. Peng et al. IJCAI 2022. [Paper]
- Polise: Reinforcing politeness using user sentiment for customer care response generation. Firdaus et al. COLING 2022. [Paper]
- Social simulacra: Creating populated prototypes for social computing systems. Park et al. UIST 2022. [Paper]
- Generative agents: Interactive simulacra of human behavior. Park et al. UIST 2023. [Paper][Project]
- Can large language models transform computational social science? Ziems et al. arXiv 2023. [Paper][Project]
- Whose opinions do language models reflect? Santurkar et al. ICML 2023. [Paper][Project]
- Lamp: When large language models meet personalization. Salemi et al. arXiv 2023. [Paper][Project]
- Chatplug: Open-domain generative dialogue system with internet-augmented instruction tuning for digital human. Tian et al. arXiv 2023. [Paper][Project]
Multimodal Alignment
- Visual instruction tuning. Liu et al. arXiv 2023. [Paper][Project]
- Llavar: Enhanced visual instruction tuning for text-rich image understanding. Zhang et al. arXiv 2023. [Paper][Project]
- Visual Instruction Tuning with Polite Flamingo. Chen et al. arXiv 2023. [Paper][Project][Data]
- Aligning large multi-modal model with robust instruction tuning. Liu et al. arXiv 2023. [Paper][Project]
- Better aligning text-to-image models with human preference. Wu et al. ICCV 2023. [Paper][Project]
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. Zhu et al. arXiv 2023. [Paper][Project]
- Otter: A multi-modal model with in-context instruction tuning. Li et al. arXiv 2023. [Paper][Project]
- Multimodal-gpt: A vision and language model for dialogue with humans. Gong et al. arXiv 2023. [Paper][Project]
- Instructblip: towards general-purpose vision-language models with instruction tuning. Dai et al. Neurips 2023. [Paper][Project]
- Aligning text-to-image models using human feedback. Lee et al. arXiv 2023. [Paper]
Taxonomy of Alignment Goals
Human Instructions
Alignment Goal Representation
- Multitask prompted training enables zero-shot task generalization. Sanh et al. arXiv 2021. [Paper][Checkpoint][Data]
- Cross-task generalization via natural language crowdsourcing instructions. Mishra et al. arXiv 2021. [Paper][Data][Project]
- Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. Wang et al. arXiv 2022. [Paper][Data][Project]
- Glm-130b: An open bilingual pre-trained model. Zeng et al. arXiv 2022. [Paper][Project]
- Crosslingual generalization through multitask finetuning. Muennighoff et al. arXiv 2022. [Paper][Project]
- Unnatural instructions: Tuning language models with (almost) no human labor. Honovich et al. arXiv 2022. [Paper][Data]
- Self-instruct: Aligning language model with self generated instructions. Wang et al. arXiv 2022. [Paper][Data]
- Scaling instruction-finetuned language models. Chung et al. arXiv 2022. [Paper]
- The flan collection: Designing data and methods for effective instruction tuning. Longpre et al. arXiv 2023. [Paper][Data]
- Opt-IML: Scaling language model instruction meta learning through the lens of generalization. Iyer et al. arXiv 2022. [Paper]
- Stanford alpaca: An instruction-following llama model. Taori et al. 2023 [Blog][Data]
- Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality. Chiang et al. Blog 2023. [Paper][Project][Data]
- Baize: An open-source chat model with parameter-efficient tuning on self-chat data. Xu et al. arXiv 2023. [Paper][Data]
- Improving multimodal interactive agents with reinforcement learning from human feedback. Abramson et al. arXiv 2022. [Paper]
- Aligning text-to-image models using human feedback. Lee et al. arXiv 2023. [Paper]
- Visual instruction tuning. Liu et al. arXiv 2023. [Paper][Project]
- Llavar: Enhanced visual instruction tuning for text-rich image understanding. Zhang et al. arXiv 2023. [Paper][Project]
Alignment Goal Evaluation
Benchmarks
- Multitask prompted training enables zero-shot task generalization. Sanh et al. arXiv 2021. [Paper][Checkpoint][Data]
- Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. Wang et al. arXiv 2022. [Paper][Data][Project]
- The flan collection: Designing data and methods for effective instruction tuning. Longpre et al. arXiv 2023. [Paper][Data]
- Opt-IML: Scaling language model instruction meta learning through the lens of generalization. Iyer et al. arXiv 2022. [Paper]
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. arXiv 2022. [Paper][Project]
- C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Huang et al. arXiv 2023. [Paper][Project]
- Agieval: A human-centric benchmark for evaluating foundation models. Zhong et al. arXiv 2023. [Paper][Project]
- Discovering language model behaviors with model-written evaluations. Perez et al. arXiv 2022. [Paper][Project]
Automatic Chatbot Arenas
- Alpacaeval: An automatic evaluator of instruction-following models. Li et al. 2023. [Project]
- Alpacafarm: A simulation framework for methods that learn from human feedback. Dubois et al. arXiv 2023. [Paper][Project]
- Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality. Chiang et al. Blog 2023. [Paper][Project]
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Zheng et al. arXiv 2023. [Paper][Project]
Human Preferences
Alignment Goal Representation
Human Demonstrations
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Learning to summarize with human feedback. Stiennon et al. Neurips 2020. [Paper][Project][Data]
- Recursively summarizing books with human feedback. Wu et al. arXiv 2021. [Paper][Data]
- Webgpt: Browser-assisted question-answering with human feedback. Nakano et al. arXiv 2021. [Paper][Data]
- OpenAssistant Conversations–Democratizing Large Language Model Alignment. Kopf et al. arXiv 2023. [Paper][Project][Data][Checkpoint]
- Reward design with language models. Kwon et al. arXiv 2023. [Paper]
Human Feedback
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Learning to summarize with human feedback. Stiennon et al. Neurips 2020. [Paper][Project][Data]
- Recursively summarizing books with human feedback. Wu et al. arXiv 2021. [Paper][Data]
- Webgpt: Browser-assisted question-answering with human feedback. Nakano et al. arXiv 2021. [Paper][Data]
- OpenAssistant Conversations–Democratizing Large Language Model Alignment. Kopf et al. arXiv 2023. [Paper][Project][Data][Checkpoint]
- Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. Wu et al. arXiv 2023. [Paper][Project][Data]
Model Synthetic Feedback
- Reward design with language models. Kwon et al. arXiv 2023. [Paper]
- Aligning Large Language Models through Synthetic Feedback. Kim et al. arXiv 2023. [Paper]
- Training Socially Aligned Language Models in Simulated Human Society. Liu et al. arXiv 2023. [Paper][Project]
- Training Language Models with Language Feedback at Scale. Scheurer et al. arXiv 2023. [Paper][Data][Project]
- Visual Instruction Tuning with Polite Flamingo. Chen et al. arXiv 2023. [Paper][Project][Data]
Alignment Goal Evaluation
Benchmarks
- TruthfulQA: Measuring how models mimic human falsehoods. Lin et al. arXiv 2022. [Paper][Data]
- Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. Mihaylov et al. EMNLP 2018. [Paper][Data]
- CrowS-pairs: A challenge dataset for measuring social biases in masked language models. Nangia et al. arXiv 2020. [Paper][Data]
- Gender bias in coreference resolution. Rudinger et al. arXiv 2018. [Paper][Data]
- BBQ: A hand-built bias benchmark for question answering. Parrish et al. arXiv 2021. [Paper][Data]
- Bold: Dataset and metrics for measuring biases in open-ended language generation. Dhamala et al. FAccT 2021. [Paper][Data]
- Realtoxicityprompts: Evaluating neural toxic degeneration in language models. Gehman et al. arXiv 2020. [Paper][Data]
- Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. Hartvigsen et al. arXiv 2022. [Paper][Data]
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. arXiv 2022. [Paper][Project]
- Holistic evaluation of language models. Liang et al. arXiv 2022. [Paper][Project]
- Discovering language model behaviors with model-written evaluations. Perez et al. arXiv 2022. [Paper][Project]
- Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity. Zhuo et al. arXiv 2023. [Paper]
Human Evaluation
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Llama 2: Open foundation and fine-tuned chat models. Touvron et al. arXiv 2023. [Paper]
- Rrhf: Rank responses to align language models with human feedback without tears. Yuan et al. arXiv 2023. [Paper][Data]
- Learning to summarize with human feedback. Stiennon et al. Neurips 2020. [Paper]
- Aligning Large Language Models through Synthetic Feedback. Kim et al. arXiv 2023. [Paper]
Reward Model
- Llama 2: Open foundation and fine-tuned chat models. Touvron et al. arXiv 2023. [Paper]
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Checkpoint]
- Direct preference optimization: Your language model is secretly a reward model. Rafailov et al. arXiv 2023. [Paper]
- Raft: Reward ranked finetuning for generative foundation model alignment. Dong et al. arXiv 2023. [Paper]
- Is reinforcement learning (not) for natural language processing?: Benchmarks, baselines, and building blocks for natural language policy optimization. Ramamurthy et al. arXiv 2022. [Paper][Project]
Human Values
Alignment Goal Representation
Value Principles
HHH (Helpful & Honest & Harmless)
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- A general language assistant as a laboratory for alignment. Askell et al. arXiv 2021. [Paper]
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
- Improving alignment of dialogue agents via targeted human judgements. Glaese et al. arXiv 2022. [Paper][Data]
- Constitutional ai: Harmlessness from ai feedback. Bai et al. arXiv 2022. [Paper][Data]
- Principle-driven self-alignment of language models from scratch with minimal human supervision. Sun et al. arXiv 2023. [Paper][Project][Data]
- Process for adapting language models to society (palms) with values-targeted datasets. Solaiman et al. Neurips 2021. [Paper]
Social Norms & Ethics
- The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
- Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
- Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
- Aligning ai with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
- Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. Lourie et al. AAAI 2021. [Paper][Data]
- MoralDial: A Framework to Train and Evaluate Moral Dialogue Systems via Moral Discussions. Sun et al. ACL 2023. [Paper][Project]
- Learning norms from stories: A prior for value aligned agents. Nahian et al. AIES 2020. [Paper]
Target Representation
Desirable Behaviors
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. Ganguli et al. arXiv 2022. [Paper][Data]
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Aligning ai with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
- Social bias frames: Reasoning about social and power implications of language. Sap et al. arXiv 2019. [Paper][Data]
Value Principles
- Improving alignment of dialogue agents via targeted human judgements. Glaese et al. arXiv 2022. [Paper][Data]
- Constitutional ai: Harmlessness from ai feedback. Bai et al. arXiv 2022. [Paper][Data]
- Principle-driven self-alignment of language models from scratch with minimal human supervision. Sun et al. arXiv 2023. [Paper][Data]
- Process for adapting language models to society (palms) with values-targeted datasets. Solaiman et al. Neurips 2021. [Paper]
- The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
- Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
- Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
- Can machines learn morality? the delphi experiment. Jiang et al. arXiv 2021. [Paper][Project]
Alignment Goal Evaluation
Benchmarks
Safety and Risk
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
- Safety Assessment of Chinese Large Language Models. Sun et al. arXiv 2023. [Paper][Data][Leaderboard]
- SafeText: A benchmark for exploring physical safety in language models. Levy et al. arXiv 2022. [Paper][Data]
- CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility. Xu et al. arXiv 2023. [Paper][Project][Data]
Social Norms
- The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
- Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
- Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
- Aligning ai with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
- Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. Lourie et al. AAAI 2021. [Paper][Data]
- Moral mimicry: Large language models produce moral rationalizations tailored to political identity. Simmons et al. arXiv 2022. [Paper]
- When to make exceptions: Exploring language models as accounts of human moral judgment. Jin et al. Neurips 2022. [Paper][Project][Data]
- Towards Answering Open-ended Ethical Quandary Questions. Bang et al. arXiv 2022. [Paper]
Value Classifier
- Learning norms from stories: A prior for value aligned agents. Nahian et al. AIES 2020. [Paper]
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper]
- Can machines learn morality? the delphi experiment. Jiang et al. arXiv 2021. [Paper][Project]
Basic Values
Alignment Goal Representation
Basic Value Theory
- An overview of the Schwartz theory of basic values. Schwartz et al. Online readings in Psychology and Culture 2012. [Paper]
- Rokeach value survey. Rokeach. The nature of human values. 1967. [Paper]
- Life values inventory: Facilitator's guide. Brown et al. Williamsburg, VA 2002. [Paper]
- Moral foundations theory: The pragmatic validity of moral pluralism. Graham et al. Advances in experimental social psychology, 2013. [Paper]
Alignment Goal Evaluation
Value Surveys
- Towards Measuring the Representation of Subjective Global Opinions in Language Models. Durmus et al. arXiv 2023. [Paper][Data]
- Culture’s consequences: International differences in work-related values. Hofstede et al. 1984. [Paper]
- World Values Survey Wave 7 (2017-2022). [URL]
- European Values Study. [URL]
- Pew Research Center's Global Attitudes Surveys (GAS). [URL]
- An overview of the Schwartz theory of basic values. Schwartz et al. Online readings in Psychology and Culture 2012. [Paper]
- Probing pre-trained language models for cross-cultural differences in values. Arora et al. arXiv 2022. [Paper]
Basic Value Classifier
- Valuenet: A new dataset for human value driven dialogue system. Qiu et al. AAAI 2022. [Paper][Project]
- Moral foundations twitter corpus: A collection of 35k tweets annotated for moral sentiment. Hoover et al. Social Psychological and Personality Science 2020. [Paper]
- Large pre-trained language models contain human-like biases of what is right and wrong to do. Schramowski et al. Nature Machine Intelligence 2022. [Paper]