Research Resources: Up-to-date Survey of LLM Value Alignment
Explore key resources on value alignment of large language models (LLMs), including papers, benchmarks, and open-source projects. We aim to engage more researchers and facilitate an easy entry into this critical research area.
Taxonomy of Alignment Algorithms
RL-based Alignment
- Deep reinforcement learning from human preferences. Christiano et al. Neurips 2017. [Paper]
- Learning to summarize with human feedback. Stiennon et al. Neurips 2020. [Paper][Project][Data]
- Webgpt: Browser-assisted question-answering with human feedback. Nakano et al. arXiv 2021. [Paper][Data]
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Improving alignment of dialogue agents via targeted human judgements. Glaese et al. arXiv 2022. [Paper][Data]
- Glm: General language model pretraining with autoregressive blank infilling. Du et al. ACL 2022. [Paper][Project]
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Checkpoint]
- Constitutional ai: Harmlessness from ai feedback. Bai et al. arXiv 2022. [Paper][Data]
- Aligning Large Language Models through Synthetic Feedback. Kim et al. arXiv 2023. [Paper]
- Reinforced self-training (rest) for language modeling. Gulcehre et al. arXiv 2023. [Paper]
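Most entries in this RL-based family follow the RLHF recipe: fit a reward model on pairwise human preference labels, then fine-tune the policy against it with PPO. A minimal sketch of the standard pairwise (Bradley-Terry) reward-modeling loss, with scalar rewards standing in for a learned network's outputs (the function name and scalar inputs are illustrative simplifications):

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # Minimized when the reward model scores the human-preferred
    # response higher than the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# A wider reward gap in favor of the chosen response gives a lower loss.
assert reward_model_loss(2.0, 0.0) < reward_model_loss(0.1, 0.0)
```

In practice the two rewards come from a learned network scoring full responses; the fitted model then supplies the scalar reward signal for the PPO stage.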
SFT-based Alignment
MLE-based
- Lima: Less is more for alignment. Zhou et al. Neurips 2023. [Paper]
- Self-instruct: Aligning language model with self generated instructions. Wang et al. arXiv 2022. [Paper][Data]
- Principle-driven self-alignment of language models from scratch with minimal human supervision. Sun et al. arXiv 2023. [Paper][Project][Data]
- Chain of hindsight aligns language models with feedback. Liu et al. arXiv 2023. [Paper][Project]
- Second thoughts are best: Learning to re-align with human values from text edits. Liu et al. Neurips 2022. [Paper]
- Training Socially Aligned Language Models in Simulated Human Society. Liu et al. arXiv 2023. [Paper][Project]
- Red-teaming large language models using chain of utterances for safety-alignment. Bhardwaj et al. arXiv 2023. [Paper][Project][Data][Checkpoint]
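The MLE-based methods above reduce alignment to supervised fine-tuning: maximize the likelihood of curated demonstration tokens. A minimal sketch of that objective, taking per-token probabilities directly as inputs (a simplification; real implementations compute them from model logits over a vocabulary):

```python
import math

def sft_nll(token_probs):
    # Supervised fine-tuning objective: mean negative log-likelihood
    # of the demonstration tokens under the model.
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A perfectly fit demonstration has zero loss.
assert sft_nll([1.0, 1.0]) == 0.0
```

The approaches in this subsection differ mainly in where the demonstrations come from (human curation, self-generated instructions, simulated societies), not in the loss itself.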
Ranking-based
- Rrhf: Rank responses to align language models with human feedback without tears. Yuan et al. arXiv 2023. [Paper][Project]
- Direct preference optimization: Your language model is secretly a reward model. Rafailov et al. arXiv 2023. [Paper]
- Preference ranking optimization for human alignment. Song et al. arXiv 2023. [Paper][Project]
- Slic-hf: Sequence likelihood calibration with human feedback. Zhao et al. arXiv 2023. [Paper]
- A general theoretical paradigm to understand learning from human preferences. Azar et al. arXiv 2023. [Paper]
- Contrastive preference learning: Learning from human feedback without rl. Hejna et al. arXiv 2023. [Paper][Project]
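Ranking-based methods skip the explicit reward model and optimize on preference pairs directly. A minimal sketch of the DPO loss from Rafailov et al., taking summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model (argument names are illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO: -log sigmoid(beta * implicit reward margin), where the
    # implicit reward of a response is beta * (log pi - log ref).
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With no margin over the reference model, the loss is log(2).
assert abs(dpo_loss(-1.0, -2.0, -1.0, -2.0) - math.log(2.0)) < 1e-9
```

The other entries in this subsection (RRHF, PRO, SLiC-HF, IPO, CPL) vary the ranking or calibration objective but share this reward-model-free structure.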
In-Context Alignment
- The capacity for moral self-correction in large language models. Ganguli et al. arXiv 2023. [Paper]
- Critic: Large language models can self-correct with tool-interactive critiquing. Gou et al. arXiv 2023. [Paper][Project]
- Rain: Your language models can align themselves without finetuning. Li et al. ICLR 2024. [Paper][Project]
- An explanation of in-context learning as implicit bayesian inference. Xie et al. ICLR 2021. [Paper][Project][Data]
- In-context alignment: Chat with vanilla language models before fine-tuning. Han et al. arXiv 2023. [Paper][Project]
- The unlocking spell on base llms: Rethinking alignment via in-context learning. Lin et al. arXiv 2023. [Paper][Project]
- Align on the fly: Adapting chatbot behavior to established norms. Xu et al. arXiv 2023. [Paper][Project]
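In-context alignment steers a model at inference time instead of updating its weights: value principles and aligned demonstrations are prepended to the query so a vanilla model imitates them. A minimal prompt-construction sketch (the format is one illustrative choice, not the exact template of any paper above):

```python
def build_aligned_prompt(query, principles, demos):
    # Prepend behavioral principles and aligned (question, answer)
    # demonstrations, then leave the assistant turn open for the model.
    lines = ["Follow these principles when answering:"]
    lines += [f"- {p}" for p in principles]
    for q, a in demos:
        lines += [f"User: {q}", f"Assistant: {a}"]
    lines += [f"User: {query}", "Assistant:"]
    return "\n".join(lines)
```

Methods in this subsection differ in where the principles and demonstrations come from (static constitutions, retrieved norms, self-evaluation loops) and in how the model's own outputs are rewound or re-ranked.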
Personalized Alignment
- Recommendation as instruction following: A large language model empowered recommendation approach. Zhang et al. arXiv 2023. [Paper][Project]
- Tallrec: An effective and efficient tuning framework to align large language model with recommendation. Bao et al. RecSys 2023. [Paper][Project]
- Zero-shot next-item recommendation using large pretrained language models. Zhang et al. arXiv 2023. [Paper][Project]
- Palr: Personalization aware llms for recommendation. Chen et al. arXiv 2023. [Paper]
- Llm-empowered chatbots for psychiatrist and patient simulation: Application and evaluation. Chen et al. arXiv 2023. [Paper]
- Misc: A mixed strategy-aware model integrating comet for emotional support conversation. Tu et al. ACL 2023. [Paper][Project]
- Augesc: Large-scale data augmentation for emotional support conversation with pre-trained language models. Zheng et al. ACL 2023. [Paper][Project]
- Control globally, understand locally: A global-to-local hierarchical graph network for emotional support conversation. Peng et al. IJCAI 2022. [Paper]
- Polise: Reinforcing politeness using user sentiment for customer care response generation. Firdaus et al. COLING 2022. [Paper]
- Social simulacra: Creating populated prototypes for social computing systems. Park et al. UIST 2022. [Paper]
- Generative agents: Interactive simulacra of human behavior. Park et al. UIST 2023. [Paper][Project]
- Can large language models transform computational social science? Ziems et al. arXiv 2023. [Paper][Project]
- Whose opinions do language models reflect? Santurkar et al. ICML 2023. [Paper][Project]
- Lamp: When large language models meet personalization. Salemi et al. arXiv 2023. [Paper][Project]
- Chatplug: Open-domain generative dialogue system with internet-augmented instruction tuning for digital human. Tian et al. arXiv 2023. [Paper][Project]
Multimodal Alignment
- Visual instruction tuning. Liu et al. arXiv 2023. [Paper][Project]
- Llavar: Enhanced visual instruction tuning for text-rich image understanding. Zhang et al. arXiv 2023. [Paper][Project]
- Visual Instruction Tuning with Polite Flamingo. Chen et al. arXiv 2023. [Paper][Project][Data]
- Aligning large multi-modal model with robust instruction tuning. Liu et al. arXiv 2023. [Paper][Project]
- Better aligning text-to-image models with human preference. Wu et al. ICCV 2023. [Paper][Project]
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. Zhu et al. arXiv 2023. [Paper][Project]
- Otter: A multi-modal model with in-context instruction tuning. Li et al. arXiv 2023. [Paper][Project]
- Multimodal-gpt: A vision and language model for dialogue with humans. Gong et al. arXiv 2023. [Paper][Project]
- Instructblip: towards general-purpose vision-language models with instruction tuning. Dai et al. Neurips 2023. [Paper][Project]
- Aligning text-to-image models using human feedback. Lee et al. arXiv 2023. [Paper]
Taxonomy of Alignment Goals
Human Instructions
Alignment Goal Representation
- Multitask prompted training enables zero-shot task generalization. Sanh et al. arXiv 2021. [Paper][Checkpoint][Data]
- Cross-task generalization via natural language crowdsourcing instructions. Mishra et al. arXiv 2021. [Paper][Data][Project]
- Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. Wang et al. arXiv 2022. [Paper][Data][Project]
- Glm-130b: An open bilingual pre-trained model. Zeng et al. arXiv 2022. [Paper][Project]
- Crosslingual generalization through multitask finetuning. Muennighoff et al. arXiv 2022. [Paper][Project]
- Unnatural instructions: Tuning language models with (almost) no human labor. Honovich et al. arXiv 2022. [Paper][Data]
- Self-instruct: Aligning language model with self generated instructions. Wang et al. arXiv 2022. [Paper][Data]
- Scaling instruction-finetuned language models. Chung et al. arXiv 2022. [Paper]
- The flan collection: Designing data and methods for effective instruction tuning. Longpre et al. arXiv 2023. [Paper][Data]
- Opt-IML: Scaling language model instruction meta learning through the lens of generalization. Iyer et al. arXiv 2022. [Paper]
- Stanford alpaca: An instruction-following llama model. Taori et al. 2023 [Blog][Data]
- Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality. Chiang et al. Blog 2023. [Paper][Project][Data]
- Baize: An open-source chat model with parameter-efficient tuning on self-chat data. Xu et al. arXiv 2023. [Paper][Data]
- Improving multimodal interactive agents with reinforcement learning from human feedback. Abramson et al. arXiv 2022. [Paper]
- Aligning text-to-image models using human feedback. Lee et al. arXiv 2023. [Paper]
- Visual instruction tuning. Liu et al. arXiv 2023. [Paper][Project]
- Llavar: Enhanced visual instruction tuning for text-rich image understanding. Zhang et al. arXiv 2023. [Paper][Project]
Alignment Goal Evaluation
Benchmarks
- Multitask prompted training enables zero-shot task generalization. Sanh et al. arXiv 2021. [Paper][Checkpoint][Data]
- Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. Wang et al. arXiv 2022. [Paper][Data][Project]
- The flan collection: Designing data and methods for effective instruction tuning. Longpre et al. arXiv 2023. [Paper][Data]
- Opt-IML: Scaling language model instruction meta learning through the lens of generalization. Iyer et al. arXiv 2022. [Paper]
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. arXiv 2022. [Paper][Project]
- C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Huang et al. arXiv 2023. [Paper][Project]
- Agieval: A human-centric benchmark for evaluating foundation models. Zhong et al. arXiv 2023. [Paper][Project]
- Discovering language model behaviors with model-written evaluations. Perez et al. arXiv 2022. [Paper][Project]
Automatic Chatbot Arenas
- Alpacaeval: An automatic evaluator of instruction-following models. Li et al. 2023. [Project]
- Alpacafarm: A simulation framework for methods that learn from human feedback. Dubois et al. arXiv 2023. [Paper][Project]
- Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality. Chiang et al. Blog 2023. [Paper][Project]
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Zheng et al. arXiv 2023. [Paper][Project]
Human Preferences
Alignment Goal Representation
Human Demonstrations
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Learning to summarize with human feedback. Stiennon et al. Neurips 2020. [Paper][Project][Data]
- Recursively summarizing books with human feedback. Wu et al. arXiv 2021. [Paper][Data]
- Webgpt: Browser-assisted question-answering with human feedback. Nakano et al. arXiv 2021. [Paper][Data]
- OpenAssistant Conversations–Democratizing Large Language Model Alignment. Kopf et al. arXiv 2023. [Paper][Project][Data][Checkpoint]
- Reward design with language models. Kwon et al. arXiv 2023. [Paper]
Human Feedback
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Learning to summarize with human feedback. Stiennon et al. Neurips 2020. [Paper][Project][Data]
- Recursively summarizing books with human feedback. Wu et al. arXiv 2021. [Paper][Data]
- Webgpt: Browser-assisted question-answering with human feedback. Nakano et al. arXiv 2021. [Paper][Data]
- OpenAssistant Conversations–Democratizing Large Language Model Alignment. Kopf et al. arXiv 2023. [Paper][Project][Data][Checkpoint]
- Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. Wu et al. arXiv 2023. [Paper][Project][Data]
Model Synthetic Feedback
- Reward design with language models. Kwon et al. arXiv 2023. [Paper]
- Aligning Large Language Models through Synthetic Feedback. Kim et al. arXiv 2023. [Paper]
- Training Socially Aligned Language Models in Simulated Human Society. Liu et al. arXiv 2023. [Paper][Project]
- Training Language Models with Language Feedback at Scale. Scheurer et al. arXiv 2023. [Paper][Data][Project]
- Visual Instruction Tuning with Polite Flamingo. Chen et al. arXiv 2023. [Paper][Project][Data]
Alignment Goal Evaluation
Benchmarks
- TruthfulQA: Measuring how models mimic human falsehoods. Lin et al. arXiv 2022. [Paper][Data]
- Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. Mihaylov et al. EMNLP 2018. [Paper][Data]
- CrowS-pairs: A challenge dataset for measuring social biases in masked language models. Nangia et al. arXiv 2020. [Paper][Data]
- Gender bias in coreference resolution. Rudinger et al. arXiv 2018. [Paper][Data]
- BBQ: A hand-built bias benchmark for question answering. Parrish et al. arXiv 2021. [Paper][Data]
- Bold: Dataset and metrics for measuring biases in open-ended language generation. Dhamala et al. FAccT 2021. [Paper][Data]
- Realtoxicityprompts: Evaluating neural toxic degeneration in language models. Gehman et al. arXiv 2020. [Paper][Data]
- Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. Hartvigsen et al. arXiv 2022. [Paper][Data]
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. arXiv 2022. [Paper][Project]
- Holistic evaluation of language models. Liang et al. arXiv 2022. [Paper][Project]
- Discovering language model behaviors with model-written evaluations. Perez et al. arXiv 2022. [Paper][Project]
- Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity. Zhuo et al. arXiv 2023. [Paper]
Human Evaluation
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Llama 2: Open foundation and fine-tuned chat models. Touvron et al. arXiv 2023. [Paper]
- Rrhf: Rank responses to align language models with human feedback without tears. Yuan et al. arXiv 2023. [Paper][Data]
- Learning to summarize with human feedback. Stiennon et al. Neurips 2020. [Paper]
- Aligning Large Language Models through Synthetic Feedback. Kim et al. arXiv 2023. [Paper]
Reward Model
- Llama 2: Open foundation and fine-tuned chat models. Touvron et al. arXiv 2023. [Paper]
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Checkpoint]
- Direct preference optimization: Your language model is secretly a reward model. Rafailov et al. arXiv 2023. [Paper]
- Raft: Reward ranked finetuning for generative foundation model alignment. Dong et al. arXiv 2023. [Paper]
- Is reinforcement learning (not) for natural language processing?: Benchmarks, baselines, and building blocks for natural language policy optimization. Ramamurthy et al. arXiv 2022. [Paper][Project]
Human Values
Alignment Goal Representation
Value Principles
HHH (Helpful & Honest & Harmless)
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- A general language assistant as a laboratory for alignment. Askell et al. arXiv 2021. [Paper]
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
- Improving alignment of dialogue agents via targeted human judgements. Glaese et al. arXiv 2022. [Paper][Data]
- Constitutional ai: Harmlessness from ai feedback. Bai et al. arXiv 2022. [Paper][Data]
- Principle-driven self-alignment of language models from scratch with minimal human supervision. Sun et al. arXiv 2023. [Paper][Project][Data]
- Process for adapting language models to society (palms) with values-targeted datasets. Solaiman et al. Neurips 2021. [Paper]
Social Norms & Ethics
- The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
- Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
- Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
- Aligning ai with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
- Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. Lourie et al. AAAI 2021. [Paper][Data]
- MoralDial: A Framework to Train and Evaluate Moral Dialogue Systems via Moral Discussions. Sun et al. ACL 2023. [Paper][Project]
- Learning norms from stories: A prior for value aligned agents. Nahian et al. AIES 2020. [Paper]
Target Representation
Desirable Behaviors
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. Ganguli et al. arXiv 2022. [Paper][Data]
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Aligning ai with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
- Social bias frames: Reasoning about social and power implications of language. Sap et al. arXiv 2019. [Paper][Data]
Value Principles
- Improving alignment of dialogue agents via targeted human judgements. Glaese et al. arXiv 2022. [Paper][Data]
- Constitutional ai: Harmlessness from ai feedback. Bai et al. arXiv 2022. [Paper][Data]
- Principle-driven self-alignment of language models from scratch with minimal human supervision. Sun et al. arXiv 2023. [Paper][Data]
- Process for adapting language models to society (palms) with values-targeted datasets. Solaiman et al. Neurips 2021. [Paper]
- The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
- Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
- Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
- Can machines learn morality? the delphi experiment. Jiang et al. arXiv 2021. [Paper][Project]
Alignment Goal Evaluation
Benchmarks
Safety and Risk
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
- Safety Assessment of Chinese Large Language Models. Sun et al. arXiv 2023. [Paper][Data][Leaderboard]
- SafeText: A benchmark for exploring physical safety in language models. Levy et al. arXiv 2022. [Paper][Data]
- CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility. Xu et al. arXiv 2023. [Paper][Project][Data]
Social Norms
- The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
- Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
- Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
- Aligning ai with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
- Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. Lourie et al. AAAI 2021. [Paper][Data]
- Moral mimicry: Large language models produce moral rationalizations tailored to political identity. Simmons et al. arXiv 2022. [Paper]
- When to make exceptions: Exploring language models as accounts of human moral judgment. Jin et al. Neurips 2022. [Paper][Project][Data]
- Towards Answering Open-ended Ethical Quandary Questions. Bang et al. arXiv 2022. [Paper]
Value Classifier
- Learning norms from stories: A prior for value aligned agents. Nahian et al. AIES 2020. [Paper]
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper]
- Can machines learn morality? the delphi experiment. Jiang et al. arXiv 2021. [Paper][Project]
Basic Values
Alignment Goal Representation
Basic Value Theory
- An overview of the Schwartz theory of basic values. Schwartz et al. Online readings in Psychology and Culture 2012. [Paper]
- Rokeach value survey. Rokeach. The nature of human values. 1967. [Paper]
- Life values inventory: Facilitator's guide. Brown et al. Williamsburg, VA 2002. [Paper]
- Moral foundations theory: The pragmatic validity of moral pluralism. Graham et al. Advances in experimental social psychology, 2013. [Paper]
Alignment Goal Evaluation
Value Surveys
- Towards Measuring the Representation of Subjective Global Opinions in Language Models. Durmus et al. arXiv 2023. [Paper][Data]
- Culture’s consequences: International differences in work-related values. Hofstede et al. 1984. [Paper]
- World Values Survey Wave 7 (2017-2022). [URL]
- European Values Study. [URL]
- Pew Research Center's Global Attitudes Surveys (GAS). [URL]
- An overview of the Schwartz theory of basic values. Schwartz et al. Online readings in Psychology and Culture 2012. [Paper]
- Probing pre-trained language models for cross-cultural differences in values. Arora et al. arXiv 2022. [Paper]
Basic Value Classifier
- Valuenet: A new dataset for human value driven dialogue system. Qiu et al. AAAI 2022. [Paper][Project]
- Moral foundations twitter corpus: A collection of 35k tweets annotated for moral sentiment. Hoover et al. Social Psychological and Personality Science 2020. [Paper]
- Large pre-trained language models contain human-like biases of what is right and wrong to do. Schramowski et al. Nature Machine Intelligence 2022. [Paper]