Research Resources: Up-to-date Survey of LLM Value Alignment

Explore key resources on value alignment for large language models (LLMs), including papers, benchmarks, and open-source projects. We aim to engage more researchers and provide an easy entry point into this critical research area.

Taxonomy of Alignment Algorithms

RL-based Alignment

  1. Deep reinforcement learning from human preferences. Christiano et al. Neurips 2017. [Paper]
  2. Learning to summarize with human feedback. Stiennon et al. Neurips 2020. [Paper][Project][Data]
  3. Webgpt: Browser-assisted question-answering with human feedback. Nakano et al. arXiv 2021. [Paper][Data]
  4. Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
  5. Improving alignment of dialogue agents via targeted human judgements. Glaese et al. arXiv 2022. [Paper][Data]
  6. Glm: General language model pretraining with autoregressive blank infilling. Du et al. ACL 2022. [Paper][Project]
  7. Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Checkpoint]
  8. Constitutional ai: Harmlessness from ai feedback. Bai et al. arXiv 2022. [Paper][Data]
  9. Aligning Large Language Models through Synthetic Feedback. Kim et al. arXiv 2023. [Paper]
  10. Reinforced self-training (rest) for language modeling. Gulcehre et al. arXiv 2023. [Paper]
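
Most of the RL-based entries above follow the RLHF recipe popularized by InstructGPT: fit a reward model on pairwise human preference data, then optimize the policy with PPO against that reward plus a KL penalty toward the supervised initialization. As a rough orientation, here is a minimal sketch of the pairwise (Bradley-Terry) reward-model loss; it is an illustrative toy rather than code from any of the listed papers, and the example tensors stand in for scores produced by a learned reward head.

```python
# Minimal sketch of the pairwise reward-model loss used in RLHF pipelines:
# given scalar rewards for a preferred ("chosen") and a dispreferred
# ("rejected") response, minimize -log sigmoid(r_chosen - r_rejected).
# The toy tensors below are placeholders for outputs of a learned reward head.
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push the chosen response's reward above the rejected one's."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: three preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 0.8])
r_rejected = torch.tensor([0.4, 0.5, -0.1])
print(reward_model_loss(r_chosen, r_rejected))
```

In the subsequent RL stage, the optimized objective is typically the learned reward minus a per-token KL penalty to the reference model, which keeps the policy from drifting too far from its supervised initialization.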

SFT-based Alignment

MLE-based

  1. Lima: Less is more for alignment. Zhou et al. Neurips 2023. [Paper]
  2. Self-instruct: Aligning language model with self generated instructions. Wang et al. arXiv 2022. [Paper][Data]
  3. Principle-driven self-alignment of language models from scratch with minimal human supervision. Sun et al. arXiv 2023. [Paper][Project][Data]
  4. Chain of hindsight aligns language models with feedback. Liu et al. arXiv 2023. [Paper][Project]
  5. Second thoughts are best: Learning to re-align with human values from text edits. Liu et al. Neurips 2022. [Paper]
  6. Training Socially Aligned Language Models in Simulated Human Society. Liu et al. arXiv 2023. [Paper][Project]
  7. Red-teaming large language models using chain of utterances for safety-alignment. Bhardwaj et al. arXiv 2023. [Paper][Project][Data][Checkpoint]
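
The MLE-based methods above share a single training signal: next-token cross-entropy on curated instruction-response pairs, usually with the instruction tokens masked out of the loss. Below is a minimal sketch of that loss, assuming a causal LM that returns logits of shape (batch, seq, vocab); the `prompt_len` argument and the `-100` ignore index are common conventions, not details taken from the listed papers.

```python
# A minimal sketch of MLE-based SFT: maximize the likelihood of the response
# tokens while excluding the instruction tokens from the loss. Names and
# shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Cross-entropy over response tokens only, with the usual next-token shift."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100                  # ignore instruction tokens
    shift_logits = logits[:, :-1, :]               # position t predicts token t+1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

# Toy usage with random logits standing in for a model forward pass.
vocab, seq = 32, 10
logits = torch.randn(1, seq, vocab)
input_ids = torch.randint(0, vocab, (1, seq))
print(sft_loss(logits, input_ids, prompt_len=4))
```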

Ranking-based

  1. Rrhf: Rank responses to align language models with human feedback without tears. Yuan et al. arXiv 2023. [Paper][Project]
  2. Direct preference optimization: Your language model is secretly a reward model. Rafailov et al. arXiv 2023. [Paper]
  3. Preference ranking optimization for human alignment. Song et al. arXiv 2023. [Paper][Project]
  4. Slic-hf: Sequence likelihood calibration with human feedback. Zhao et al. arXiv 2023. [Paper]
  5. A general theoretical paradigm to understand learning from human preferences. Azar et al. arXiv 2023. [Paper]
  6. Contrastive preference learning: Learning from human feedback without rl. Hejna et al. arXiv 2023. [Paper][Project]
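
Ranking-based approaches drop the explicit reward model and RL loop and instead define a loss directly over preference pairs or ranked lists. The sketch below illustrates the DPO objective of Rafailov et al.; it assumes the summed log-probabilities of each response under the policy and a frozen reference model are computed elsewhere, and `beta = 0.1` is only an illustrative value.

```python
# Minimal sketch of the DPO loss: increase the policy's log-probability margin
# on the chosen response relative to the rejected one, measured against a
# frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """-log sigmoid(beta * [(log pi_c - log ref_c) - (log pi_r - log ref_r)])."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage: summed log-probs of one preferred and one dispreferred response.
print(dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
               torch.tensor([-13.0]), torch.tensor([-14.0])))
```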

In-Context Alignment

  1. The capacity for moral self-correction in large language models. Ganguli et al. arXiv 2023. [Paper]
  2. Critic: Large language models can self-correct with tool-interactive critiquing. Gou et al. arXiv 2023. [Paper][Project]
  3. Rain: Your language models can align themselves without finetuning. Li et al. ICLR 2024. [Paper][Project]
  4. An explanation of in-context learning as implicit bayesian inference. Xie et al. ICLR 2022. [Paper][Project][Data]
  5. In-context alignment: Chat with vanilla language models before fine-tuning. Han et al. arXiv 2023. [Paper][Project]
  6. The unlocking spell on base llms: Rethinking alignment via in-context learning. Lin et al. arXiv 2023. [Paper][Project]
  7. Align on the fly: Adapting chatbot behavior to established norms. Xu et al. arXiv 2023. [Paper][Project]
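
In-context alignment steers a base model at inference time instead of updating its weights, typically by prepending value principles and a few curated demonstrations to the user query, or by retrieving applicable norms on the fly. The sketch below is a generic illustration of this prompt assembly; the principle text and demonstration are placeholders rather than prompts from the listed papers.

```python
# A minimal sketch of in-context alignment: a base (unaligned) LM is guided by
# value principles and a few-shot demonstration prepended to the user query.
# All strings here are illustrative placeholders.
PRINCIPLES = (
    "You are a helpful, honest, and harmless assistant. "
    "Refuse requests that could cause harm and explain why."
)

DEMONSTRATIONS = [
    ("How do I pick a strong password?",
     "Use a long passphrase of unrelated words and enable two-factor authentication."),
]

def build_aligned_prompt(user_query: str) -> str:
    """Concatenate principles, demonstrations, and the query for a base LM."""
    parts = [PRINCIPLES, ""]
    for q, a in DEMONSTRATIONS:
        parts += [f"User: {q}", f"Assistant: {a}", ""]
    parts += [f"User: {user_query}", "Assistant:"]
    return "\n".join(parts)

print(build_aligned_prompt("What should I do if I find a lost wallet?"))
```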

Personalized Alignment

  1. Recommendation as instruction following: A large language model empowered recommendation approach. Zhang et al. arXiv 2023. [Paper][Project]
  2. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. Bao et al. RecSys 2023. [Paper][Project]
  3. Zero-shot next-item recommendation using large pretrained language models. Zhang et al. arXiv 2023. [Paper][Project]
  4. Palr: Personalization aware llms for recommendation. Chen et al. arXiv 2023. [Paper]
  5. Llm-empowered chatbots for psychiatrist and patient simulation: Application and evaluation. Chen et al. arXiv 2023. [Paper]
  6. Misc: A mixed strategy-aware model integrating comet for emotional support conversation. Tu et al. ACL 2022. [Paper][Project]
  7. Augesc: Large-scale data augmentation for emotional support conversation with pre-trained language models. Zheng et al. ACL 2023. [Paper][Project]
  8. Control globally, understand locally: A global-to-local hierarchical graph network for emotional support conversation. Peng et al. IJCAI 2022. [Paper]
  9. Polise: Reinforcing politeness using user sentiment for customer care response generation. Firdaus et al. COLING 2022. [Paper]
  10. Social simulacra: Creating populated prototypes for social computing systems. Park et al. UIST 2022. [Paper]
  11. Generative agents: Interactive simulacra of human behavior. Park et al. UIST 2023. [Paper][Project]
  12. Can large language models transform computational social science? Ziems et al. arXiv 2023. [Paper][Project]
  13. Whose opinions do language models reflect? Santurkar et al. ICML 2023. [Paper][Project]
  14. Lamp: When large language models meet personalization. Salemi et al. arXiv 2023. [Paper][Project]
  15. Chatplug: Open-domain generative dialogue system with internet-augmented instruction tuning for digital human. Tian et al. arXiv 2023. [Paper][Project]

Multimodal Alignment

  1. Visual instruction tuning. Liu et al. arXiv 2023. [Paper][Project]
  2. Llavar: Enhanced visual instruction tuning for text-rich image understanding. Zhang et al. arXiv 2023. [Paper][Project]
  3. Visual Instruction Tuning with Polite Flamingo. Chen et al. arXiv 2023. [Paper][Project][Data]
  4. Aligning large multi-modal model with robust instruction tuning. Liu et al. arXiv 2023. [Paper][Project]
  5. Better aligning text-to-image models with human preference. Wu et al. ICCV 2023. [Paper][Project]
  6. Minigpt-4: Enhancing vision-language understanding with advanced large language models. Zhu et al. arXiv 2023. [Paper][Project]
  7. Otter: A multi-modal model with in-context instruction tuning. Li et al. arXiv 2023. [Paper][Project]
  8. Multimodal-gpt: A vision and language model for dialogue with humans. Gong et al. arXiv 2023. [Paper][Project]
  9. Instructblip: towards general-purpose vision-language models with instruction tuning. Dai et al. Neurips 2023. [Paper][Project]
  10. Aligning text-to-image models using human feedback. Lee et al. arXiv 2023. [Paper]
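
Many of the multimodal entries extend instruction tuning to image-text data: each training example pairs an image with a (frequently GPT-assisted) conversation, and the language-model loss is applied only to the assistant turns. Below is a rough sketch of one such example in a LLaVA-like layout; the field names, file path, and `<image>` placeholder are illustrative conventions, not a fixed specification from these papers.

```python
# Illustrative layout of a visual instruction-tuning example. During training,
# "<image>" is replaced by projected visual tokens from a vision encoder, and
# the LM loss is computed only on the assistant ("gpt") turns.
import json

example = {
    "image": "coco/train2017/000000123456.jpg",   # placeholder path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this scene?"},
        {"from": "gpt", "value": "A dog is sitting in the driver's seat of a parked car."},
    ],
}

print(json.dumps(example, indent=2))
```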

Taxonomy of Alignment Goals

Human Instructions

Alignment Goal Representation

  1. Multitask prompted training enables zero-shot task generalization. Sanh et al. arXiv 2021. [Paper][Checkpoint][Data]
  2. Cross-task generalization via natural language crowdsourcing instructions. Mishra et al. arXiv 2021. [Paper][Data][Project]
  3. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. Wang et al. arXiv 2022. [Paper][Data][Project]
  4. Glm-130b: An open bilingual pre-trained model. Zeng et al. arXiv 2022. [Paper][Project]
  5. Crosslingual generalization through multitask finetuning. Muennighoff et al. arXiv 2022. [Paper][Project]
  6. Unnatural instructions: Tuning language models with (almost) no human labor. Honovich et al. arXiv 2022. [Paper][Data]
  7. Self-instruct: Aligning language model with self generated instructions. Wang et al. arXiv 2022. [Paper][Data]
  8. Scaling instruction-finetuned language models. Chung et al. arXiv 2022. [Paper]
  9. The flan collection: Designing data and methods for effective instruction tuning. Longpre et al. arXiv 2023. [Paper][Data]
  10. Opt-IML: Scaling language model instruction meta learning through the lens of generalization. Iyer et al. arXiv 2022. [Paper]
  11. Stanford alpaca: An instruction-following llama model. Taori et al. 2023. [Blog][Data]
  12. Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality. Chiang et al. 2023. [Paper][Project][Data]
  13. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. Xu et al. arXiv 2023. [Paper][Data]
  14. Improving multimodal interactive agents with reinforcement learning from human feedback. Abramson et al. arXiv 2022. [Paper]
  15. Aligning text-to-image models using human feedback. Lee et al. arXiv 2023. [Paper]
  16. Visual instruction tuning. Liu et al. arXiv 2023. [Paper][Project]
  17. Llavar: Enhanced visual instruction tuning for text-rich image understanding. Zhang et al. arXiv 2023. [Paper][Project]

Alignment Goal Evaluation

Benchmarks

  1. Multitask prompted training enables zero-shot task generalization. Sanh et al. arXiv 2021. [Paper][Checkpoint][Data]
  2. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. Wang et al. arXiv 2022. [Paper][Data][Project]
  3. The flan collection: Designing data and methods for effective instruction tuning. Longpre et al. arXiv 2023. [Paper][Data]
  4. Opt-IML: Scaling language model instruction meta learning through the lens of generalization. Iyer et al. arXiv 2022. [Paper]
  5. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. arXiv 2022. [Paper][Project]
  6. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Huang et al. arXiv 2023. [Paper][Project]
  7. Agieval: A human-centric benchmark for evaluating foundation models. Zhong et al. arXiv 2023. [Paper][Project]
  8. Discovering language model behaviors with model-written evaluations. Perez et al. arXiv 2022. [Paper][Project]

Automatic Chatbot Arenas

  1. Alpacaeval: An automatic evaluator of instruction-following models. Li et al. 2023. [Project]
  2. Alpacafarm: A simulation framework for methods that learn from human feedback. Dubois et al. arXiv 2023. [Paper][Project]
  3. Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality. Chiang et al. 2023. [Paper][Project]
  4. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Zheng et al. arXiv 2023. [Paper][Project]
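
Automatic chatbot arenas replace costly human annotation with a strong judge model that compares two systems' answers to the same instruction and aggregates the verdicts into win rates (as in MT-Bench and AlpacaEval). The sketch below illustrates the idea with a generic judge prompt; the template wording and the tie-handling convention are assumptions, not the exact templates those projects use.

```python
# A minimal sketch of LLM-as-a-judge pairwise evaluation: build a comparison
# prompt for a judge model, collect its verdicts, and compute a win rate.
JUDGE_TEMPLATE = """[Instruction]
{instruction}

[Assistant A's answer]
{answer_a}

[Assistant B's answer]
{answer_b}

Compare the two answers for helpfulness, accuracy, and harmlessness.
Reply with exactly one of: "A", "B", or "tie"."""

def build_judge_prompt(instruction: str, answer_a: str, answer_b: str) -> str:
    return JUDGE_TEMPLATE.format(instruction=instruction,
                                 answer_a=answer_a, answer_b=answer_b)

def win_rate(verdicts: list) -> float:
    """Fraction of comparisons won by model A, counting ties as half a win."""
    scores = {"A": 1.0, "B": 0.0, "tie": 0.5}
    return sum(scores[v] for v in verdicts) / len(verdicts)

print(build_judge_prompt("Explain photosynthesis to a child.",
                         "Plants use sunlight to make food.", "It is a process."))
print(win_rate(["A", "A", "tie", "B"]))  # 0.625
```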

Human Preferences

Alignment Goal Representation

Human Demonstrations

  1. Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
  2. Learning to summarize with human feedback. Stiennon et al. Neurips 2020. [Paper][Project][Data]
  3. Recursively summarizing books with human feedback. Wu et al. arXiv 2021. [Paper][Data]
  4. Webgpt: Browser-assisted question-answering with human feedback. Nakano et al. arXiv 2021. [Paper][Data]
  5. OpenAssistant Conversations – Democratizing Large Language Model Alignment. Köpf et al. arXiv 2023. [Paper][Project][Data][Checkpoint]
  6. Reward design with language models. Kwon et al. arXiv 2023. [Paper]

Human Feedback

  1. Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
  2. Learning to summarize with human feedback. Stiennon et al. Neurips 2020. [Paper][Project][Data]
  3. Recursively summarizing books with human feedback. Wu et al. arXiv 2021. [Paper][Data]
  4. Webgpt: Browser-assisted question-answering with human feedback. Nakano et al. arXiv 2021. [Paper][Data]
  5. OpenAssistant Conversations – Democratizing Large Language Model Alignment. Köpf et al. arXiv 2023. [Paper][Project][Data][Checkpoint]
  6. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. Wu et al. arXiv 2023. [Paper][Project][Data]

Model Synthetic Feedback

  1. Reward design with language models. Kwon et al. arXiv 2023. [Paper]
  2. Aligning Large Language Models through Synthetic Feedback. Kim et al. arXiv 2023. [Paper]
  3. Training Socially Aligned Language Models in Simulated Human Society. Liu et al. arXiv 2023. [Paper][Project]
  4. Training Language Models with Language Feedback at Scale. Scheurer et al. arXiv 2023. [Paper][Data][Project]
  5. Visual Instruction Tuning with Polite Flamingo. Chen et al. arXiv 2023. [Paper][Project][Data]

Alignment Goal Evaluation

Benchmarks

  1. TruthfulQA: Measuring how models mimic human falsehoods. Lin et al. arXiv 2022. [Paper][Data]
  2. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. Mihaylov et al. EMNLP 2018. [Paper][Data]
  3. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. Nangia et al. arXiv 2020. [Paper][Data]
  4. Gender bias in coreference resolution. Rudinger et al. arXiv 2018. [Paper][Data]
  5. BBQ: A hand-built bias benchmark for question answering. Parrish et al. arXiv 2021. [Paper][Data]
  6. Bold: Dataset and metrics for measuring biases in open-ended language generation. Dhamala et al. FAccT 2021. [Paper][Data]
  7. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. Gehman et al. arXiv 2020. [Paper][Data]
  8. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. Hartvigsen et al. arXiv 2022. [Paper][Data]
  9. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. arXiv 2022. [Paper][Project]
  10. Holistic evaluation of language models. Liang et al. arXiv 2022. [Paper][Project]
  11. Discovering language model behaviors with model-written evaluations. Perez et al. arXiv 2022. [Paper][Project]
  12. Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity. Zhuo et al. arXiv 2023. [Paper]

Human Evaluation

  1. Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
  2. Llama 2: Open foundation and fine-tuned chat models. Touvron et al. arXiv 2023. [Paper]
  3. Rrhf: Rank responses to align language models with human feedback without tears. Yuan et al. arXiv 2023. [Paper][Data]
  4. Learning to summarize with human feedback. Stiennon et al. Neurips 2020. [Paper]
  5. Aligning Large Language Models through Synthetic Feedback. Kim et al. arXiv 2023. [Paper]

Reward Model

  1. Llama 2: Open foundation and fine-tuned chat models. Touvron et al. arXiv 2023. [Paper]
  2. Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Checkpoint]
  3. Direct preference optimization: Your language model is secretly a reward model. Rafailov et al. arXiv 2023. [Paper]
  4. Raft: Reward ranked finetuning for generative foundation model alignment. Dong et al. arXiv 2023. [Paper]
  5. Is reinforcement learning (not) for natural language processing?: Benchmarks, baselines, and building blocks for natural language policy optimization. Ramamurthy et al. arXiv 2022. [Paper][Project]

Human Values

Alignment Goal Representation

Value Principles

HHH (Helpful & Honest & Harmless)
  1. Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
  2. A general language assistant as a laboratory for alignment. Askell et al. arXiv 2021. [Paper]
  3. Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
  4. Improving alignment of dialogue agents via targeted human judgements. Glaese et al. arXiv 2022. [Paper][Data]
  5. Constitutional ai: Harmlessness from ai feedback. Bai et al. arXiv 2022. [Paper][Data]
  6. Principle-driven self-alignment of language models from scratch with minimal human supervision. Sun et al. arXiv 2023. [Paper][Project][Data]
  7. Process for adapting language models to society (palms) with values-targeted datasets. Solaiman et al. Neurips 2021. [Paper]
Social Norms & Ethics
  1. The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
  2. Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
  3. Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
  4. Aligning ai with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
  5. Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. Lourie et al. AAAI 2021. [Paper][Data]
  6. MoralDial: A Framework to Train and Evaluate Moral Dialogue Systems via Moral Discussions. Sun et al. ACL 2023. [Paper][Project]
  7. Learning norms from stories: A prior for value aligned agents. Nahian et al. AIES 2020. [Paper]

Target Representation

Desirable Behaviors
  1. Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
  2. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. Ganguli et al. arXiv 2022. [Paper][Data]
  3. Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
  4. Aligning ai with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
  5. Social bias frames: Reasoning about social and power implications of language. Sap et al. arXiv 2019. [Paper][Data]
Value Principles
  1. Improving alignment of dialogue agents via targeted human judgements. Glaese et al. arXiv 2022. [Paper][Data]
  2. Constitutional ai: Harmlessness from ai feedback. Bai et al. arXiv 2022. [Paper][Data]
  3. Principle-driven self-alignment of language models from scratch with minimal human supervision. Sun et al. arXiv 2023. [Paper][Data]
  4. Process for adapting language models to society (palms) with values-targeted datasets. Solaiman et al. Neurips 2021. [Paper]
  5. The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
  6. Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
  7. Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
  8. Can machines learn morality? the delphi experiment. Jiang et al. arXiv 2021. [Paper][Project]

Alignment Goal Evaluation

Benchmarks

Safety and Risk
  1. Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
  2. Safety Assessment of Chinese Large Language Models. Sun et al. arXiv 2023. [Paper][Data][Leaderboard]
  3. SafeText: A benchmark for exploring physical safety in language models. Levy et al. arXiv 2022. [Paper][Data]
  4. CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility. Xu et al. arXiv 2023. [Paper][Project][Data]
Social Norms
  1. The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
  2. Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
  3. Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
  4. Aligning ai with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
  5. Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. Lourie et al. AAAI 2021. [Paper][Data]
  6. Moral mimicry: Large language models produce moral rationalizations tailored to political identity. Simmons et al. arXiv 2022. [Paper]
  7. When to make exceptions: Exploring language models as accounts of human moral judgment. Jin et al. Neurips 2022. [Paper][Project][Data]
  8. Towards Answering Open-ended Ethical Quandary Questions. Bang et al. arXiv 2022. [Paper]

Value Classifier

  1. Learning norms from stories: A prior for value aligned agents. Nahian et al. AIES 2020. [Paper]
  2. Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper]
  3. Can machines learn morality? the delphi experiment. Jiang et al. arXiv 2021. [Paper][Project]

Basic Values

Alignment Goal Representation

Basic Value Theory
  1. An overview of the Schwartz theory of basic values. Schwartz et al. Online readings in Psychology and Culture 2012. [Paper]
  2. Rokeach value survey. Rokeach et al. The nature of human values. 1967. [Paper]
  3. Life values inventory: Facilitator’s guide. Brown et al. Williamsburg, VA 2002. [Paper]
  4. Moral foundations theory: The pragmatic validity of moral pluralism. Graham et al. Advances in experimental social psychology, 2013. [Paper]

Alignment Goal Evaluation

Value Surveys
  1. Towards Measuring the Representation of Subjective Global Opinions in Language Models. Durmus et al. arXiv 2023. [Paper][Data]
  2. Culture’s consequences: International differences in work-related values. Hofstede et al. 1984. [Paper]
  3. World Values Survey Wave 7 (2017-2022). [URL]
  4. European Values Study. [URL]
  5. Pew Research Center’s Global Attitudes Surveys (GAS). [URL]
  6. An overview of the Schwartz theory of basic values. Schwartz et al. Online readings in Psychology and Culture 2012. [Paper]
  7. Probing pre-trained language models for cross-cultural differences in values. Arora et al. arXiv 2022. [Paper]

Basic Value Classifier

  1. Valuenet: A new dataset for human value driven dialogue system. Qiu et al. AAAI 2022. [Paper][Project]
  2. Moral foundations twitter corpus: A collection of 35k tweets annotated for moral sentiment. Hoover et al. Social Psychological and Personality Science 2020. [Paper]
  3. Large pre-trained language models contain human-like biases of what is right and wrong to do. Schramowski et al. Nature Machine Intelligence 2022. [Paper]