InstructGPT: Training language models to follow instructions with human feedback

February 2023

tl;dr: Align LLMs with RLHF as a post-training alignment step, after large-scale pre-training.

Overall impression

This paper proposes InstructGPT, the backbone model behind ChatGPT. It provides a practical recipe for building products on top of generative models.

The paper shows that with proper finetuning on extra demonstration data, followed by reinforcement learning from human feedback, InstructGPT generates outputs that human evaluators strongly prefer over those of GPT-3, despite having 100x fewer parameters. Finetuning with human feedback is a promising direction for aligning language models with human intent.
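At the core of the RLHF step is a reward model trained on human preference comparisons: given two completions for the same prompt, the model should assign a higher score to the one labelers preferred. A minimal sketch of that pairwise ranking loss, using toy scalar rewards (the values and function name are illustrative, not from the paper):

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    # Pairwise ranking loss used to train the reward model:
    # loss = -log sigmoid(r_chosen - r_rejected).
    # The loss is small when the human-preferred completion
    # already scores higher, and large when it does not.
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# Toy rewards: preference respected vs. violated.
good = reward_model_loss(2.0, 0.5)   # chosen completion scores higher
bad = reward_model_loss(0.5, 2.0)    # chosen completion scores lower
print(good, bad)
```

The trained reward model then supplies the scalar reward signal that the PPO stage optimizes against.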

Misalignment is a serious issue for LLMs, so much so that OpenAI has a dedicated team to tackle it; this work comes from the OpenAI Alignment Team. We want LLMs to be “helpful, honest and harmless”. InstructGPT is trained to maximize helpfulness, but is evaluated for honesty/truthfulness and harmlessness as well.

The paper reads more like an experiment report than a work of scientific novelty, yet it meticulously documents the details of this ground-breaking engineering feat.

Key ideas

Technical details