In Chapter 9, we studied RLHF. We started from a simple asymmetry: for most tasks worth doing, good behavior is hard to write down but easy to recognize. So instead of writing rules, we collect human judgments and learn from them.
We then assembled the pipeline in stages. Supervised fine-tuning gave us a starting policy and a frozen reference model. The reward mod...