Training

How reinforcement learning transforms raw language models into useful assistants: from PPO’s four-model pipeline, to DPO’s elegant shortcut, to GRPO’s reasoning revolution, with the math that makes each one work.