Direct Preference Optimization (DPO)
Rafailov, Sharma, Mitchell, Ermon, Manning, Finn · 2023 · DOI 10.48550/arXiv.2305.18290
Summary
DPO aligns language models with human preferences without training an explicit reward model: it derives a closed-form objective from RLHF's KL-constrained reward-maximization problem, so the policy can be optimized directly on preference pairs.
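As a rough illustration of that objective: for a prompt x with a preferred response y_w and a rejected response y_l, the per-pair loss in the paper is -log σ(β[(log π_θ(y_w|x) - log π_ref(y_w|x)) - (log π_θ(y_l|x) - log π_ref(y_l|x))]). The Python sketch below evaluates it for a single pair from precomputed sequence log-probabilities. The function name, argument names, and the default beta of 0.1 are illustrative assumptions, not taken from the explainer's code repository.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    All names and the default beta are hypothetical, chosen for illustration.
    """
    # Log-ratio of the policy to the frozen reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin by which the policy favors the chosen response
    # more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a
    # numerically stable way for both signs of the margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference has zero margin, so the loss is log 2.
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Raising the chosen response's log-probability relative to the reference
# (and lowering the rejected one's) pushes the loss below that baseline.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

No reward model appears anywhere in the computation: the only inputs are log-probabilities from the policy being trained and from a frozen reference model, which is the point the summary makes. In practice the loss would be averaged over a batch of preference pairs and differentiated with respect to the policy's parameters.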