Video · 37 min · Premium

Direct Preference Optimization (DPO)

Rafailov, Sharma, Mitchell, Ermon, Manning, Finn · 2023 · DOI 10.48550/arXiv.2305.18290

SUMMARY

DPO aligns language models with human preferences without training an explicit reward model. It derives a closed-form training objective from RLHF's KL-constrained reward-maximization problem.
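To make the objective concrete, here is a minimal sketch of the per-example DPO loss from the paper, -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]), where y_w is the preferred response and y_l the rejected one. The function and argument names are illustrative assumptions, not taken from the explainer's code repository. The inputs are assumed to be summed sequence log-probabilities computed elsewhere, and beta=0.1 is only a placeholder value.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed illustrative value, not a recommendation
) -> float:
    """Per-example DPO loss for one (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model (hypothetical names).
    """
    # Implicit rewards are beta-scaled log-ratios of policy to reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Binary cross-entropy on the preference: -log sigmoid(margin).
    return -log_sigmoid(margin)


# When the policy equals the reference, the margin is zero and the loss is log 2.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen response's log-probability lowers the loss.
improved = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

No reward model appears anywhere in the computation: the preference signal acts directly on the policy's log-probabilities, with the reference model supplying the KL anchor.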
