James B Maxwell
Jan 20, 2024

--

The human intervention is still there; it just happens in a single step before training (i.e., "define a preference dataset") rather than during an "online" RLHF process. Since the preference dataset is already part of standard RLHF, DPO entirely removes a step that is both expensive and time-consuming. At least that's my understanding of it on a quick read (happy to be corrected).
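
To make the "single step before training" point concrete, here is a minimal sketch of the per-pair DPO loss as I understand it. The function name `dpo_loss`, the `beta` default, and the log-probability values are all illustrative assumptions, not taken from any particular library. In practice the log-probabilities would come from the policy being trained and a frozen reference model, each scored on the fixed preference dataset.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response
    under either the policy or the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# A fixed, offline preference pair: the human labels were collected once,
# before training. The log-probabilities below are made-up numbers.
pair = {
    "policy_chosen_logp": -12.0,
    "policy_rejected_logp": -15.0,
    "ref_chosen_logp": -13.0,
    "ref_rejected_logp": -14.0,
}
loss = dpo_loss(**pair)
print(f"DPO loss for this pair: {loss:.4f}")
```

Nothing in this loop samples from the model or queries a human or a reward model during training, which is the sense in which the "online" part goes away.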


Written by James B Maxwell

Composer, musician, programmer, technologist, PhD
