Jan 20, 2024
The human intervention is still there; it just happens in a single step before training (i.e., "define a preference dataset") rather than during an "online" RLHF process. And since the preference dataset is already part of standard RLHF, DPO entirely removes a step that is both expensive and time-consuming. At least that's my understanding of it on a quick read (happy to be corrected).
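For concreteness, the core of DPO (as I read Rafailov et al., 2023) is just a binary-classification-style loss over those static preference pairs, with no reward model and no sampling loop. A minimal PyTorch sketch, where the variable names and the `beta` default are mine for illustration:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss, given summed per-sequence log-probs of the chosen and
    rejected responses under the policy and a frozen reference model."""
    # Implicit "reward" of each response: beta-scaled log-ratio vs. the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected -- plain supervised
    # training on the offline preference dataset, no online RL step.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

So the preference data is consumed directly by a standard gradient-descent loop, which is where the cost savings over online RLHF come from.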