02-17 Direct Preference Optimization: Your Language Model Is Secretly a Reward Model — Technical Review