Category: RL Training | Zhongzhu's Blog

RL Training Category

2026

04-14

Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts — Deep Technical Review

04-14

ArmoRM: Interpretable Preference Learning with "Multi-Objective Reward Modeling + Mixture-of-Experts Gating" (In-Depth Technical Review)

04-07

ORPO: Monolithic Preference Optimization without Reference Model — In-Depth Technical Review

03-10

InstructGPT: The RLHF Recipe That Turned GPT-3 Into a Helpful Assistant