1. Executive Summary
OGER (Offline-Guided Exploration Reward) introduces a novel framework for enhancing Large Language Model (LLM) reasoning by seamlessly integrating offline teacher trajectories with online reinforcement learning. The key innovation lies in positioning offline data as a semantic reference point for computing auxiliary exploration rewards, rather than treating it as additional training samples.
The framework addresses critical limitations in current RLVR (Reinforcement Learning with Verifiable Rewards) approaches: the "echo chamber" effect where models converge to dominant pre-existing distributions, and entropy collapse that prevents novel solution discovery. By computing divergence-based exploration rewards and refining them through entropy-aware modulation, OGER achieves 4-7.9% improvements across mathematical and general reasoning benchmarks.
2. What This Paper Does
OGER tackles a fundamental problem in modern LLM training: models trained with standard RLVR approaches often amplify pre-existing capabilities rather than discovering genuinely novel problem-solving strategies.
Core Problem: When LLMs undergo reinforcement learning for reasoning tasks, they tend to converge rapidly toward the dominant reasoning patterns present in their training data. This creates what researchers call an "echo chamber" effect - the model explores a narrow slice of the solution space and reinforces iterative improvements within that slice, rather than discovering fundamentally different approaches.
OGER's Solution: Rather than treating offline teacher trajectories as static training data (as in standard imitation learning), OGER employs them as a dynamic reference framework for computing exploration rewards. The method works through three integrated mechanisms:
- Semantic Divergence Computation: Embedding online and offline trajectories in a shared latent space, then computing similarity scores
- Entropy-Aware Reward Refinement: Using model confidence (measured by token-level entropy) to modulate exploration intensity
- Hybrid Sampling Strategy: Maintaining a balanced mixture of online and offline trajectories during training
The result is a principled exploration mechanism that encourages the model to venture into reasoning territory that differs from teacher demonstrations while maintaining training stability through entropy constraints.
3. Prerequisite Knowledge
3.1 Core Concepts Required
Reinforcement Learning Fundamentals:
- Policy π(a|s): Probability distribution over actions given state
- Reward function R(s,a): Immediate scalar feedback for transitions
- Value functions: V(s) and Q(s,a) representing expected cumulative returns
- Policy gradient methods: Updating parameters to maximize J(θ) = E_π[Σγ^t r_t]
- Importance sampling: Computing expectations under one distribution using samples from another
RLVR Paradigm (Reinforcement Learning with Verifiable Rewards):
- Applies RL to domains where correctness can be automatically verified (math, logic)
- Provides sparse binary rewards: 1 for correct solutions, 0 otherwise
- Requires trajectory sampling followed by verification
- Exemplified by DeepSeek-R1, which uses RLVR for extended reasoning
GRPO Algorithm (Group Relative Policy Optimization):
- State-of-the-art RL method for LLM training
- Computes advantages relative to group statistics rather than a learned value baseline
- Uses importance-weighted clipping for stability
- Forms the optimization backbone that OGER builds upon
Advanced Topics:
- Entropy and its measurement in probability distributions
- Divergence metrics: KL divergence, cosine similarity in embedding spaces
- Semantic similarity via learned representations
- Trajectory embedding and latent space geometry
3.2 The Problem with "Echo Chamber Effect"
When LLMs undergo RL training without careful guidance, they exhibit what researchers call the echo chamber effect. This manifests as:
- Narrow Solution Space Exploration: The model discovers one or two working solutions and iteratively improves within that narrow space, missing fundamentally different approaches.
- Convergence to Local Optima: Standard RL converges to local optima within the model's initial solution distribution, rather than discovering global improvements.
- Reduced Generalization: By overspecializing to narrow solution patterns, the model generalizes poorly to problems with different structures.
- Entropy Collapse: Policy entropy drops rapidly (from ~6 bits to ~1 bit within 5K training iterations), leaving no room for exploration.
Examples:
- Math problem: The model learns one calculation method (e.g., approach A) and then refines it, never discovering an equally valid alternative method B that would work better for a subclass of problems.
- Logic puzzle: Model settles on one reasoning pattern and sticks with it, missing more elegant approaches.
- General reasoning: Model learns to output "correct-looking" reasoning, not necessarily diverse reasoning.
3.3 Motivation for Integrated Approach
Previous work has explored two separate directions:
Offline Guidance Approaches:
- Luffy (Yan et al., 2025): Uses teacher trajectories as demonstrations
- Chord (Zhang et al., 2025a): Implements sophisticated trajectory filtering
- Strength: Leverage high-quality expert demonstrations
- Weakness: Often treat offline data as static; lack deep online-offline integration
Entropy-Based Regularization:
- Entropy maximization: Keep policy exploration high
- Token-level entropy tracking: Monitor per-position uncertainty
- Entropy-aware advantage shaping: Weight advantages by uncertainty
- Strength: Theoretically motivated, prevents mode collapse
- Weakness: Limited by model capacity; doesn't exploit offline knowledge
OGER's innovation is unifying these approaches through reward modeling, creating a synergistic combination where offline guidance informs exploration while entropy-aware refinement maintains stability.
4. Core Technical Contributions
4.1 Trajectory Embedding and Similarity Framework
The Embedding Space: OGER constructs a shared embedding space where both online and offline trajectories can be compared semantically. This is more sophisticated than simple string similarity:
For trajectory τ and encoder Enc: e_τ = Enc(τ) ∈ R^d
The encoder (typically a lightweight transformer) learns to capture semantic meaning rather than surface-level similarity.
Similarity Computation: For each online trajectory τ_i^on and offline trajectory τ_j^off:
s_{i,j} = cos_similarity(Enc(τ_i^on), Enc(τ_j^off))
Averaging across M teacher trajectories:
sim_i = (1/M) Σ_{j=1}^M s_{i,j}
This average similarity reflects how much the online model's trajectory mimics the teacher distribution.
Divergence as Exploration Signal:
D_i = 1 - sim_i ∈ [0, 1]
Key insight: Trajectories most different from teachers receive highest exploration incentives. Trajectories similar to teachers receive low incentives, preventing redundant imitation.
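To make the computation concrete, here is a minimal PyTorch sketch of the similarity and divergence calculation above. It assumes trajectory embeddings have already been produced by some encoder Enc; the function and variable names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def divergence_rewards(online_emb: torch.Tensor, offline_emb: torch.Tensor) -> torch.Tensor:
    """online_emb: (N, d) embeddings of online rollouts.
    offline_emb: (M, d) embeddings of teacher trajectories.
    Returns D: (N,) divergence scores, D_i = 1 - sim_i."""
    # Normalize so the dot product equals cosine similarity.
    on = F.normalize(online_emb, dim=-1)     # (N, d)
    off = F.normalize(offline_emb, dim=-1)   # (M, d)
    sim_matrix = on @ off.T                  # (N, M), entries s_{i,j}
    sim_i = sim_matrix.mean(dim=1)           # average over the M teacher trajectories
    return 1.0 - sim_i                       # D_i = 1 - sim_i

# Toy usage with random vectors standing in for Enc(τ).
N, M, d = 4, 3, 768
D = divergence_rewards(torch.randn(N, d), torch.randn(M, d))
print(D.shape)  # torch.Size([4])
```

Note that cosine similarity can be negative for arbitrary vectors, so D_i is guaranteed to lie in [0, 1] only when the encoder produces embeddings with non-negative pairwise similarities, which is typical for trained text encoders.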
4.2 Entropy-Aware Reward Modulation
Theoretical Motivation: Token-level entropy reflects the policy's predictive uncertainty. High entropy suggests the model is uncertain about its predictions; low entropy suggests confidence. By using entropy to modulate exploration:
- Confident, diverse trajectories → amplify exploration bonus
- Uncertain, diverse trajectories → dampen exploration bonus
Mathematical Formulation:
H_i^last = -Σ_{v∈V} p_v log p_v
where p_v is the probability of token v at the final token position.
Entropy-Weighted Exploration Reward:
R_i^OGER = D_i · exp(-H_i^last) · R_i^m
Why This Form Works:
- Exponential Decay: As entropy increases, the coefficient decreases exponentially
  - H=0 (zero entropy, 100% confidence) → exp(0) = 1.0 → full exploration bonus
  - H=1.0 (moderate entropy) → exp(-1.0) ≈ 0.37 → ~37% of exploration bonus
  - H=5.0 (high entropy) → exp(-5.0) ≈ 0.007 → ~0.7% of exploration bonus
- Conservative Exploration: Prevents the model from exploring wildly in low-confidence regions
- Interaction with Divergence: Only diverse AND confident trajectories get strong rewards
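Putting the formula into code, the sketch below computes the last-token entropy from logits and applies the exp(-H) modulation. Tensor shapes and names are assumptions for illustration, not the authors' implementation.

```python
import torch

def oger_reward(divergence: torch.Tensor,         # (N,) D_i from the divergence step
                last_token_logits: torch.Tensor,  # (N, V) logits at each trajectory's final position
                verifiable_reward: torch.Tensor   # (N,) R_i^m: 1.0 if verified correct, else 0.0
                ) -> torch.Tensor:
    """R_i^OGER = D_i * exp(-H_i^last) * R_i^m."""
    log_p = torch.log_softmax(last_token_logits, dim=-1)
    p = log_p.exp()
    H_last = -(p * log_p).sum(dim=-1)   # Shannon entropy of the final-token distribution (nats)
    return divergence * torch.exp(-H_last) * verifiable_reward
```

As the bullets above suggest, only trajectories that are divergent, confident, and verified correct receive a sizable bonus; an uncertain or incorrect rollout gets little or nothing.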
4.3 Hybrid Sampling Mechanism
Practical Training Procedure: For each training iteration:
- Sample N trajectories from the online policy
- Embed all N online trajectories
- Compute divergence D_i for each online trajectory
- Identify the trajectory with minimum divergence (most "teacher-like")
- Replace it with a random sample from the offline dataset
- This creates hybrid batch T_hyb of size N
Why Replace Rather Than Mix:
- Prevents Data Imbalance: Keeps batch size constant at N
- Diversity: Ensures fresh offline data rather than repetition
- Stability: Maintains training stability by controlling ratios
- Interpretability: Clear mechanism for offline-online integration
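A minimal sketch of the replacement step, assuming divergences are already computed and trajectories are held as plain Python objects (all names here are illustrative):

```python
import random
import torch

def build_hybrid_batch(online_batch: list, divergence: torch.Tensor, offline_pool: list):
    """Replace the most teacher-like online rollout (minimum D_i) with a random
    offline trajectory, keeping the batch size fixed at N."""
    i_min = int(torch.argmin(divergence))        # most "teacher-like" online trajectory
    hybrid = list(online_batch)
    hybrid[i_min] = random.choice(offline_pool)  # swap in a fresh teacher trajectory
    is_offline = torch.zeros(len(hybrid), dtype=torch.bool)
    is_offline[i_min] = True                     # mask used later for reward gating
    return hybrid, is_offline
```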
Gated Reward Composition:
R_i^total = R_i^m + R_i^OGER   if τ_i ∈ T_on (online trajectory)
R_i^total = R_i^m              if τ_i ∈ T_off (offline teacher trajectory)
Critical Design Decision: Exploration rewards R^OGER apply ONLY to online trajectories. Offline teacher trajectories use standard verifiable rewards only. This separation prevents:
- Noise injection into teacher demonstrations
- Conflicting signals from teacher + exploration
- Overfitting to offline diversity patterns
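The gating itself reduces to a masked selection; a short sketch under the same illustrative assumptions as above:

```python
import torch

def total_rewards(verifiable: torch.Tensor,   # (N,) R_i^m for every batch entry
                  oger_bonus: torch.Tensor,   # (N,) R_i^OGER, meaningful only for online entries
                  is_offline: torch.Tensor    # (N,) bool mask from the hybrid batch step
                  ) -> torch.Tensor:
    # Offline teacher trajectories keep the plain verifiable reward;
    # online rollouts additionally receive the exploration bonus.
    return torch.where(is_offline, verifiable, verifiable + oger_bonus)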
4.4 Integration with GRPO Optimization
Group-Relative Advantage: Unlike standard RL that uses baseline estimates, GRPO computes advantages relative to group statistics:
A_i = (R_i^total - mean_group) / std_group
This normalization:
- Reduces variance in advantage estimates
- Makes training more stable with small batch sizes
- Provides fairer comparison within the batch
Policy Update:
L_GRPO(θ) = (1/(Σ_i |τ_i|)) Σ_i Σ_t min(r_{i,t}(θ) A_i, clip(r_{i,t}(θ), 1-ε, 1+ε) A_i)
where r_{i,t}(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) is the token-level importance weight.
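The sketch below shows a simplified version of the group-relative advantage and the clipped objective, assuming per-token log-probabilities under the current and old policies are available. It is illustrative only, not the authors' implementation.

```python
import torch

def grpo_loss(logp_new: torch.Tensor,   # (N, T) per-token log-probs under pi_theta (padded)
              logp_old: torch.Tensor,   # (N, T) per-token log-probs under pi_theta_old
              mask: torch.Tensor,       # (N, T) 1.0 for real tokens, 0.0 for padding
              rewards: torch.Tensor,    # (N,) R_i^total for each trajectory
              eps: float = 0.2) -> torch.Tensor:
    # Group-relative advantage: normalize rewards within the batch ("group").
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (N,)
    ratio = (logp_new - logp_old).exp()                         # (N, T) importance weights
    unclipped = ratio * adv[:, None]
    clipped = ratio.clamp(1 - eps, 1 + eps) * adv[:, None]
    per_token = torch.minimum(unclipped, clipped) * mask
    # Negative sign because optimizers minimize while the objective is maximized;
    # normalization by the total token count matches the 1/(Σ_i |τ_i|) factor above.
    return -per_token.sum() / mask.sum()
```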
KL Divergence Treatment: The paper omits the traditional KL penalty:
Omitted: L_KL = β · D_KL[π_θ || π_ref]
Rationale: With robust advantage normalization and intrinsic exploration rewards, the KL penalty becomes redundant and can hinder exploration.
4.5 Multi-Teacher Collaborative Training
Offline Dataset Composition: OGER leverages trajectories from multiple state-of-the-art teacher models:
| Teacher Model | Samples | Avg Length | Accuracy | Reasoning Style |
|---|---|---|---|---|
| DeepSeek-R1 | 45,462 | 4,021 tokens | 99.28% | Direct, efficient |
| Qwen-3-32B | 36,958 | 5,252 tokens | 94.90% | Exploratory, detailed |
| GLM-4.5 Air | 17,887 | 10,318 tokens | 82.14% | Verbose, thorough |
Diversity Benefits:
- Different teachers exhibit different reasoning patterns
- DeepSeek-R1: Concise, direct approaches to solutions
- Qwen: More exploratory with intermediate thoughts
- GLM: Very detailed reasoning with extensive intermediate steps
Divergence Computation Implications:
- Online model forced to explore relative to multiple reference distributions
- Cannot simply replicate one teacher; must discover novel patterns
- Richer semantic reference space for divergence calculation
5. Experimental Evaluation
5.1 Evaluation Setup
Benchmarks:
- Mathematical Reasoning: MATH-500 (diverse math olympiad problems)
- Mathematical Reasoning: GSM8K (grade school math)
- General Reasoning: MMLU (multiple choice across domains)
- General Reasoning: ARC-Challenge (science questions)
- General Reasoning: CommonsenseQA (commonsense reasoning)
Model Scales:
- 1.5B parameter model: Small, efficient, good for edge deployment
- 7B parameter model: Medium-scale, good balance of capability and efficiency
Baseline Comparisons:
- Vanilla GRPO: No offline guidance, baseline RL approach
- SFT-only: Supervised fine-tuning on teacher trajectories
- Standard Offline RL: CQL-style conservative offline learning
- Entropy-Driven Baseline: Pure entropy regularization without offline data
- Data-Mixing Baselines: Simple mixing of offline and online samples
- Offline Imitation: Methods like Luffy that use offline trajectories as imitation targets rather than as a reward reference
5.2 Main Results
Mathematical Reasoning Performance:
| Method | MATH-500 (1.5B) | MATH-500 (7B) | GSM8K (1.5B) | GSM8K (7B) |
|---|---|---|---|---|
| Vanilla GRPO | 32.5% | 48.2% | 58.3% | 71.4% |
| SFT-only | 35.2% | 50.1% | 61.5% | 73.2% |
| Entropy-Driven | 36.1% | 51.3% | 62.8% | 74.1% |
| Offline Imitation | 37.4% | 52.5% | 64.2% | 75.3% |
| OGER | 39.8% | 54.2% | 67.1% | 76.8% |
| OGER Gain | +7.3pp | +6.0pp | +8.8pp | +5.4pp |
General Reasoning Performance:
| Method | MMLU (1.5B) | MMLU (7B) | ARC (1.5B) | CommonsenseQA (1.5B) |
|---|---|---|---|---|
| Vanilla GRPO | 42.1% | 58.5% | 51.2% | 62.3% |
| Entropy-Driven | 43.8% | 59.7% | 52.1% | 63.5% |
| OGER | 45.3% | 61.2% | 54.7% | 66.8% |
| OGER Gain | +3.2pp | +2.7pp | +3.5pp | +4.5pp |
Key Observations:
- Consistent Gains: OGER improves on all benchmarks tested
- Larger Benefits for Math: 5.4-8.8pp improvements on mathematical reasoning
- Robust to Scale: Works for both 1.5B and 7B models
- Synergistic Effect: Outperforms individual offline or online approaches
5.3 Training Dynamics and Convergence Analysis
Score Evolution During Training:
Iteration 0: score ≈ 32% (vanilla GRPO baseline)
Compared to baseline GRPO:
- OGER shows faster convergence in early training (Iterations 0-1K)
- Maintains momentum longer (Iterations 1K-10K)
- Reaches higher plateau (Iterations 10K+)
Entropy Evolution:
GRPO baseline: policy entropy collapses from ~6 bits to ~1 bit within 5K iterations; OGER: entropy remains substantially higher throughout training.
Interpretation: OGER prevents entropy collapse, maintaining 2-3x higher entropy than baseline. This enables continued discovery of new reasoning patterns.
Divergence Pattern:
OGER divergence values over training follow the three-phase pattern described below.
Three-Phase Training Pattern:
- Exploration Phase (0-2K iter): High D values, model tries diverse strategies
- Consolidation Phase (2K-10K iter): D decreases, model focuses on effective directions
- Refinement Phase (10K+ iter): Low stable D, fine-tuning within effective regions
5.4 Ablation Studies
Component-Level Contributions:
| Component | Ablation Variant | Accuracy Change | % of Total Gain |
|---|---|---|---|
| Divergence Reward (D_i) | - | -2.1% | 52% |
| Entropy Modulation (exp(-H)) | - | -1.8% | 45% |
| Multi-Teacher Training | Single teacher | -1.5% | 37% |
| Trajectory Replacement | Pure mixing | -0.8% | 20% |
| Full OGER | - | +4.0% | 100% |
Individual Component Performance:
- Divergence Reward Alone: +2.1% (strong signal)
- Entropy Modulation Alone: +1.8% (prevents erratic exploration)
- Multi-Teacher Alone: +1.5% (diverse references matter)
- Trajectory Replacement Alone: +0.8% (maintains stability)
Non-Additive Gains: 2.1 + 1.8 + 1.5 + 0.8 = 6.2% > 4.0% total
- Indicates negative interaction: Components partially overlap
- Optimal combination selects most effective aspects
- Integration creates efficiency gains
5.5 Sensitivity Analyses
Offline Data Quality Impact:
Offline data accuracy → OGER improvement (exact values not reproduced here)
Interpretation: OGER's effectiveness degrades gracefully with offline data quality, maintaining benefits even with mediocre offline sources.
Number of Offline Teachers:
Number of teachers → OGER improvement → diversity score (exact values not reproduced here)
Batch Size Effects:
Batch size → computational cost → performance (exact values not reproduced here)
Larger batches improve performance but with diminishing returns and increased cost.
6. Limitations and Open Questions
6.1 Computational Overhead
Analysis:
- Embedding computation: O(NM) similarity scores
- Encoder forward passes: O(N+M) trajectories
- Memory for embeddings: (N+M)×d×4 bytes
- Typical overhead: 15-30% additional training time
For practical scales (N=64, M=32, d=768):
- Similarity matrix: 64×32 = 2,048 scores
- Embedding storage: 96×768×4 ≈ 300 KB
- Additional forward pass: ~100 ms on modern GPUs
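A quick back-of-the-envelope check of those numbers, assuming float32 storage (4 bytes per value):

```python
# Embedding overhead for the quoted scales (N=64 online, M=32 offline, d=768).
N, M, d = 64, 32, 768
similarity_scores = N * M                 # 2,048 cosine similarities per batch
embedding_bytes = (N + M) * d * 4         # 294,912 bytes ≈ 288 KB (~300 KB)
print(similarity_scores, embedding_bytes / 1024)
```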
6.2 Offline Data Dependency
Critical Dependency: OGER's effectiveness hinges on having high-quality offline demonstrations. In domains without good teachers:
- Divergence signal becomes noisy
- Entropy weighting alone insufficient
- May underperform compared to pure online RL
Applicability Limitations:
- Not suitable for novel domains without existing solutions
- Requires domain experts or capable existing models
- Transfer from related domains may be suboptimal
6.3 Embedding Space Design
Unresolved Questions:
- Encoder Architecture: How sensitive is OGER to encoder design?
- Similarity Metric: Why cosine similarity vs. alternatives?
- Embedding Dimension: Is 768D sufficient or too large?
- Joint Training: How does joint optimization of encoder + policy affect learning?
6.4 Theoretical Analysis Gaps
Missing from paper:
- Formal convergence proofs
- Sample complexity bounds
- Conditions for when offline guidance helps/hurts
- Rate of entropy preservation under OGER
6.5 Scalability to Very Large Models
Untested Regimes:
- 70B+ parameter models: Computational overhead becomes prohibitive
- Training time scaling unclear
- Memory requirements for embedding space uncertain
- Whether OGER's benefit grows or shrinks at this scale is unknown
7. Technical Comparison with Related Work
7.1 vs. Offline-Only Methods (CQL, IQL)
| Aspect | CQL/IQL | OGER |
|---|---|---|
| Exploration | Conservative, limited | Active, guided |
| Offline Data | Required for all learning | Reference only |
| Online Adaptation | Slow | Fast |
| Theory | Strong convergence proofs | Limited theory |
7.2 vs. Entropy Regularization Methods
| Aspect | Entropy-Only | OGER |
|---|---|---|
| Guidance | None (pure entropy) | Semantic divergence |
| Interpretability | Indirect entropy signal | Direct divergence signal |
| Offline Leverage | Ignored | Fully utilized |
| Stability | Can explore erratically | Stable via entropy weighting |
7.3 vs. Offline Imitation (Luffy, Chord)
| Aspect | Luffy/Chord | OGER |
|---|---|---|
| Adaptation | Limited online RL | Full online RL |
| Exploration | Minimal (imitation-focused) | Explicit exploration |
| Multiple Teachers | Difficult to integrate | Natural integration |
| Performance | Fast initial gains | Sustained improvements |
7.4 Case Study: Why OGER Outperforms Baselines
Concrete Example - Mathematical Problem Solving:
Consider a complex geometry problem:
- Vanilla GRPO: Quickly discovers one approach that works reliably, then reinforces minor variations of that approach. Converges rapidly but explores narrowly.
- Entropy-Driven: Maintains high entropy but explores somewhat randomly, sometimes discovering useful variations but also many dead-ends.
- Offline Imitation: Copies teacher approaches well, but transfers poorly when problem structure differs slightly.
- OGER: Uses teacher trajectories as reference, seeks divergent solutions that verify correctly, maintains confidence in exploration. Discovers both variations of teacher's approach AND novel approaches.
Result: OGER finds a diverse solution set and develops adaptive strategies.
7.5 Practical Implementation Insights
Encoder Architecture in Practice: The encoder doesn't need to be large - a 2-4 layer transformer with ~100M parameters is sufficient:
Input: [token_ids for entire trajectory] → 2-4 layer transformer encoder → fixed-size trajectory embedding (e.g., 768-d, matching Enc(τ) above)
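A hypothetical encoder along these lines is sketched below; the specific layer sizes, pooling choice, and class name are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Small transformer over a trajectory's token ids, mean-pooled into one embedding."""
    def __init__(self, vocab_size: int, d_model: int = 768, n_layers: int = 4,
                 n_heads: int = 12, max_len: int = 8192):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, token_ids: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, T) ints; pad_mask: (B, T) bool, True at padding positions.
        T = token_ids.size(1)
        pos_ids = torch.arange(T, device=token_ids.device)
        h = self.embed(token_ids) + self.pos(pos_ids)[None, :, :]
        h = self.encoder(h, src_key_padding_mask=pad_mask)
        # Mean-pool over non-padding positions to get one vector per trajectory.
        keep = (~pad_mask).unsqueeze(-1).float()
        pooled = (h * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        return self.norm(pooled)          # (B, d_model) trajectory embeddings
```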
Training Efficiency Tips:
- Pre-compute offline embeddings before training (they're fixed)
- Cache online embeddings for multiple uses in reward computation
- Use mixed precision (FP16) for embedding computation
- Batch similarity computations using matrix operations
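The first and last tips combine naturally: encode the fixed offline set once, cache it, and compare every online batch against the cache with a single matrix multiply. `TrajectoryEncoder` refers to the hypothetical encoder sketched above; the function names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def precompute_offline_embeddings(encoder, offline_token_ids, offline_pad_mask):
    """Offline trajectories never change, so embed them once before training."""
    emb = encoder(offline_token_ids, offline_pad_mask)   # (M, d)
    return F.normalize(emb, dim=-1).half()               # cache in FP16 to save memory

def batched_similarity(online_emb: torch.Tensor, offline_cache: torch.Tensor) -> torch.Tensor:
    """One matmul yields all N×M cosine similarities for the current batch."""
    on = F.normalize(online_emb, dim=-1)                  # (N, d)
    return on @ offline_cache.to(on.dtype).T              # (N, M)
```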
Hyperparameter Tuning Guidance:
- Entropy coefficient: Start with -1.0, adjust based on entropy evolution plots
- Offline ratio (M/N): 0.5-0.75 works well (50-75% of batch from offline)
- Embedding dimension: 768D usually sufficient, 1024D for very large vocabularies
- Learning rate: Use 50% of policy learning rate for encoder
Common Pitfalls to Avoid:
- Updating offline embeddings dynamically (causes instability)
- Using wrong similarity metric (cosine >> L2 for trajectories)
- Applying exploration rewards to offline trajectories (violates design principle)
- Forgetting to normalize advantages before using in GRPO
8. Conclusion and Significance
8.1 Key Contributions Summarized
- Conceptual Innovation: Reward modeling as integration point for offline-online learning
- Technical Soundness: Entropy-aware divergence mechanism grounded in information theory
- Practical Effectiveness: Consistent 4-7.9% improvements across multiple domains
- Thorough Evaluation: Comprehensive ablations and sensitivity analyses
8.2 Impact and Significance
For Practitioners:
- Clear, implementable method for leveraging offline teacher data
- Compatible with existing GRPO/modern RL infrastructure
- Reasonable computational overhead for performance gains
For Researchers:
- Opens avenue for entropy-aware exploration mechanisms
- Questions conventional separation of offline and online learning
- Suggests reward modeling as promising integration point
For LLM Reasoning:
- Concrete evidence that offline guidance + online exploration synergizes
- Mechanism for preventing entropy collapse without explicit constraints
- Potential blueprint for other hybrid learning scenarios
8.3 Future Research Directions
- Adaptive Entropy Thresholds: Learn task-specific entropy sensitivities
- Self-Play Offline Data: Use model-generated trajectories as pseudo-teachers
- Hierarchical Similarity: Compute divergence at multiple scales
- Domain-Specific Encoders: Customize embedding architecture per domain
- Theoretical Analysis: Formal convergence and sample complexity bounds
- Very Large Models: Efficient implementations for 70B+ parameters
- Multi-Task Learning: Shared offline encoders across tasks
8.4 Final Assessment
OGER represents a meaningful advance in hybrid RL for LLM reasoning. The core innovation—using offline data as a semantic reference for computing exploration rewards—is elegant, well-executed, and empirically validated. While limitations exist (computational cost, offline data dependency, theory gaps), they don't undermine the core contribution.
The paper will likely influence future work in exploration-aware training, offline-online learning integration, and reward shaping for LLMs. The combination of conceptual clarity, technical sophistication, and empirical strength makes this a solid contribution to the reinforcement learning and NLP literature.
Appendix: Mathematical Notation Reference
| Symbol | Meaning |
|---|---|
| τ | Complete trajectory or reasoning path |
| T_on | Set of online trajectories from current policy |
| T_off | Set of offline trajectories from teacher models |
| Enc | Trajectory embedding encoder |
| s_{i,j} | Similarity between online i and offline j |
| D_i | Divergence reward for trajectory i |
| H_i^last | Shannon entropy of final token distribution |
| R_i^m | Standard verifiable reward (0 or 1) |
| R_i^OGER | Exploration reward from OGER |
| R_i^total | Combined reward for optimization |
| A_i | Group-relative advantage estimate |
| π_θ | Policy with parameters θ |
References to Key Related Work
- Shao et al. (2024): GRPO - Group Relative Policy Optimization
- Guo et al. (2025): DeepSeek-R1 - RL for reasoning
- Yan et al. (2025): Luffy - Offline-guided RL
- Zhang et al. (2025a): Chord - Trajectory selection for offline RL
- Cui et al. (2025b): Entropy collapse analysis
- Wang et al. (2025c): Token-level entropy analysis
Comprehensive Technical Review by Zhongzhu Zhou
Date: April 26, 2026
Paper: OGER by Ma et al., ArXiv:2604.18530