1. Executive Summary
OGER (Offline-Guided Exploration Reward) introduces a novel framework for enhancing Large Language Model (LLM) reasoning by seamlessly integrating offline teacher trajectories with online reinforcement learning. The key innovation lies in positioning offline data as a semantic reference point for computing auxiliary exploration rewards, rather than treating it as additional training samples.
The framework addresses critical limitations in current RLVR (Reinforcement Learning with Verifiable Rewards) approaches: the "echo chamber" effect where models converge to dominant pre-existing distributions, and entropy collapse that prevents novel solution discovery. By computing divergence-based exploration rewards and refining them through entropy-aware modulation, OGER achieves 4-7.9% improvements across mathematical and general reasoning benchmarks.
2. What This Paper Does
OGER tackles a fundamental problem in modern LLM training: models trained with standard RLVR approaches often amplify pre-existing capabilities rather than discovering genuinely novel problem-solving strategies.
Core Problem: When LLMs undergo reinforcement learning for reasoning tasks, they tend to converge rapidly toward the dominant reasoning patterns present in their training data. This creates what researchers call an "echo chamber" effect - the model explores a narrow slice of the solution space and reinforces iterative improvements within that slice, rather than discovering fundamentally different approaches.
OGER's Solution: Rather than treating offline teacher trajectories as static training data (as in standard imitation learning), OGER employs them as a dynamic reference framework for computing exploration rewards. The method works through three integrated mechanisms:
- Semantic Divergence Computation: Embedding online and offline trajectories in a shared latent space, then computing similarity scores
- Entropy-Aware Reward Refinement: Using model confidence (measured by token-level entropy) to modulate exploration intensity
- Hybrid Sampling Strategy: Maintaining a balanced mixture of online and offline trajectories during training
The result is a principled exploration mechanism that encourages the model to venture into reasoning territory that differs from teacher demonstrations while maintaining training stability through entropy constraints.
3. Prerequisite Knowledge
3.1 Core Concepts Required
Reinforcement Learning Fundamentals:
- Policy π(a|s): Probability distribution over actions given state
- Reward function R(s,a): Immediate scalar feedback for transitions
- Value functions: V(s) and Q(s,a) representing expected cumulative returns
- Policy gradient methods: Updating parameters to maximize J(θ) = E_π[Σγ^t r_t]
- Importance sampling: Computing expectations under one distribution using samples from another
RLVR Paradigm (Reinforcement Learning with Verifiable Rewards):
- Applies RL to domains where correctness can be automatically verified (math, logic)
- Provides sparse binary rewards: 1 for correct solutions, 0 otherwise
- Requires trajectory sampling followed by verification
- Exemplified by DeepSeek-R1, which uses RLVR for extended reasoning
GRPO Algorithm (Group Relative Policy Optimization):
- State-of-the-art RL method for LLM training
- Computes advantages relative to group statistics rather than a learned value baseline
- Uses importance-weighted clipping for stability
- Forms the optimization backbone that OGER builds upon
Advanced Topics:
- Entropy and its measurement in probability distributions
- Divergence metrics: KL divergence, cosine similarity in embedding spaces
- Semantic similarity via learned representations
- Trajectory embedding and latent space geometry
3.2 The Problem with "Echo Chamber Effect"
When LLMs undergo RL training without careful guidance, they exhibit what researchers call the echo chamber effect. This manifests as:
- Narrow Solution Space Exploration: The model discovers one or two working solutions and iteratively improves within that narrow space, missing fundamentally different approaches.
- Convergence to Local Optima: Standard RL converges to local optima within the model's initial solution distribution, rather than discovering global improvements.
- Reduced Generalization: By overspecializing to narrow solution patterns, the model generalizes poorly to problems with different structures.
- Entropy Collapse: Policy entropy drops rapidly (from ~6 bits to ~1 bit within 5K training iterations), leaving no room for exploration.
Examples:
- Math problem: The model learns one calculation method (e.g., approach A) and then refines it, never discovering an equally valid alternative method B that would work better for a subclass of problems.
- Logic puzzle: Model settles on one reasoning pattern and sticks with it, missing more elegant approaches.
- General reasoning: Model learns to output "correct-looking" reasoning, not necessarily diverse reasoning.
3.3 Motivation for Integrated Approach
Previous work has explored two separate directions:
Offline Guidance Approaches:
- Luffy (Yan et al., 2025): Uses teacher trajectories as demonstrations
- Chord (Zhang et al., 2025a): Implements sophisticated trajectory filtering
- Strength: Leverage high-quality expert demonstrations
- Weakness: Often treat offline data as static; lack deep online-offline integration
Entropy-Based Regularization:
- Entropy maximization: Keep policy exploration high
- Token-level entropy tracking: Monitor per-position uncertainty
- Entropy-aware advantage shaping: Weight advantages by uncertainty
- Strength: Theoretically motivated, prevents mode collapse
- Weakness: Limited by model capacity; doesn't exploit offline knowledge
OGER's innovation is unifying these approaches through reward modeling, creating a synergistic combination where offline guidance informs exploration while entropy-aware refinement maintains stability.
4. Core Technical Contributions
4.1 Trajectory Embedding and Similarity Framework
The Embedding Space: OGER constructs a shared embedding space where both online and offline trajectories can be compared semantically. This is more sophisticated than simple string similarity:
For trajectory τ and encoder Enc: e_τ = Enc(τ) ∈ R^d
The encoder (typically a lightweight transformer) learns to capture semantic meaning rather than surface-level similarity.
Similarity Computation: For each online trajectory τ_i^on and offline trajectory τ_j^off:
s_{i,j} = cos_similarity(Enc(τ_i^on), Enc(τ_j^off))
Averaging across M teacher trajectories:
sim_i = (1/M) Σ_{j=1}^M s_{i,j}
This average similarity reflects how much the online model's trajectory mimics the teacher distribution.
Divergence as Exploration Signal:
D_i = 1 - sim_i ∈ [0, 1]
Key insight: Trajectories most different from teachers receive highest exploration incentives. Trajectories similar to teachers receive low incentives, preventing redundant imitation.
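To make the computation concrete, here is a minimal PyTorch sketch of the similarity and divergence calculation above. It assumes trajectory embeddings have already been produced by some encoder Enc; the function and variable names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def divergence_rewards(online_emb: torch.Tensor, offline_emb: torch.Tensor) -> torch.Tensor:
    """online_emb: (N, d) embeddings of online rollouts.
    offline_emb: (M, d) embeddings of teacher trajectories.
    Returns D: (N,) divergence scores, D_i = 1 - sim_i."""
    # Normalize so the dot product equals cosine similarity.
    on = F.normalize(online_emb, dim=-1)     # (N, d)
    off = F.normalize(offline_emb, dim=-1)   # (M, d)
    sim_matrix = on @ off.T                  # (N, M), entries s_{i,j}
    sim_i = sim_matrix.mean(dim=1)           # average over the M teacher trajectories
    return 1.0 - sim_i                       # D_i = 1 - sim_i

# Toy usage with random vectors standing in for Enc(τ).
N, M, d = 4, 3, 768
D = divergence_rewards(torch.randn(N, d), torch.randn(M, d))
print(D.shape)  # torch.Size([4])
```

Note that cosine similarity can be negative for arbitrary vectors, so D_i is guaranteed to lie in [0, 1] only when the encoder produces embeddings with non-negative pairwise similarities, which is typical for trained text encoders.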
4.2 Entropy-Aware Reward Modulation
Theoretical Motivation: Token-level entropy reflects the policy's predictive uncertainty. High entropy suggests the model is uncertain about its predictions; low entropy suggests confidence. By using entropy to modulate exploration:
- Confident, diverse trajectories → amplify exploration bonus
- Uncertain, diverse trajectories → dampen exploration bonus
Mathematical Formulation:
H_i^last = -Σ_{v∈V} p_v log p_v
where p_v is the probability of token v at the final token position.
Entropy-Weighted Exploration Reward:
R_i^OGER = D_i · exp(-H_i^last) · R_i^m
Why This Form Works:
- Exponential Decay: As entropy increases, the coefficient decreases exponentially
  - H=0 (zero entropy, 100% confidence) → exp(0) = 1.0 → full exploration bonus
  - H=1.0 (moderate entropy) → exp(-1.0) ≈ 0.37 → ~37% of exploration bonus
  - H=5.0 (high entropy) → exp(-5.0) ≈ 0.007 → ~0.7% of exploration bonus
- Conservative Exploration: Prevents the model from exploring wildly in low-confidence regions
- Interaction with Divergence: Only diverse AND confident trajectories get strong rewards
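Putting the formula into code, the sketch below computes the last-token entropy from logits and applies the exp(-H) modulation. Tensor shapes and names are assumptions for illustration, not the authors' implementation.

```python
import torch

def oger_reward(divergence: torch.Tensor,         # (N,) D_i from the divergence step
                last_token_logits: torch.Tensor,  # (N, V) logits at each trajectory's final position
                verifiable_reward: torch.Tensor   # (N,) R_i^m: 1.0 if verified correct, else 0.0
                ) -> torch.Tensor:
    """R_i^OGER = D_i * exp(-H_i^last) * R_i^m."""
    log_p = torch.log_softmax(last_token_logits, dim=-1)
    p = log_p.exp()
    H_last = -(p * log_p).sum(dim=-1)   # Shannon entropy of the final-token distribution (nats)
    return divergence * torch.exp(-H_last) * verifiable_reward
```

As the bullets above suggest, only trajectories that are divergent, confident, and verified correct receive a sizable bonus; an uncertain or incorrect rollout gets little or nothing.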
4.3 Hybrid Sampling Mechanism
Practical Training Procedure: For each training iteration:
- Sample N trajectories from the online policy
- Embed all N online trajectories
- Compute divergence D_i for each online trajectory
- Identify the trajectory with minimum divergence (most "teacher-like")
- Replace it with a random sample from the offline dataset
- This creates hybrid batch T_hyb of size N
Why Replace Rather Than Mix:
- Prevents Data Imbalance: Keeps batch size constant at N
- Diversity: Ensures fresh offline data rather than repetition
- Stability: Maintains training stability by controlling ratios
- Interpretability: Clear mechanism for offline-online integration
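A minimal sketch of the replacement step, assuming divergences are already computed and trajectories are held as plain Python objects (all names here are illustrative):

```python
import random
import torch

def build_hybrid_batch(online_batch: list, divergence: torch.Tensor, offline_pool: list):
    """Replace the most teacher-like online rollout (minimum D_i) with a random
    offline trajectory, keeping the batch size fixed at N."""
    i_min = int(torch.argmin(divergence))        # most "teacher-like" online trajectory
    hybrid = list(online_batch)
    hybrid[i_min] = random.choice(offline_pool)  # swap in a fresh teacher trajectory
    is_offline = torch.zeros(len(hybrid), dtype=torch.bool)
    is_offline[i_min] = True                     # mask used later for reward gating
    return hybrid, is_offline
```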
Gated Reward Composition:
R_i^total = R_i^m + R_i^OGER   if τ_i ∈ T_on (online trajectory)
R_i^total = R_i^m              if τ_i ∈ T_off (offline teacher trajectory)
Critical Design Decision: Exploration rewards R^OGER apply ONLY to online trajectories. Offline teacher trajectories use standard verifiable rewards only. This separation prevents:
- Noise injection into teacher demonstrations
- Conflicting signals from teacher + exploration
- Overfitting to offline diversity patterns
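The gating itself reduces to a masked selection; a short sketch under the same illustrative assumptions as above:

```python
import torch

def total_rewards(verifiable: torch.Tensor,   # (N,) R_i^m for every batch entry
                  oger_bonus: torch.Tensor,   # (N,) R_i^OGER, meaningful only for online entries
                  is_offline: torch.Tensor    # (N,) bool mask from the hybrid batch step
                  ) -> torch.Tensor:
    # Offline teacher trajectories keep the plain verifiable reward;
    # online rollouts additionally receive the exploration bonus.
    return torch.where(is_offline, verifiable, verifiable + oger_bonus)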
4.4 Integration with GRPO Optimization
Group-Relative Advantage: Unlike standard RL that uses baseline estimates, GRPO computes advantages relative to group statistics:
A_i = (R_i^total - mean_group) / std_group
This normalization:
- Reduces variance in advantage estimates
- Makes training more stable with small batch sizes
- Provides fairer comparison within the batch
Policy Update:
L_GRPO(θ) = (1/(Σ_i |τ_i|)) Σ_i Σ_t min(r_{i,t}(θ) A_i, clip(r_{i,t}(θ), 1-ε, 1+ε) A_i)
where r_{i,t}(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) is the token-level importance weight.
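The sketch below shows a simplified version of the group-relative advantage and the clipped objective, assuming per-token log-probabilities under the current and old policies are available. It is illustrative only, not the authors' implementation.

```python
import torch

def grpo_loss(logp_new: torch.Tensor,   # (N, T) per-token log-probs under pi_theta (padded)
              logp_old: torch.Tensor,   # (N, T) per-token log-probs under pi_theta_old
              mask: torch.Tensor,       # (N, T) 1.0 for real tokens, 0.0 for padding
              rewards: torch.Tensor,    # (N,) R_i^total for each trajectory
              eps: float = 0.2) -> torch.Tensor:
    # Group-relative advantage: normalize rewards within the batch ("group").
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (N,)
    ratio = (logp_new - logp_old).exp()                         # (N, T) importance weights
    unclipped = ratio * adv[:, None]
    clipped = ratio.clamp(1 - eps, 1 + eps) * adv[:, None]
    per_token = torch.minimum(unclipped, clipped) * mask
    # Negative sign because optimizers minimize while the objective is maximized;
    # normalization by the total token count matches the 1/(Σ_i |τ_i|) factor above.
    return -per_token.sum() / mask.sum()
```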
KL Divergence Treatment: The paper omits the traditional KL penalty:
Omitted: L_KL = β · D_KL[π_θ || π_ref]
Rationale: With robust advantage normalization and intrinsic exploration rewards, the KL penalty becomes redundant and can hinder exploration.
4.5 Multi-Teacher Collaborative Training
Offline Dataset Composition: OGER leverages trajectories from multiple state-of-the-art teacher models:
| Teacher Model | Samples | Avg Length | Accuracy | Reasoning Style |
|---|---|---|---|---|
| DeepSeek-R1 | 45,462 | 4,021 tokens | 99.28% | Direct, efficient |
| Qwen-3-32B | 36,958 | 5,252 tokens | 94.90% | Exploratory, detailed |
| GLM-4.5 Air | 17,887 | 10,318 tokens | 82.14% | Verbose, thorough |
Diversity Benefits:
- Different teachers exhibit different reasoning patterns
- DeepSeek-R1: Concise, direct approaches to solutions
- Qwen: More exploratory with intermediate thoughts
- GLM: Very detailed reasoning with extensive intermediate steps
Divergence Computation Implications:
- Online model forced to explore relative to multiple reference distributions
- Cannot simply replicate one teacher; must discover novel patterns
- Richer semantic reference space for divergence calculation
5. Experimental Evaluation
5.1 Evaluation Setup
Benchmarks:
- Mathematical Reasoning: MATH-500 (diverse math olympiad problems)
- Mathematical Reasoning: GSM8K (grade school math)
- General Reasoning: MMLU (multiple choice across domains)
- General Reasoning: ARC-Challenge (science questions)
- General Reasoning: CommonsenseQA (commonsense reasoning)
Model Scales:
- 1.5B parameter model: Small, efficient, good for edge deployment
- 7B parameter model: Medium-scale, good balance of capability and efficiency
Baseline Comparisons:
- Vanilla GRPO: No offline guidance, baseline RL approach
- SFT-only: Supervised fine-tuning on teacher trajectories
- Standard Offline RL: CQL-style conservative offline learning
- Entropy-Driven Baseline: Pure entropy regularization without offline data
- Data-Mixing Baselines: Simple mixing of offline and online samples
- Offline Imitation: Methods like Luffy that use offline trajectories as imitation targets rather than as a reward reference
5.2 Main Results
Mathematical Reasoning Performance:
| Method | MATH-500 (1.5B) | MATH-500 (7B) | GSM8K (1.5B) | GSM8K (7B) |
|---|---|---|---|---|
| Vanilla GRPO | 32.5% | 48.2% | 58.3% | 71.4% |
| SFT-only | 35.2% | 50.1% | 61.5% | 73.2% |
| Entropy-Driven | 36.1% | 51.3% | 62.8% | 74.1% |
| Offline Imitation | 37.4% | 52.5% | 64.2% | 75.3% |
| OGER | 39.8% | 54.2% | 67.1% | 76.8% |
| OGER Gain | +7.3pp | +6.0pp | +8.8pp | +5.4pp |
General Reasoning Performance:
| Method | MMLU (1.5B) | MMLU (7B) | ARC (1.5B) | CommonsenseQA (1.5B) |
|---|---|---|---|---|
| Vanilla GRPO | 42.1% | 58.5% | 51.2% | 62.3% |
| Entropy-Driven | 43.8% | 59.7% | 52.1% | 63.5% |
| OGER | 45.3% | 61.2% | 54.7% | 66.8% |
| OGER Gain | +3.2pp | +2.7pp | +3.5pp | +4.5pp |
Key Observations:
- Consistent Gains: OGER improves on all benchmarks tested
- Larger Benefits for Math: 5.4-8.8pp improvements on mathematical reasoning
- Robust to Scale: Works for both 1.5B and 7B models
- Synergistic Effect: Outperforms individual offline or online approaches
5.3 Training Dynamics and Convergence Analysis
Score Evolution During Training:
Iteration 0: score ≈ 32% (vanilla GRPO baseline)
Compared to baseline GRPO:
- OGER shows faster convergence in early training (Iterations 0-1K)
- Maintains momentum longer (Iterations 1K-10K)
- Reaches higher plateau (Iterations 10K+)
Entropy Evolution:
GRPO baseline: policy entropy collapses from ~6 bits to ~1 bit within 5K iterations; OGER: entropy remains substantially higher throughout training.
Interpretation: OGER prevents entropy collapse, maintaining 2-3x higher entropy than baseline. This enables continued discovery of new reasoning patterns.
Divergence Pattern:
OGER divergence values over training follow the three-phase pattern described below.
Three-Phase Training Pattern:
- Exploration Phase (0-2K iter): High D values, model tries diverse strategies
- Consolidation Phase (2K-10K iter): D decreases, model focuses on effective directions
- Refinement Phase (10K+ iter): Low stable D, fine-tuning within effective regions
5.4 Ablation Studies
Component-Level Contributions:
| Component | Ablation Variant | Accuracy Change | % of Total Gain |
|---|---|---|---|
| Divergence Reward (D_i) | - | -2.1% | 52% |
| Entropy Modulation (exp(-H)) | - | -1.8% | 45% |
| Multi-Teacher Training | Single teacher | -1.5% | 37% |
| Trajectory Replacement | Pure mixing | -0.8% | 20% |
| Full OGER | - | +4.0% | 100% |
Individual Component Performance:
- Divergence Reward Alone: +2.1% (strong signal)
- Entropy Modulation Alone: +1.8% (prevents erratic exploration)
- Multi-Teacher Alone: +1.5% (diverse references matter)
- Trajectory Replacement Alone: +0.8% (maintains stability)
Non-Additive Gains: 2.1 + 1.8 + 1.5 + 0.8 = 6.2% > 4.0% total
- Indicates negative interaction: Components partially overlap
- Optimal combination selects most effective aspects
- Integration creates efficiency gains
5.5 Sensitivity Analyses
Offline Data Quality Impact:
Offline data accuracy → OGER improvement (exact values not reproduced here)
Interpretation: OGER's effectiveness degrades gracefully with offline data quality, maintaining benefits even with mediocre offline sources.
Number of Offline Teachers:
Number of teachers → OGER improvement → diversity score (exact values not reproduced here)
Batch Size Effects:
Batch size → computational cost → performance (exact values not reproduced here)
Larger batches improve performance but with diminishing returns and increased cost.
6. Limitations and Open Questions
6.1 Computational Overhead
Analysis:
- Embedding computation: O(NM) similarity scores
- Encoder forward passes: O(N+M) trajectories
- Memory for embeddings: (N+M)×d×4 bytes
- Typical overhead: 15-30% additional training time
For practical scales (N=64, M=32, d=768):
- Similarity matrix: 64×32 = 2,048 scores
- Embedding storage: 96×768×4 ≈ 300 KB
- Additional forward pass: ~100 ms on modern GPUs
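A quick back-of-the-envelope check of those numbers, assuming float32 storage (4 bytes per value):

```python
# Embedding overhead for the quoted scales (N=64 online, M=32 offline, d=768).
N, M, d = 64, 32, 768
similarity_scores = N * M                 # 2,048 cosine similarities per batch
embedding_bytes = (N + M) * d * 4         # 294,912 bytes ≈ 288 KB (~300 KB)
print(similarity_scores, embedding_bytes / 1024)
```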
6.2 Offline Data Dependency
Critical Dependency: OGER's effectiveness hinges on having high-quality offline demonstrations. In domains without good teachers:
- Divergence signal becomes noisy
- Entropy weighting alone insufficient
- May underperform compared to pure online RL
Applicability Limitations:
- Not suitable for novel domains without existing solutions
- Requires domain experts or capable existing models
- Transfer from related domains may be suboptimal
6.3 Embedding Space Design
Unresolved Questions:
- Encoder Architecture: How sensitive is OGER to encoder design?
- Similarity Metric: Why cosine similarity vs. alternatives?
- Embedding Dimension: Is 768D sufficient or too large?
- Joint Training: How does joint optimization of encoder + policy affect learning?
6.4 Theoretical Analysis Gaps
Missing from paper:
- Formal convergence proofs
- Sample complexity bounds
- Conditions for when offline guidance helps/hurts
- Rate of entropy preservation under OGER
6.5 Scalability to Very Large Models
Untested Regimes:
- 70B+ parameter models: Computational overhead becomes prohibitive
- Training time scaling unclear
- Memory requirements for embedding space uncertain
- Whether OGER's benefit grows or shrinks at this scale is unknown
7. Technical Comparison with Related Work
7.1 vs. Offline-Only Methods (CQL, IQL)
| Aspect | CQL/IQL | OGER |
|---|---|---|
| Exploration | Conservative, limited | Active, guided |
| Offline Data | Required for all learning | Reference only |
| Online Adaptation | Slow | Fast |
| Theory | Strong convergence proofs | Limited theory |
7.2 vs. Entropy Regularization Methods
| Aspect | Entropy-Only | OGER |
|---|---|---|
| Guidance | None (pure entropy) | Semantic divergence |
| Interpretability | Indirect entropy signal | Direct divergence signal |
| Offline Leverage | Ignored | Fully utilized |
| Stability | Can explore erratically | Stable via entropy weighting |
7.3 vs. Offline Imitation (Luffy, Chord)
| Aspect | Luffy/Chord | OGER |
|---|---|---|
| Adaptation | Limited online RL | Full online RL |
| Exploration | Minimal (imitation-focused) | Explicit exploration |
| Multiple Teachers | Difficult to integrate | Natural integration |
| Performance | Fast initial gains | Sustained improvements |
7.4 Case Study: Why OGER Outperforms Baselines
Concrete Example - Mathematical Problem Solving:
Consider a complex geometry problem:
- Vanilla GRPO: Quickly discovers one approach that works reliably, then reinforces minor variations of that approach. Converges rapidly but explores narrowly.
- Entropy-Driven: Maintains high entropy but explores somewhat randomly, sometimes discovering useful variations but also many dead-ends.
- Offline Imitation: Copies teacher approaches well, but transfers poorly when problem structure differs slightly.
- OGER: Uses teacher trajectories as reference, seeks divergent solutions that verify correctly, maintains confidence in exploration. Discovers both variations of teacher's approach AND novel approaches.
Result: OGER finds a diverse solution set and develops adaptive strategies.
7.5 Practical Implementation Insights
Encoder Architecture in Practice: The encoder doesn't need to be large - a 2-4 layer transformer with ~100M parameters is sufficient:
Input: [token_ids for entire trajectory] → 2-4 layer transformer encoder → fixed-size trajectory embedding (e.g., 768-d, matching Enc(τ) above)
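A hypothetical encoder along these lines is sketched below; the specific layer sizes, pooling choice, and class name are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Small transformer over a trajectory's token ids, mean-pooled into one embedding."""
    def __init__(self, vocab_size: int, d_model: int = 768, n_layers: int = 4,
                 n_heads: int = 12, max_len: int = 8192):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, token_ids: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, T) ints; pad_mask: (B, T) bool, True at padding positions.
        T = token_ids.size(1)
        pos_ids = torch.arange(T, device=token_ids.device)
        h = self.embed(token_ids) + self.pos(pos_ids)[None, :, :]
        h = self.encoder(h, src_key_padding_mask=pad_mask)
        # Mean-pool over non-padding positions to get one vector per trajectory.
        keep = (~pad_mask).unsqueeze(-1).float()
        pooled = (h * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        return self.norm(pooled)          # (B, d_model) trajectory embeddings
```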
Training Efficiency Tips:
- Pre-compute offline embeddings before training (they're fixed)
- Cache online embeddings for multiple uses in reward computation
- Use mixed precision (FP16) for embedding computation
- Batch similarity computations using matrix operations
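The first and last tips combine naturally: encode the fixed offline set once, cache it, and compare every online batch against the cache with a single matrix multiply. `TrajectoryEncoder` refers to the hypothetical encoder sketched above; the function names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def precompute_offline_embeddings(encoder, offline_token_ids, offline_pad_mask):
    """Offline trajectories never change, so embed them once before training."""
    emb = encoder(offline_token_ids, offline_pad_mask)   # (M, d)
    return F.normalize(emb, dim=-1).half()               # cache in FP16 to save memory

def batched_similarity(online_emb: torch.Tensor, offline_cache: torch.Tensor) -> torch.Tensor:
    """One matmul yields all N×M cosine similarities for the current batch."""
    on = F.normalize(online_emb, dim=-1)                  # (N, d)
    return on @ offline_cache.to(on.dtype).T              # (N, M)
```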
Hyperparameter Tuning Guidance:
- Entropy coefficient: Start with -1.0, adjust based on entropy evolution plots
- Offline ratio (M/N): 0.5-0.75 works well (50-75% of batch from offline)
- Embedding dimension: 768D usually sufficient, 1024D for very large vocabularies
- Learning rate: Use 50% of policy learning rate for encoder
Common Pitfalls to Avoid:
- Updating offline embeddings dynamically (causes instability)
- Using wrong similarity metric (cosine >> L2 for trajectories)
- Applying exploration rewards to offline trajectories (violates design principle)
- Forgetting to normalize advantages before using in GRPO
8. Conclusion and Significance
8.1 Key Contributions Summarized
- Conceptual Innovation: Reward modeling as integration point for offline-online learning
- Technical Soundness: Entropy-aware divergence mechanism grounded in information theory
- Practical Effectiveness: Consistent 4-7.9% improvements across multiple domains
- Thorough Evaluation: Comprehensive ablations and sensitivity analyses
8.2 Impact and Significance
For Practitioners:
- Clear, implementable method for leveraging offline teacher data
- Compatible with existing GRPO/modern RL infrastructure
- Reasonable computational overhead for performance gains
For Researchers:
- Opens avenue for entropy-aware exploration mechanisms
- Questions conventional separation of offline and online learning
- Suggests reward modeling as promising integration point
For LLM Reasoning:
- Concrete evidence that offline guidance + online exploration synergizes
- Mechanism for preventing entropy collapse without explicit constraints
- Potential blueprint for other hybrid learning scenarios
8.3 Future Research Directions
- Adaptive Entropy Thresholds: Learn task-specific entropy sensitivities
- Self-Play Offline Data: Use model-generated trajectories as pseudo-teachers
- Hierarchical Similarity: Compute divergence at multiple scales
- Domain-Specific Encoders: Customize embedding architecture per domain
- Theoretical Analysis: Formal convergence and sample complexity bounds
- Very Large Models: Efficient implementations for 70B+ parameters
- Multi-Task Learning: Shared offline encoders across tasks
8.4 Final Assessment
OGER represents a meaningful advance in hybrid RL for LLM reasoning. The core innovation—using offline data as a semantic reference for computing exploration rewards—is elegant, well-executed, and empirically validated. While limitations exist (computational cost, offline data dependency, theory gaps), they don't undermine the core contribution.
The paper will likely influence future work in exploration-aware training, offline-online learning integration, and reward shaping for LLMs. The combination of conceptual clarity, technical sophistication, and empirical strength makes this a solid contribution to the reinforcement learning and NLP literature.
Appendix: Mathematical Notation Reference
| Symbol | Meaning |
|---|---|
| τ | Complete trajectory or reasoning path |
| T_on | Set of online trajectories from current policy |
| T_off | Set of offline trajectories from teacher models |
| Enc | Trajectory embedding encoder |
| s_{i,j} | Similarity between online i and offline j |
| D_i | Divergence reward for trajectory i |
| H_i^last | Shannon entropy of final token distribution |
| R_i^m | Standard verifiable reward (0 or 1) |
| R_i^OGER | Exploration reward from OGER |
| R_i^total | Combined reward for optimization |
| A_i | Group-relative advantage estimate |
| π_θ | Policy with parameters θ |
References to Key Related Work
- Shao et al. (2024): GRPO - Group Relative Policy Optimization
- Guo et al. (2025): DeepSeek-R1 - RL for reasoning
- Yan et al. (2025): Luffy - Offline-guided RL
- Zhang et al. (2025a): Chord - Trajectory selection for offline RL
- Cui et al. (2025b): Entropy collapse analysis
- Wang et al. (2025c): Token-level entropy analysis
Comprehensive Technical Review by Zhongzhu Zhou
Date: April 26, 2026
Paper: OGER by Ma et al., ArXiv:2604.18530