Low-Rank Optimization Trajectories for LLM RLVR Acceleration: A Technical Review of NExt

Review date: 2026-05-01
Paper reviewed: Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration
Paper authors: Zhipeng Chen, Tao Qian, Wayne Xin Zhao, Ji-Rong Wen
Affiliations: Renmin University of China; China University of Mining and Technology (Beijing)
arXiv: 2604.11446v1, 2026-04-13
Code reported by paper: https://github.com/RUCAIBox/NExt
Method name: NExt, short for Nonlinear Extrapolation of low-rank Trajectories
Source used for this review: src/related-documents/papers/2604.11446-LowRankLLMAcceleration.pdf


Short answer

This paper studies a concrete bottleneck in reinforcement learning with verifiable rewards (RLVR): RLVR improves reasoning models, but it is expensive because every optimization step requires sampling multiple model responses, checking them with a verifier, computing rewards, and then running a policy update.

The authors ask a more structural question:

If the model is following a recognizable parameter trajectory during RLVR, can we learn that trajectory and jump forward instead of executing every intermediate training step?

Their answer is NExt. NExt is not a separate paper title; it is the method proposed inside the paper. It works by:

  1. running an early segment of RLVR with LoRA;
  2. saving intermediate checkpoints;
  3. computing parameter differences between checkpoints;
  4. compressing each parameter-difference matrix to its dominant rank-1 SVD component;
  5. training a small predictor to map past low-rank trajectory information to a future low-rank update;
  6. extrapolating the model parameters;
  7. continuing RLVR from the extrapolated model.

The headline result is that NExt uses 250 RLVR steps instead of 400 in the main comparison, which is a 37.5% step reduction, and the paper reports similar or better accuracy on math reasoning benchmarks. In measured wall-clock terms, the paper reports 12.0h → 7.4h for Qwen2.5-1.5B-Instruct and 18.7h → 11.7h for Qwen2.5-3B-Instruct on a 4×A800 server, while the added SVD/predictor/extrapolation costs are small relative to RLVR itself.

The most interesting part is not just the speedup. The paper also provides evidence that the dominant rank-1 subspace of RLVR updates becomes stronger during training, especially under LoRA, but that this subspace does not evolve linearly. That combination explains why older linear extrapolation methods are useful but limited: low-rank structure is real, yet the motion inside that structure needs a nonlinear model.


1. Paper identity and terminology

There is an easy naming trap here, so let us pin it down first.

  • The paper title is Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration.
  • The method proposed by the paper is NExt.
  • NExt expands to Nonlinear Extrapolation of low-rank Trajectories.
  • NExt is a parameter-extrapolation method for RLVR training, not a new RL algorithm by itself.
  • In the experiments, NExt is mainly wrapped around GRPO, and the paper also tests it with RLOO and REINFORCE++.

A good one-sentence description is:

NExt accelerates RLVR by learning a nonlinear predictor over rank-1 parameter-update trajectories extracted from LoRA checkpoints, then using the predicted future update to skip part of training.

That sentence is more precise than saying only “NExt is a low-rank method,” because the paper’s actual claim has two pieces:

  1. low-rank compression: parameter updates can be approximated by dominant rank-1 SVD components;
  2. nonlinear trajectory modeling: the dominant components do not move linearly enough for a simple line fit.

2. Why RLVR is expensive

RLVR means reinforcement learning with verifiable rewards. It is common in reasoning-focused post-training because many reasoning tasks have answers that can be checked mechanically. Math is the cleanest example: the model writes a solution, extracts a final answer, and a verifier compares that answer with ground truth.

A simplified RLVR step looks like this:

```
question x
   │
   ▼
current policy πθ samples G candidate solutions
   ├── ŷ1 → final answer â1 → verifier → reward R1
   ├── ŷ2 → final answer â2 → verifier → reward R2
   ├── ...
   └── ŷG → final answer âG → verifier → reward RG
   │
   ▼
RL objective updates θ so high-reward solutions become more likely
```

The cost comes from the inner loop. For every prompt, the trainer does not generate one answer; it samples several rollouts. Those rollouts can be long, especially for reasoning models that produce chain-of-thought-like traces. Then the system computes reward signals and performs policy updates.

The paper formulates the RLVR dataset as:

D = \{\langle x_i, a_i \rangle\}_{i=1}^{n},

where $x_i$ is a question and $a_i$ is its ground-truth answer. For a given question $x$, a policy $\pi_\theta$ samples $G$ solutions:

\{\hat{y}_1, \ldots, \hat{y}_G\} \sim \pi_\theta(\cdot \mid x).

Each solution has a final answer $\hat{a}_i$, and the verifier compares $\hat{a}_i$ with $a_i$ to produce a reward $R_i$.

The paper uses GRPO as the main RLVR algorithm. GRPO normalizes rewards within the group of responses for the same prompt. Its advantage estimate is:

A_{i,1}, \ldots, A_{i,|y_i|} = \frac{R_i - \operatorname{mean}(R_1, \ldots, R_G)}{\operatorname{std}(R_1, \ldots, R_G)}.

The exact GRPO objective is not the focus of the review, but this normalization matters conceptually: the model learns from relative success among multiple sampled attempts. That is powerful, but it also means training cost scales with rollout count, response length, and the number of RL steps.
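To make the group normalization concrete, here is a minimal NumPy sketch of the advantage computation above. The `eps` guard for all-equal rewards is our addition, not from the paper:

```python
import numpy as np

def group_normalized_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize verifier rewards within one prompt's group of G rollouts.

    Mirrors the GRPO advantage above: every token of rollout i shares
    the same advantage A_i = (R_i - mean(R)) / std(R).
    """
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)  # eps guards against all-equal rewards

# Example: 8 rollouts for one prompt, binary verifier rewards
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(group_normalized_advantages(rewards))
```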

NExt targets the last factor: the number of RLVR training steps.


3. Why predicting full model parameters is too hard

A direct way to skip training would be:

Train for a few steps, look at the parameter trajectory, and predict the full future parameter vector.

For an LLM, that is unrealistic. Even a 1.5B-parameter model has far too many coordinates to predict naively, and the update dynamics are not smooth scalar curves. Different layers, projections, and adapters move at different speeds and directions.

For a weight matrix $W \in \mathbb{R}^{m \times n}$, a full update matrix $\Delta W$ contains $m \times n$ floating-point values. Predicting every entry directly would require a predictor whose input and output are enormous. It would also risk destroying the model by making small but correlated mistakes across many parameters.

NExt avoids this by asking a narrower question:

Can we predict the dominant low-rank structure of the update instead of every coordinate?

That is where SVD and rank-1 subspaces enter.


4. Prerequisite: SVD and rank-1 approximation

For a matrix $W \in \mathbb{R}^{m \times n}$, singular value decomposition writes:

W = U \Sigma V^\top = \sum_{i=1}^{r} \sigma_i u_i v_i^\top,

where:

  • $r$ is the rank of $W$;
  • $\sigma_i$ are singular values, usually sorted so $\sigma_1 \geq \sigma_2 \geq \cdots$;
  • $u_i$ and $v_i$ are the corresponding left and right singular vectors;
  • $\sigma_i u_i v_i^\top$ is a rank-1 matrix.

The best rank-1 approximation keeps only the largest singular component:

W_1 = \sigma_1 u_1 v_1^\top.

In this paper’s terminology, the subspace associated with $\sigma_1$, $u_1$, and $v_1$ is the rank-1 subspace. The paper measures how much the first singular component dominates with:

E_1 = \frac{\sigma_1}{\sum_{i=1}^{r}\sigma_i}.

The paper calls this an energy ratio. Strictly speaking, some fields define energy using squared singular values, but here the paper’s reported formula uses the singular-value share above. For this review, we follow the paper’s formula.

The compression is large. A full $m \times n$ matrix needs $O(mn)$ numbers. A rank-1 SVD component needs roughly $m + n + 1$ numbers:

```
full delta matrix:    ΔW ∈ R^{m×n}  →  m·n numbers
rank-1 approximation: σ, u, v       →  1 + m + n numbers
```

This is the reason low-rank trajectory modeling is attractive for LLM training acceleration.
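As a concrete illustration, here is a small NumPy sketch that extracts the dominant rank-1 component of an update matrix and computes the paper's energy ratio $E_1$. The matrix shape is illustrative:

```python
import numpy as np

def rank1_approximation(delta_w: np.ndarray):
    """Return the dominant rank-1 triple (sigma, u, v), the best rank-1
    approximation, and the energy ratio E1 = sigma_1 / sum(sigma_i)."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    sigma1, u1, v1 = s[0], u[:, 0], vt[0, :]
    e1 = sigma1 / s.sum()
    w1 = sigma1 * np.outer(u1, v1)  # best rank-1 approximation of delta_w
    return (sigma1, u1, v1), w1, e1

delta_w = np.random.randn(512, 512)
(sigma1, u1, v1), w1, e1 = rank1_approximation(delta_w)
print(f"stored numbers: {1 + u1.size + v1.size} vs full {delta_w.size}")
print(f"energy ratio E1 = {e1:.3f}")
```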


5. Prerequisite: LoRA and why it matters here

LoRA, or low-rank adaptation, freezes the original weight matrix $W$ and trains a low-rank update:

h = W_i x + \Delta W_i x = W_i x + B_i A_i x,

where:

  • $W_i \in \mathbb{R}^{m \times n}$ is the base weight matrix;
  • $B_i \in \mathbb{R}^{m \times r}$ and $A_i \in \mathbb{R}^{r \times n}$ are trainable LoRA adapters;
  • $r$ is much smaller than $m$ or $n$.

After training, the adapter can be merged back:

W_{i,j} = W_{0,j} + B_{i,j} A_{i,j},

where $j$ indexes a parameter matrix and $i$ indexes a checkpoint.

NExt uses LoRA for a specific reason. The authors empirically find that LoRA makes the rank-1 subspace of parameter updates more dominant than full-parameter fine-tuning. If the downstream extrapolator only models rank-1 components, then a training mode that produces cleaner rank-1 structure is helpful.

So LoRA is not just a memory-saving implementation detail in this paper. It is part of the method’s geometry:

```
LoRA training → stronger rank-1 dominance → smaller approximation error → better extrapolation
```
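For concreteness, a minimal NumPy sketch of the merge step above. The zero-initialized $B$ and the dimensions are conventional LoRA choices, and any alpha/rank scaling is assumed folded into the adapters:

```python
import numpy as np

def merge_lora(w0: np.ndarray, b: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Merge a LoRA adapter back into the frozen base weight: W = W0 + B A."""
    return w0 + b @ a

m, n, r = 512, 512, 64            # r << m, n, as in the paper's rank-64 LoRA
w0 = np.random.randn(m, n)        # frozen base weight
b = np.zeros((m, r))              # B is conventionally zero-initialized
a = np.random.randn(r, n) * 0.01
w1 = merge_lora(w0, b, a)         # merged checkpoint weight
# The update W1 - W0 = B A has rank at most r, which is why LoRA
# checkpoints give cleaner low-rank trajectory structure.
```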

6. Previous extrapolation methods and the linearity assumption

The paper positions NExt against prior parameter extrapolation methods such as AlphaRL and RL-Extra. The key distinction is not merely that NExt is low-rank. Prior methods also exploit parameter structure. The stronger distinction is:

  • prior RLVR extrapolation methods rely on linear extrapolation;
  • NExt learns a nonlinear predictor over low-rank trajectory components.

A linear extrapolator assumes something like:

\Delta W_{t+k} \approx a\, \Delta W_t + b,

or, within the low-rank representation:

s_{t+k} \approx a s_t + b,

where $s_t$ may stand for a singular value or singular-vector component.

This is simple and cheap. The problem is that RLVR optimization is not guaranteed to behave like a straight line. Reward feedback can change exploration patterns; policy updates can move the model into regions with different sampled responses; and LoRA adapters can rotate their dominant directions over time.

The paper’s empirical section exists to test exactly this assumption.


7. Empirical finding 1: rank-1 subspaces become more dominant during RLVR

The first empirical question is:

Does the rank-1 approximation become more or less reasonable as RLVR proceeds?

The authors track the rank-1 energy ratio of parameter updates during RLVR under two training settings:

  1. full-parameter fine-tuning;
  2. LoRA fine-tuning.

They report that, in the early stage of training, the rank-1 ratio gradually increases. That means the largest singular component explains a larger share of the parameter-update matrix. The effect is more pronounced under LoRA than under full-parameter fine-tuning.

Conceptually:

```
initial RLVR updates:  several directions matter
later RLVR updates:    one dominant direction becomes clearer
LoRA RLVR updates:     the dominant direction is clearer still
```

The paper’s Figure 2 shows this trend qualitatively. It does not need to prove that every update is exactly rank-1. The useful claim is weaker and more practical:

Rank-1 is a lossy approximation, but during LoRA-based RLVR it becomes good enough to use as the state representation for an extrapolator.

This is a meaningful observation because the method would be much less convincing if rank-1 components were unstable or explained only a tiny fraction of the updates.


8. Empirical finding 2: rank-1 trajectories are not reliably linear

The second empirical question is:

If rank-1 components matter, can we extrapolate them with a linear model?

The authors use the first 10 checkpoints of RLVR, fit least-squares predictors, and predict the rank-1 subspace for the next 5 checkpoints. They then compute $R^2$ values for parameter predictions across four model sizes.

Their key reported result is:

  • more than 50% of parameter updates have $R^2 < 0$;
  • a subset has $R^2 < -0.5$.

An $R^2 < 0$ is an important warning sign. It means the linear predictor is worse than simply predicting the mean. In plain language, many parameters are not just “a little nonlinear”; linear prediction is actively misleading for them.
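This experiment is easy to mirror in a few lines. The sketch below fits a least-squares line on the first 10 points of a single rank-1 component and scores the next 5 with $R^2$; the curved toy series is ours, not paper data, and is chosen so the line fails:

```python
import numpy as np

def linear_extrapolation_r2(series: np.ndarray, n_fit: int = 10, n_test: int = 5) -> float:
    """Fit a least-squares line to the first n_fit checkpoints of one rank-1
    component, then score the next n_test checkpoints with R^2.
    R^2 < 0 means the line predicts worse than the test-window mean."""
    t_fit = np.arange(n_fit)
    slope, intercept = np.polyfit(t_fit, series[:n_fit], deg=1)
    t_test = np.arange(n_fit, n_fit + n_test)
    pred = slope * t_test + intercept
    true = series[n_fit:n_fit + n_test]
    ss_res = np.sum((true - pred) ** 2)
    ss_tot = np.sum((true - true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# A trajectory that bends after the fit window defeats the line: R^2 < 0
curved = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,
                   0.85, 0.7, 0.5, 0.3, 0.1])
print(linear_extrapolation_r2(curved))  # strongly negative
```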

A visual mental model:

```
linear extrapolation hopes for this:

update direction
│                    • t+5
│              •
│         •
│     •
│ •
└────────────────→ step

but the paper observes many trajectories closer to this:

update direction
│       •   •
│    •         •
│  •               •
│ •
│                       • t+5
└────────────────→ step
```

This motivates the central design choice of NExt: keep the low-rank representation, but replace linear extrapolation with a learned nonlinear predictor.


9. NExt pipeline overview

NExt has three stages:

  1. extract low-rank trajectory data from early LoRA-based RLVR;
  2. model the trajectory using a predictor trained on global/local deltas;
  3. extrapolate model parameters and continue RLVR.

A compact workflow:

            
```mermaid
flowchart LR
    A[Backbone model M0] --> B[LoRA-based RLVR for early steps]
    B --> C[Save intermediate checkpoints M1...Mc]
    C --> D[Compute global, local, and target deltas]
    D --> E[SVD each delta and keep rank-1 components]
    E --> F[Train nonlinear trajectory predictor]
    F --> G[Predict future rank-1 target delta]
    G --> H[Extend parameters: W_hat = W + alpha * DeltaW_hat]
    H --> I[Continue RLVR from extrapolated model]
```
          

The important point is that NExt does not claim “train 150 steps and stop.” It uses extrapolation as a jump, then runs additional RLVR after the jump. In the main setting:

```
first RLVR segment:       150 steps
checkpoint interval:      every 10 steps → 15 checkpoints
predict horizon k:        5 checkpoints → 50 training steps
post-extrapolation RLVR:  100 steps
total NExt RLVR steps:    250 steps
main vanilla comparison:  400 steps
```

So the method is best understood as train → extrapolate → train, not as pure one-shot parameter prediction.


10. The three deltas: global, local, and target

For each saved checkpoint ii and each parameter matrix jj, the paper defines three parameter differences.

10.1 Global delta

The global delta measures how far the current checkpoint has moved from the backbone:

\Delta W_{i,j}^{G} = W_{i,j} - W_{0,j}.

It answers:

Where are we relative to the starting model?

This captures the accumulated direction of training.

10.2 Local delta

The local delta measures the recent step-to-step movement:

\Delta W_{i,j}^{L} = W_{i,j} - W_{i-1,j}.

It answers:

What direction are we moving right now?

This captures local velocity.

10.3 Target delta

The target delta measures the future movement the predictor should learn:

\Delta W_{i,j}^{T} = W_{i+k,j} - W_{i,j}.

It answers:

From checkpoint $i$, what update would take us $k$ checkpoints into the future?

The predictor learns:

(\Delta W_{i,j}^{G}, \Delta W_{i,j}^{L}) \longrightarrow \Delta W_{i,j}^{T}.

Because full matrices are too large, NExt does this after SVD compression.
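A minimal sketch of this bookkeeping, assuming the checkpoints have already been merged into full weight matrices for one parameter matrix $j$ (shapes and counts are illustrative):

```python
import numpy as np

def build_delta_triples(weights: list, k: int = 5) -> list:
    """weights[i] is the merged weight matrix W_{i,j} at checkpoint i
    (index 0 is the backbone). Returns (global, local, target) delta
    triples for every checkpoint that has k future checkpoints."""
    triples = []
    for i in range(1, len(weights) - k):
        g = weights[i] - weights[0]        # global delta: drift from backbone
        l = weights[i] - weights[i - 1]    # local delta: most recent movement
        t = weights[i + k] - weights[i]    # target delta: k checkpoints ahead
        triples.append((g, l, t))
    return triples

# 16 toy checkpoints of one 64x64 parameter matrix (index 0 = backbone)
ckpts = [np.random.randn(64, 64) for _ in range(16)]
print(len(build_delta_triples(ckpts)))  # usable checkpoints: 1 .. c-k
```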


11. Rank-1 compression of the deltas

For each global, local, and target delta, NExt applies SVD and keeps the largest singular component:

\Delta W_{i,j} \approx \sigma_{i,j}\, u_{i,j} v_{i,j}^{\top}.

After this step, the predictor does not need to output a full matrix. It predicts the rank-1 components:

  • the largest singular value $\sigma$;
  • the left singular vector $u$;
  • the right singular vector $v$.

The same idea applies to each delta type:

```
ΔW^G → (σ^G, u^G, v^G)
ΔW^L → (σ^L, u^L, v^L)
ΔW^T → (σ^T, u^T, v^T)
```

The training example becomes:

```
input:  rank-1(global delta), rank-1(local delta)
target: rank-1(target delta)
```

This is the paper’s core abstraction. It converts an impossible full-parameter prediction problem into many smaller regression problems over singular values and singular vectors.
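Putting the last two sections together, here is a small NumPy sketch of how one training example could be assembled; the function names are ours, not from the released code:

```python
import numpy as np

def to_rank1(delta: np.ndarray):
    """Compress one delta matrix to its dominant (sigma, u, v) triple."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return s[0], u[:, 0], vt[0, :]

def make_training_example(g_delta: np.ndarray, l_delta: np.ndarray,
                          t_delta: np.ndarray) -> dict:
    """One predictor example: rank-1(global), rank-1(local) -> rank-1(target)."""
    return {"input": (to_rank1(g_delta), to_rank1(l_delta)),
            "target": to_rank1(t_delta)}
```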


12. The nonlinear trajectory predictor

The predictor is an encoder-decoder MLP. The paper simplifies notation by using $s$ to stand for one of the rank-1 components: $u$, $v$, or $\sigma$.

For global and local components $s^G$ and $s^L$, the predictor computes:

h^G = E^G(s^G), \qquad h^L = E^L(s^L),

then concatenates the hidden states:

h = \operatorname{Concat}(h^G, h^L),

and decodes the target component:

\hat{s}^T = D(h).

So the predictor is not a Transformer and not another LLM. It is a lightweight regression model over compressed trajectory features.

The loss is an L1 loss:

L_P(\theta_P) = \frac{1}{c} \sum_{i=1}^{c}\sum_{j=1}^{l} \left|\pi_{\theta_P}(s_{i,j}^{G}, s_{i,j}^{L}) - s_{i,j}^{T}\right|,

where:

  • $c$ is the number of saved checkpoints;
  • $l$ is the number of parameter matrices;
  • $\pi_{\theta_P}$ is the trajectory predictor.

The authors state that they use L1 rather than L2 to avoid overly small gradients from L2 in this regression setting.
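A plausible PyTorch rendering of this predictor is sketched below. The hidden width, ReLU activations, and batch shapes are assumptions, since the paper does not specify them; only the encode-concatenate-decode structure and the L1 loss come from the paper:

```python
import torch
import torch.nn as nn

class TrajectoryPredictor(nn.Module):
    """Encoder-decoder MLP over one rank-1 component type (u, v, or sigma).
    Hidden size 256 and ReLU are illustrative choices, not from the paper."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.enc_global = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.enc_local = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.dec = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, s_global: torch.Tensor, s_local: torch.Tensor) -> torch.Tensor:
        # Encode global and local components, concatenate, decode the target
        h = torch.cat([self.enc_global(s_global), self.enc_local(s_local)], dim=-1)
        return self.dec(h)

# One batch of rank-1 components (e.g. singular vectors of dimension 512)
model = TrajectoryPredictor(dim=512)
loss_fn = nn.L1Loss()  # L1 rather than L2, per the paper
s_g, s_l, s_t = (torch.randn(8, 512) for _ in range(3))
loss = loss_fn(model(s_g, s_l), s_t)
loss.backward()
```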


13. Predict-extend extrapolation

After the predictor is trained, NExt applies it to the last checkpoint. For each parameter matrix $W$, it computes the latest global and local deltas, extracts their rank-1 components, predicts the target rank-1 update, reconstructs a rank-1 matrix:

\widehat{\Delta W} = \hat{\sigma}\,\hat{u}\hat{v}^{\top},

and extrapolates:

\widehat{W} = W + \alpha \cdot \widehat{\Delta W}.

The scalar $\alpha$ is the extending coefficient. In the main experiments, the paper sets:

\alpha = 1.5.

The paper also adds two implementation details:

  1. Normalize predicted singular vectors. Since true SVD singular vectors have unit norm, predicted vectors should be normalized before reconstructing the update:

    \hat{s}^{T} = \frac{\pi_{\theta_P}(s_{i,j}^{G}, s_{i,j}^{L})}{\left\lVert \pi_{\theta_P}(s_{i,j}^{G}, s_{i,j}^{L}) \right\rVert}.

  2. Concatenate same-dimensional singular vectors for efficiency. This lets GPU kernels process many small prediction tasks together.

In practice, the extrapolated model is not treated as final. It is a better starting point for the next RLVR segment.
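A compact PyTorch sketch of the predict-extend step, including the unit-norm renormalization of predicted singular vectors described above; the tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def predict_extend(w: torch.Tensor, sigma_hat: torch.Tensor,
                   u_hat: torch.Tensor, v_hat: torch.Tensor,
                   alpha: float = 1.5) -> torch.Tensor:
    """Rebuild the predicted rank-1 update and jump: W_hat = W + alpha * DeltaW_hat.
    Predicted singular vectors are renormalized to unit norm first,
    matching the paper's first implementation detail."""
    u_hat = F.normalize(u_hat, dim=-1)
    v_hat = F.normalize(v_hat, dim=-1)
    delta_w_hat = sigma_hat * torch.outer(u_hat, v_hat)
    return w + alpha * delta_w_hat

w = torch.randn(512, 512)
w_hat = predict_extend(w, torch.tensor(0.8), torch.randn(512), torch.randn(512))
```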


14. Algorithm in plain pseudocode

```
Input:
  backbone model M0
  RLVR training set D
  checkpoint count c
  prediction horizon k
  extending coefficient α

Stage 1: collect trajectory
  run LoRA-based RLVR from M0
  save checkpoints M1, ..., Mc

  for each checkpoint i = 1 ... c-k:
    for each parameter matrix j:
      compute global delta:  ΔW^G_{i,j} = W_{i,j} - W_{0,j}
      compute local delta:   ΔW^L_{i,j} = W_{i,j} - W_{i-1,j}
      compute target delta:  ΔW^T_{i,j} = W_{i+k,j} - W_{i,j}
      SVD each delta and keep rank-1 components
      add (rank-1 global, rank-1 local → rank-1 target) to predictor data

Stage 2: train predictor
  initialize MLP predictor πθP
  train it with L1 loss to predict target rank-1 components

Stage 3: extrapolate
  for each parameter matrix W in the latest checkpoint:
    compute latest global and local deltas
    SVD them and keep rank-1 components
    predict future rank-1 target delta
    reconstruct ΔW_hat = σ_hat u_hat v_hat^T
    set W_hat = W + α ΔW_hat

Stage 4: continue RLVR
  use extrapolated model M_hat as the starting point
  run additional RLVR steps
```

15. Experimental setup

15.1 Training data and evaluation tasks

For RLVR training, the paper uses a dataset from prior work containing approximately 17k mathematical reasoning problems.

For evaluation, the main math tasks are:

  • AIME24;
  • AIME25;
  • AMC23;
  • Minerva;
  • OlymMATH easy.

The paper also evaluates cross-domain generalization on:

  • MMLU-Pro;
  • GPQA.

15.2 Models

The main experiments use four Qwen2.5-Instruct models:

  • Qwen2.5-1.5B-Instruct;
  • Qwen2.5-3B-Instruct;
  • Qwen2.5-7B-Instruct;
  • Qwen2.5-14B-Instruct.

The paper’s early rank-1 analysis also mentions Qwen and LLaMA models in Figure 2, but the main result tables shown in the paper are Qwen2.5.

15.3 Baselines

The paper compares against:

  • backbone model, no RLVR;
  • GRPO with full-parameter fine-tuning;
  • GRPO with LoRA fine-tuning;
  • AlphaRL;
  • RL-Extra;
  • NExt.

For algorithm generality, it also tests NExt with:

  • RLOO;
  • REINFORCE++.

15.4 Important hyperparameters

The paper reports the following main hyperparameters:

| Process | Hyperparameter | Value |
| --- | --- | --- |
| Train | train batch size | 128 |
| Train | mini batch size | 32 |
| Train | number of rollouts | 8 |
| Train | rollout temperature | 1.0 |
| Train | rollout top-p | 1.0 |
| Train | max prompt length | 1024 |
| Train | max response length | 4096 |
| Train | learning rate for full-parameter FT | $5 \times 10^{-7}$ |
| Train | learning rate for LoRA | $5 \times 10^{-6}$ |
| Train | LoRA rank | 64 |
| Train | LoRA alpha | 32 |
| Test | max response length | 4096 |
| Test | temperature | 1.0 |
| Test | top-p | 1.0 |
| Test | repeated runs | 8 |

NExt-specific settings:

| Setting | Value |
| --- | --- |
| backbone RLVR algorithm | GRPO |
| early RLVR segment | 150 steps |
| checkpoint interval | every 10 steps |
| saved checkpoints | 15 |
| prediction horizon $k$ | 5 checkpoints, i.e. 50 steps |
| extending coefficient $\alpha$ | 1.5 |
| additional RLVR after extrapolation | 100 steps |
| total RLVR steps for NExt | 250 |
| main vanilla comparison | 400 steps |

16. Main math results: larger models

The table below copies the paper’s reported averages for Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct. ICER is lower-is-better.

Qwen2.5-7B-Instruct

| Method | Steps | AIME24 | AIME25 | AMC23 | Minerva | OlymMATH | Avg. | ICER ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Backbone | — | 10.0 | 5.4 | 51.9 | 22.9 | 5.3 | 19.1 | — |
| GRPO w/ FP | 250 | 13.8 | 11.7 | 59.1 | 24.9 | 6.0 | 23.1 | 62.5 |
| GRPO w/ LoRA | 250 | 13.3 | 10.8 | 56.3 | 24.9 | 5.0 | 22.1 | 83.3 |
| GRPO w/ FP | 400 | 16.3 | 11.3 | 59.7 | 25.9 | 6.8 | 24.0 | 81.6 |
| GRPO w/ LoRA | 400 | 16.7 | 11.7 | 59.4 | 24.7 | 5.0 | 23.5 | 90.9 |
| AlphaRL | 250 | 14.6 | 8.8 | 55.9 | 23.8 | 5.0 | 21.6 | 100.0 |
| RL-Extra | 250 | 15.4 | 8.8 | 58.8 | 24.9 | 5.8 | 22.7 | 69.4 |
| NExt | 250 | 16.4 | 12.5 | 60.3 | 25.5 | 6.5 | 24.2 | 49.0 |

For the 7B model, NExt’s average score is 24.2, slightly above 400-step full-parameter GRPO at 24.0, and above 400-step LoRA GRPO at 23.5, while using 250 steps.

Qwen2.5-14B-Instruct

| Method | Steps | AIME24 | AIME25 | AMC23 | Minerva | OlymMATH | Avg. | ICER ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Backbone | — | 12.1 | 9.2 | 51.3 | 26.1 | 5.4 | 20.8 | — |
| GRPO w/ FP | 250 | 13.3 | 14.2 | 65.9 | 29.4 | 8.8 | 26.3 | 45.5 |
| GRPO w/ LoRA | 250 | 14.2 | 15.0 | 62.5 | 29.7 | 7.8 | 25.8 | 50.0 |
| GRPO w/ FP | 400 | 17.1 | 17.5 | 66.3 | 29.0 | 8.8 | 27.7 | 58.0 |
| GRPO w/ LoRA | 400 | 16.7 | 13.3 | 65.0 | 31.3 | 8.8 | 27.0 | 64.5 |
| AlphaRL | 250 | 13.3 | 12.1 | 63.4 | 28.6 | 7.8 | 25.0 | 59.5 |
| RL-Extra | 250 | 15.4 | 14.6 | 63.1 | 30.2 | 7.6 | 26.2 | 46.3 |
| NExt | 250 | 17.9 | 16.7 | 67.2 | 30.5 | 9.3 | 28.3 | 33.3 |

For 14B, the result is stronger: NExt averages 28.3, higher than 400-step full-parameter GRPO at 27.7 and 400-step LoRA GRPO at 27.0.


17. Main math results: smaller models

For smaller models, the paper evaluates AMC23 and Minerva.

| Method | Steps | 1.5B AMC23 | 1.5B Minerva | 1.5B Avg. | 1.5B ICER ↓ | 3B AMC23 | 3B Minerva | 3B Avg. | 3B ICER ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Backbone | — | 16.3 | 7.4 | 11.9 | — | 31.3 | 15.7 | 23.5 | — |
| GRPO w/ FP | 250 | 26.7 | 11.1 | 18.9 | 35.7 | 40.6 | 18.3 | 29.5 | 41.7 |
| GRPO w/ LoRA | 250 | 29.4 | 11.2 | 20.3 | 29.8 | 36.9 | 17.6 | 27.3 | 65.8 |
| GRPO w/ FP | 400 | 31.3 | 10.8 | 21.1 | 43.5 | 42.5 | 18.8 | 30.7 | 55.6 |
| GRPO w/ LoRA | 400 | 30.0 | 11.5 | 20.8 | 44.9 | 40.0 | 18.8 | 29.4 | 67.8 |
| AlphaRL | 250 | 27.5 | 11.8 | 19.7 | 32.1 | 39.4 | 17.3 | 28.4 | 51.0 |
| RL-Extra | 250 | 26.3 | 11.1 | 18.7 | 36.8 | 41.3 | 17.9 | 29.6 | 41.0 |
| NExt | 250 | 31.3 | 11.8 | 21.6 | 25.8 | 43.1 | 18.8 | 31.0 | 33.3 |

For 1.5B, NExt reaches 21.6, above the 400-step GRPO variants. For 3B, NExt reaches 31.0, again above the 400-step GRPO variants.


18. Compute results

The paper measures running time on a 4×A800 server for Qwen2.5-1.5B-Instruct and Qwen2.5-3B-Instruct.

Reported server usage:

| Model | Vanilla GRPO | NExt | Reduction |
| --- | --- | --- | --- |
| Qwen2.5-1.5B-Instruct | 12.0 h | 7.4 h | about 37.5% |
| Qwen2.5-3B-Instruct | 18.7 h | 11.7 h | about 37.5% |

The paper emphasizes that the newly added components — SVD, predictor training, and extrapolation — are only a small portion of the total runtime. That matters because an acceleration method that saves RL steps but adds a huge side computation would not be useful. Here, the added modeling cost is small compared with rollout-heavy RLVR.


19. Ablation study

The ablation asks which pieces of NExt matter:

  • What if the early RLVR segment uses full-parameter fine-tuning instead of LoRA?
  • What if the predictor lacks the global delta?
  • What if the predictor lacks the local delta?

The reported pattern is consistent:

  1. removing LoRA hurts;
  2. removing global delta hurts;
  3. removing local delta hurts;
  4. after further RLVR, the weakened variants do not fully recover.

For Qwen2.5-1.5B-Instruct, after RLVR following extrapolation:

| Method | AMC23 | Minerva | Avg. | ICER ↓ |
| --- | --- | --- | --- | --- |
| NExt | 31.3 | 11.8 | 21.6 | 25.8 |
| w/o LoRA | 28.1 | 11.5 | 19.8 | 31.6 |
| w/o G-Delta | 28.8 | 10.8 | 19.8 | 31.6 |
| w/o L-Delta | 26.9 | 10.2 | 18.6 | 37.3 |

For Qwen2.5-3B-Instruct, after RLVR following extrapolation:

| Method | AMC23 | Minerva | Avg. | ICER ↓ |
| --- | --- | --- | --- | --- |
| NExt | 43.1 | 18.8 | 31.0 | 33.3 |
| w/o LoRA | 38.8 | 18.6 | 28.7 | 48.1 |
| w/o G-Delta | 38.8 | 17.6 | 28.2 | 53.2 |
| w/o L-Delta | 40.6 | 16.9 | 28.8 | 47.2 |

Interpretation:

  • LoRA gives the method a cleaner low-rank signal.
  • Global delta tells the predictor the accumulated destination so far.
  • Local delta tells the predictor the current motion.

Together, global and local deltas resemble position and velocity. Removing either makes trajectory prediction weaker.


20. Extending coefficient sensitivity

NExt has an extending coefficient $\alpha$ in:

\widehat{W} = W + \alpha\, \widehat{\Delta W}.

The paper tests $\alpha$ values from 0.5 to 4.0. It reports that performance is relatively stable for:

\alpha \in [0.5, 2.5].

When $\alpha$ becomes too large, performance fluctuates. This is expected. Extrapolation is useful only while the predicted direction remains close enough to the true future trajectory. Pushing too far along even a good direction can overshoot.

A practical reading:

```
small α:   cautious jump, less speed gained
medium α:  useful extrapolation zone
large α:   overshoot risk and instability
```

The main paper setting $\alpha = 1.5$ sits inside the stable region.


21. Adaptation to other RLVR algorithms

The paper argues that NExt is orthogonal to the RLVR algorithm. To test this, it applies NExt to RLOO and REINFORCE++.

Reported averages:

| Method | Steps | 1.5B Avg. | 3B Avg. |
| --- | --- | --- | --- |
| Backbone | — | 11.9 | 23.5 |
| RLOO | 250 | 13.7 | 25.4 |
| RLOO | 400 | 15.4 | 27.3 |
| RLOO w/ NExt | 250 | 17.5 | 28.5 |
| REINFORCE++ | 250 | 12.7 | 24.7 |
| REINFORCE++ | 400 | 15.6 | 26.5 |
| REINFORCE++ w/ NExt | 250 | 15.8 | 27.9 |

This supports the claim that NExt is not merely exploiting a GRPO-specific quirk. It still depends on having a usable RLVR trajectory, but the extrapolation logic can be placed around different RL objectives.


22. Other-domain results: MMLU-Pro and GPQA

The paper also evaluates MMLU-Pro and GPQA to test whether the method is only useful for the math tasks used in the main training setup.

For MMLU-Pro, the paper reports that NExt reaches performance comparable to 400-step GRPO while using 250 steps. A few examples from the reported average columns:

MMLU-Pro Part 1 averages:

| Model | Backbone | GRPO | NExt |
| --- | --- | --- | --- |
| Qwen2.5-1.5B | 20.9 | 31.1 | 31.0 |
| Qwen2.5-3B | 44.0 | 48.2 | 48.8 |
| Qwen2.5-7B | 56.9 | 58.4 | 58.9 |
| Qwen2.5-14B | 66.1 | 70.5 | 70.5 |

For MMLU-Pro Part 2 averages:

| Model | Backbone | GRPO | NExt |
| --- | --- | --- | --- |
| Qwen2.5-1.5B | 17.7 | 27.2 | 27.3 |
| Qwen2.5-3B | 38.1 | 40.5 | 40.4 |
| Qwen2.5-7B | 50.2 | 52.5 | 53.2 |
| Qwen2.5-14B | 60.8 | 62.6 | 62.5 |

For GPQA, the paper presents a cost/performance figure rather than a full numeric table in the text extract used here. The stated conclusion is that NExt uses fewer GPU hours and that the extra extrapolation cost is much smaller than the RLVR cost.


23. What the paper teaches about optimization geometry

The most valuable conceptual contribution is this pair of observations:

  1. RLVR updates have a dominant low-rank structure.
  2. The dominant structure evolves nonlinearly.

This is a useful middle ground. If updates were fully high-dimensional with no dominant structure, parameter extrapolation would be hopeless. If updates were low-rank and linear, older linear methods would be enough. The paper claims the actual situation is:

```
structured enough to compress,
but curved enough to require nonlinear prediction.
```

This has a broader implication for training systems. We often treat optimization as an opaque loop: run more steps, get a better checkpoint. NExt treats the sequence of checkpoints as data. The checkpoint trajectory itself becomes something to model.

That is a promising direction because many expensive training regimes have repeated structure:

  • RLVR post-training;
  • preference optimization;
  • domain adaptation;
  • curriculum training;
  • long-running supervised fine-tuning.

NExt is a specific implementation for RLVR, but the mindset is general: if a training process is expensive and its parameter motion is predictable, we may be able to learn shortcuts through the training trajectory.


24. Boundary conditions and limitations

The paper is promising, but there are important caveats.

24.1 Rank-1 may not be enough everywhere

The method depends on rank-1 approximation being informative. The paper shows that rank-1 dominance increases, especially with LoRA, but it does not prove that all important updates are rank-1. Some layers or tasks may require higher-rank motion.

A natural extension is to use rank-$r$ trajectories:

\Delta W \approx \sum_{i=1}^{r} \sigma_i u_i v_i^\top,

with $r > 1$. That would increase predictor cost but may improve fidelity.
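A rank-$r$ variant is a one-line change to the truncation step, sketched here in NumPy for illustration:

```python
import numpy as np

def rank_r_approximation(delta_w: np.ndarray, r: int) -> np.ndarray:
    """Keep the top-r singular components of an update matrix; r = 1
    recovers NExt's representation, r > 1 trades predictor cost for fidelity."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r, :]  # broadcast s over kept columns
```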

24.2 The largest experiments are 14B

The paper tests models up to Qwen2.5-14B-Instruct. That is meaningful, but the strongest economic motivation for RLVR acceleration is at much larger scales. It remains to be shown how NExt behaves on 32B, 70B, 100B+, or mixture-of-experts models.

24.3 Main evidence is math-heavy

The core experiments are mathematical reasoning tasks. The paper includes MMLU-Pro and GPQA, which helps, but the method still needs validation on code, tool use, long-horizon agent tasks, multilingual reasoning, and alignment-oriented RL.

24.4 Extrapolation can overshoot

The $\alpha$ sweep shows that too-large extension coefficients create instability. This is not a flaw unique to NExt; it is intrinsic to extrapolation. The method needs monitoring and conservative hyperparameter choices.

24.5 Predictor architecture details are partly underspecified in the paper text

The paper gives the predictor form — MLP encoders, concatenation, MLP decoder, L1 loss — but the main text does not spell out every engineering detail such as hidden sizes, exact activation choices, optimizer settings for the predictor, or random seeds. The released code may contain these details, but the paper text alone is not a complete reproduction recipe.

24.6 Baseline reproduction matters

The paper compares against AlphaRL and RL-Extra. For production adoption, it would be important to check whether all baselines use similarly tuned LoRA/full-parameter settings, matching compute budgets, and comparable implementation quality.

24.7 Extrapolated parameters are not guaranteed safe

Any method that jumps in parameter space can potentially damage capabilities that are not measured by the chosen benchmark. The paper reports reasoning accuracy, but deployment would need extra evaluations for calibration, safety behavior, instruction following, and regression on general tasks.


25. Reproducibility notes

Positive reproducibility signals:

  • The paper reports a public code repository: https://github.com/RUCAIBox/NExt.
  • The RLVR training dataset size is specified at roughly 17k math problems.
  • The main model family and sizes are specified.
  • Major RLVR hyperparameters are listed.
  • Evaluation uses eight repeated runs.
  • Timing hardware is specified as 4×A800 for the resource comparison.
  • The checkpoint schedule and NExt-specific parameters are reported.

Items to verify before reproducing:

  • exact dataset version and preprocessing from the cited Yu et al. dataset;
  • exact reward/verifier implementation;
  • exact prompt templates and answer extraction rules;
  • predictor hidden sizes, activation functions, optimizer, and training epochs;
  • SVD implementation details for large matrices and LoRA-merged weights;
  • random seeds for RLVR sampling and predictor initialization;
  • whether baseline implementations are from official code or reimplemented;
  • whether reported wall-clock time includes data loading, evaluation, checkpoint saving, and model merging.

A practical reproduction checklist:

```
1.  Start with Qwen2.5-1.5B-Instruct before scaling up.
2.  Reproduce vanilla GRPO 250-step and 400-step results.
3.  Reproduce LoRA GRPO with rank 64 and alpha 32.
4.  Save checkpoints every 10 steps for the first 150 steps.
5.  Compute global/local/target deltas with k=5.
6.  Apply SVD and store rank-1 components.
7.  Train the predictor with L1 loss.
8.  Extrapolate with α=1.5.
9.  Continue RLVR for 100 steps.
10. Evaluate with eight repeated runs and compare against the paper table.
```

26. How I would use this idea in practice

If I were implementing this in a training stack, I would treat NExt as an optional acceleration module, not as a replacement for RLVR.

Recommended cautious rollout:

  1. Run a normal LoRA RLVR baseline and save frequent checkpoints.
  2. Measure rank-1 energy ratios layer by layer.
  3. Only enable NExt for layers whose rank-1 dominance is high enough.
  4. Use a conservative $\alpha$, such as 1.0 or 1.5.
  5. After extrapolation, run a short evaluation before continuing expensive RLVR.
  6. Keep a fallback checkpoint before the extrapolation jump.
  7. Track non-math regression benchmarks, not only the RLVR reward task.

I would also log per-layer prediction errors. Some layers may be easy to extrapolate, while others may be noisy. A layer-selective variant could be safer than applying the same mechanism everywhere.
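A layer-selective gate could be as simple as thresholding the measured $E_1$ per layer. This sketch is purely illustrative and not part of the paper; the 0.3 threshold is an arbitrary placeholder to be tuned from logged per-layer statistics:

```python
def select_layers_for_extrapolation(e1_by_layer: dict, threshold: float = 0.3) -> set:
    """Gate extrapolation per layer on rank-1 dominance: only layers whose
    measured energy ratio E1 clears the threshold get the predicted jump;
    the rest keep their trained weights unchanged."""
    return {name for name, e1 in e1_by_layer.items() if e1 >= threshold}

# Example: hypothetical measured ratios for three layers
print(select_layers_for_extrapolation(
    {"layers.0.q_proj": 0.42, "layers.0.k_proj": 0.18, "layers.1.q_proj": 0.35}))
```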


27. Final assessment

This is a useful paper because it does three things at once:

  1. It studies the geometry of RLVR parameter updates instead of treating RLVR as a black box.
  2. It identifies a practical failure mode of linear extrapolation: many rank-1 trajectories are not linear.
  3. It proposes a concrete acceleration method that reports a meaningful 37.5% reduction in RLVR steps and wall-clock time while matching or improving benchmark accuracy.

The strongest claim is not “rank-1 explains everything.” It does not. The stronger and more defensible claim is:

During LoRA-based RLVR, rank-1 update structure becomes sufficiently dominant to serve as a compressed trajectory representation, and a nonlinear predictor can use that representation to skip part of training.

The main risks are scale, task coverage, and extrapolation safety. I would not assume the method works unchanged for every RLVR setting. But as a systems idea, it is compelling: if rollout-heavy RLVR is the expensive part, then learning to skip some rollout/update cycles is exactly the kind of optimization that matters.

For readers new to the area, the key takeaway is simple:

```
NExt does not make each RLVR step cheaper.
It tries to need fewer RLVR steps.
It does that by learning how the model's low-rank update direction bends over time.
```

That makes it a worthwhile direction for anyone working on efficient reasoning-model post-training.