Motivation

Why does SFT → RL improve reasoning?

SFT: imitation pressure

Prompt → Golden trace → Answer

Canonical demonstrations can solve the supervised task, but they may keep latent modules (skills, routing mechanisms) entangled inside a single golden trace.

RL: exploration plus reward

Prompt → Trace A, Trace B, Trace C → Reward

Reward-conditioned trajectory variation exposes local cues, valid choices, and interfaces that make latent modules identifiable.

Model and theory

Hierarchical latent selection model

Figure: Hierarchical latent selection model.

P induces discrete latent selections S;
S generate the observed reasoning trace D.
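
The two-stage structure above can be illustrated with a toy sampler. This is only a minimal sketch under stated assumptions: P is treated as a prompt string, the selections S are drawn uniformly from a small hypothetical skill inventory, and the trace D is the list of intermediate strings. The names `SKILLS`, `sample_selections`, and `generate_trace` are illustrative and do not come from the paper.

```python
import random

# Hypothetical toy inventory; the names are illustrative, not the paper's.
SKILLS = {
    "reverse": lambda s: s[::-1],
    "upper": lambda s: s.upper(),
    "rotate": lambda s: s[1:] + s[:1],
}

def sample_selections(prompt, depth, rng):
    """P -> S: draw a sequence of discrete latent selections.

    Assumption: selections are uniform here; in general the
    distribution over S would depend on the prompt.
    """
    return [rng.choice(sorted(SKILLS)) for _ in range(depth)]

def generate_trace(prompt, selections):
    """S -> D: the chosen selections produce the observed trace step by step."""
    trace, state = [], prompt
    for name in selections:
        state = SKILLS[name](state)
        trace.append(f"{name}: {state}")
    return trace

rng = random.Random(0)
selections = sample_selections("abc", depth=3, rng=rng)
print(selections)
print(generate_trace("abc", selections))
```

Only the trace is observed in this sketch; the list of selections stays latent, which is what makes identifying the underlying modules from traces a nontrivial question.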

Theory: identification -> composition

Figure: Theory (identification -> composition).

Experimental design

String transformations with controlled support

Synthetic tasks let us manipulate which modules and compositions are present during SFT and RL.

Figure: Experimental design for atomic skills, routing mechanisms, and out-of-distribution composition evaluation.

Two module types: 24 skills (local operations) and 10 routing mechanisms (how intermediate results are selected and reused).
Evaluation: unseen compositions and deeper traces not observed during training. Composition depth L controls difficulty.
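
To make the setup concrete, the sketch below builds a tiny version of such a task. It is a hedged illustration, not the paper's benchmark: the three skills, the two routing mechanisms (`last`, `first`), and the every-k-th holdout rule in `split_compositions` are all assumptions chosen for brevity.

```python
from itertools import product

# Hypothetical skills (local string operations); the paper uses 24 of them.
SKILLS = {
    "reverse": lambda s: s[::-1],
    "upper": lambda s: s.upper(),
    "double": lambda s: s + s,
}

# Hypothetical routing mechanisms: which intermediate result feeds the next skill.
ROUTES = {
    "last": lambda history: history[-1],   # reuse the most recent result
    "first": lambda history: history[0],   # go back to the original input
}

def run_program(steps, text):
    """Apply (route, skill) steps in order; return every intermediate result."""
    history = [text]
    for route, skill in steps:
        history.append(SKILLS[skill](ROUTES[route](history)))
    return history

def split_compositions(depth, holdout_every=4):
    """Enumerate all depth-L skill sequences and hold out every k-th one as unseen."""
    all_comps = list(product(sorted(SKILLS), repeat=depth))
    unseen = set(all_comps[::holdout_every])
    seen = [c for c in all_comps if c not in unseen]
    return seen, sorted(unseen)

seen, unseen = split_compositions(depth=2)
print(len(seen), len(unseen))
print(run_program([("last", "reverse"), ("first", "upper")], "abc"))
```

Training would draw only from `seen`, while evaluation would use `unseen` at the training depth and all sequences at a larger depth, mirroring the two out-of-distribution axes described above.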

Results

Four findings from controlled SFT/RL data

Finding 1

RL decomposes traces into atoms and recombines them for generalization.

Table: Average accuracy advantage of RL over SFT.

Left figure: Much larger gains on unseen than seen compositions (L = L_train = 3).

Right figure: Effect of RL data structure on OOD compound traces (L=4).

Training setting           | Accuracy | Gain
SFT baseline               | 4.8%     | -
SFT + RL, atomic modules   | 14.8%    | +10.0
SFT + RL, compound traces  | 42.6%    | +37.8
Compound advantage         |          | +27.8
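
The gain column follows directly from the reported accuracies, as the short check below shows. It assumes gains are percentage-point differences from the SFT baseline and that the compound advantage is the difference between the two RL gains; the dictionary keys simply reuse the row labels.

```python
# Accuracies (in %) as reported in the table.
accuracy = {
    "SFT baseline": 4.8,
    "SFT + RL, atomic modules": 14.8,
    "SFT + RL, compound traces": 42.6,
}

baseline = accuracy["SFT baseline"]

# Gain = accuracy minus the SFT baseline, in percentage points.
gains = {
    name: round(value - baseline, 1)
    for name, value in accuracy.items()
    if name != "SFT baseline"
}

# Compound advantage = extra gain from compound traces over atomic modules.
compound_advantage = round(
    gains["SFT + RL, compound traces"] - gains["SFT + RL, atomic modules"], 1
)

print(gains)
print(compound_advantage)
```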
Figure: Average accuracy advantage of RL over SFT by depth.
Figure: RL gains on seen and unseen compositions.

Finding 2

Skill composability requires combinational exposure during RL.

When held-out skills are reintroduced, composed examples repair generalization better than atom-only examples.

Composability is learned at interfaces: the model must see components inside composed contexts, not only in isolation.

Figure: Skill composability curves under different reintroduction settings.

Finding 3

Routing mechanisms are learned like skills.

Practical connection: many reasoning failures stem from missing ways to connect operations, not from missing operations.

Figure: Routing mechanism learning curves for seen and unseen compositions.

Finding 4

The distribution relationship between SFT and RL data matters.

Top figure: RL works best when it explores beyond SFT support.

Bottom figure: SFT should cover the atomic inventory through compositional traces.

Figure: Accuracy under different SFT and RL distribution relationships.
Figure: Accuracy when SFT covers atomic inventory through compositional traces.

Citation

@misc{kong2026reasoningtracesreusablemodules,
      title={From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning}, 
      author={Lingjing Kong and Xin Liu and Guangyi Chen and Martin Q. Ma and Xiangchen Song and Yuekai Sun and Mikhail Yurochkin and Taylor W. Killian and Ruslan Salakhutdinov and Kun Zhang and Eric P. Xing and Zhengzhong Liu},
      year={2026},
      eprint={2606.18089},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.18089}, 
}