CompositionalReasoning | From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning

Motivation

Why does SFT → RL improve reasoning?

SFT: imitation pressure

Prompt → Golden trace → Answer

Canonical demonstrations can solve the supervised task, but they may keep latent modules (skills, routing mechanisms) entangled inside a single golden trace.

RL: exploration plus reward

Prompt → Trace A / Trace B / Trace C → Reward

Reward-conditioned trajectory variation exposes local cues, valid choices, and interfaces that make latent modules identifiable.

Model and theory

Hierarchical latent selection model

P induces discrete latent selections S;
these selections S generate the observed reasoning trace D.
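
As a rough illustration of this two-level structure, the sketch below first draws selections from P and then unrolls them into a trace. It is only a toy under stated assumptions: P is taken to be the prompt string, each selection is the name of a string operation, and sampling is seeded by the prompt. The skill names and the functions `sample_selections` and `generate_trace` are hypothetical and do not come from the paper.

```python
import random

# Hypothetical module inventory; the names are illustrative only.
SKILLS = {
    "reverse": lambda s: s[::-1],
    "upper": lambda s: s.upper(),
    "drop_first": lambda s: s[1:],
}

def sample_selections(prompt, depth):
    """P -> S: the prompt induces a sequence of discrete latent selections."""
    # Assumption: seeding by the prompt stands in for a learned p(S | P).
    rng = random.Random(prompt)
    names = sorted(SKILLS)
    return [rng.choice(names) for _ in range(depth)]

def generate_trace(prompt, selections):
    """S -> D: the selections generate the observed reasoning trace."""
    trace, state = [], prompt
    for name in selections:
        state = SKILLS[name](state)
        trace.append(f"{name} -> {state}")
    return trace, state

# With the selections fixed by hand, the trace is fully determined.
trace, answer = generate_trace("abc", ["reverse", "upper"])
print(trace)   # ['reverse -> cba', 'upper -> CBA']
print(answer)  # CBA
```

Only D is observed in training data; the selections S stay latent, which is why identifying them from traces is the first step of the theory.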

Theory: identification → composition

Experimental design

String transformations with controlled support

Synthetic tasks let us manipulate which modules and compositions are present during SFT and RL.

Experimental design for atomic skills, routing mechanisms, and out-of-distribution composition evaluation

Two module types: 24 skills (local operations) and 10 routing mechanisms (how intermediate results are selected and reused).
Evaluation: unseen compositions and deeper traces not observed during training. Composition depth L controls difficulty.
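
The setup above can be sketched in miniature. The example below is a toy with 3 skills and 2 routing mechanisms rather than the 24 and 10 used in the experiments. The module names, the `run` and `programs` helpers, and the "hold out every 4th composition" rule are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical stand-ins for the skills (local operations).
SKILLS = {
    "reverse": lambda s: s[::-1],
    "upper": lambda s: s.upper(),
    "double": lambda s: s + s,
}
# Hypothetical routing mechanisms: which intermediate result is reused.
ROUTERS = {
    "last": lambda history: history[-1],   # reuse the most recent result
    "first": lambda history: history[0],   # go back to the original input
}

def run(program, text):
    """Apply each (skill, router) step and return all intermediate results."""
    history = [text]
    for skill, router in program:
        history.append(SKILLS[skill](ROUTERS[router](history)))
    return history

STEPS = sorted(product(SKILLS, ROUTERS))  # 3 skills x 2 routers = 6 atomic steps

def programs(depth):
    """Enumerate every composition of the given depth L."""
    return list(product(STEPS, repeat=depth))

# Controlled support: hold out every 4th depth-3 composition (arbitrary rule).
depth3 = programs(3)
seen = [p for i, p in enumerate(depth3) if i % 4 != 0]
unseen = [p for i, p in enumerate(depth3) if i % 4 == 0]
deeper = programs(4)  # deeper than any training trace

example = (("reverse", "last"), ("upper", "first"), ("double", "last"))
print(run(example, "ab"))  # ['ab', 'ba', 'AB', 'ABAB']
```

Because every composition is enumerable, the seen and unseen sets can be made disjoint by construction, which is what makes the out-of-distribution evaluation controlled.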

Results

Four findings from controlled SFT/RL data

Finding 1

RL decomposes traces into atoms and recombines them for generalization.

Table: Average accuracy advantage of RL over SFT.

Left figure: Much larger gains on unseen than seen compositions (L=L_train=3).

Right figure: Effect of RL data structure on OOD compound traces (L=4).

Training setting	Accuracy	Gain (percentage points)
SFT baseline	4.8%	—
SFT + RL, atomic modules	14.8%	+10.0
SFT + RL, compound traces	42.6%	+37.8
Compound advantage	—	+27.8
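
The Gain column follows directly from the Accuracy column. The short check below recomputes it from the three reported accuracies; the variable names are chosen here for illustration.

```python
# Reported accuracies (percent) from the table above.
accuracy = {
    "SFT baseline": 4.8,
    "SFT + RL, atomic modules": 14.8,
    "SFT + RL, compound traces": 42.6,
}

baseline = accuracy["SFT baseline"]

# Gain = accuracy minus the SFT baseline, in percentage points.
gains = {
    name: round(acc - baseline, 1)
    for name, acc in accuracy.items()
    if name != "SFT baseline"
}

# Compound advantage = extra gain of compound-trace RL over atomic-module RL.
compound_advantage = round(
    gains["SFT + RL, compound traces"] - gains["SFT + RL, atomic modules"], 1
)

print(gains)               # {'SFT + RL, atomic modules': 10.0, 'SFT + RL, compound traces': 37.8}
print(compound_advantage)  # 27.8
```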

Average accuracy advantage of RL over SFT by depth

RL gains on seen and unseen compositions

Finding 2

Skill composability requires combinational exposure during RL.

When held-out skills are reintroduced, composed examples repair generalization better than atom-only examples.

Composability is learned at interfaces: the model must see components inside composed contexts, not only in isolation.
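
To make the two reintroduction settings concrete, the sketch below builds both kinds of RL example sets for one held-out skill. It is a toy: the skill names, the choice of `"double"` as the held-out skill, and the helpers `atom_only` and `composed` are assumptions for illustration, not the paper's data pipeline.

```python
from itertools import product

SKILLS = ["reverse", "upper", "double", "drop_first"]  # hypothetical names
HELD_OUT = "double"  # skill withheld earlier, then reintroduced during RL

def atom_only(skill):
    """Depth-1 programs: the skill appears only in isolation."""
    return [(skill,)]

def composed(skill, depth):
    """Programs of the given depth that use the skill next to other skills."""
    return [
        p for p in product(SKILLS, repeat=depth)
        if skill in p and len(set(p)) > 1
    ]

print(atom_only(HELD_OUT))         # [('double',)]
print(len(composed(HELD_OUT, 2)))  # 6
```

In the composed set the held-out skill appears both before and after other skills, so the model sees its inputs and outputs at the interfaces with them; the atom-only set never exposes those interfaces.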

Skill composability curves under different reintroduction settings

Finding 3

Routing mechanisms are learned like skills.

Practical connection: many reasoning failures are missing ways to connect operations, not missing operations.

Routing mechanism learning curves for seen and unseen compositions

Finding 4

The distribution relationship between SFT and RL data matters.

Top figure: RL works best when it explores beyond SFT support.

Bottom figure: SFT should cover the atomic inventory through compositional traces.

Accuracy under different SFT and RL distribution relationships

Accuracy when SFT covers atomic inventory through compositional traces

Citation

@misc{kong2026reasoningtracesreusablemodules,
      title={From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning}, 
      author={Lingjing Kong and Xin Liu and Guangyi Chen and Martin Q. Ma and Xiangchen Song and Yuekai Sun and Mikhail Yurochkin and Taylor W. Killian and Ruslan Salakhutdinov and Kun Zhang and Eric P. Xing and Zhengzhong Liu},
      year={2026},
      eprint={2606.18089},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.18089}, 
}