Structure Detection for Contextual Reinforcement Learning

Massachusetts Institute of Technology
AAAI 2026

*Equal Contribution
Teaser Image


In this work, we introduce Structure Detection MBTL (SD-MBTL), a generic framework that dynamically identifies the underlying generalization structure of a CMDP and selects an appropriate MBTL algorithm.

Abstract

Contextual Reinforcement Learning (CRL) tackles the problem of solving a set of related Contextual Markov Decision Processes (CMDPs) that vary across different context variables. Traditional approaches---independent training and multi-task learning---struggle with either excessive computational costs or negative transfer. A recently proposed multi-policy approach, Model-Based Transfer Learning (MBTL), has demonstrated effectiveness by strategically selecting a few source tasks to train on and zero-shot transferring to the rest. However, CMDPs encompass a wide range of problems, exhibiting structural properties that vary from problem to problem. As such, different task selection strategies are suitable for different CMDPs. In this work, we introduce Structure Detection MBTL (SD-MBTL), a generic framework that dynamically identifies the underlying generalization structure of a CMDP and selects an appropriate MBTL algorithm. For instance, we observe a Mountain structure in which generalization performance degrades from the training performance of the target task as the context difference increases. We thus propose M/GP-MBTL, which detects the structure and adaptively switches between a Gaussian Process-based approach and a clustering-based approach. Extensive experiments on synthetic data and CRL benchmarks—covering continuous control, traffic control, and agricultural management—show that M/GP-MBTL surpasses the strongest prior method by 12.49% on the aggregated metric. These results highlight the promise of online structure detection for guiding source task selection in complex CRL environments.

Structure Detection MBTL at a Glance

Real-world CRL deployments must juggle families of related tasks whose dynamics shift with payloads, weather, or traffic demand. Training a separate agent for every context wastes samples, yet a single multi-task policy often collapses under negative transfer once the context space grows. SD-MBTL instead asks whether the current CMDP exhibits a recognizable structure (e.g., the Mountain pattern where performance decays smoothly with context distance) and then picks the best source-task selection strategy on the fly. The rest of the page summarizes the math behind this detector and how it feeds into the training loop.

Generalization-Performance Decomposition

For each source–target pair we split the transfer return into source quality, target difficulty, and dissimilarity:

\[J(\pi_x, y) = f(x) + g(y) + h(x,y) + C. \]

After training on contexts \(x_{1:k}\), we estimate target difficulty by the empirical mean over trained policies and subtract it to obtain \(\overline{J}(\pi_x,y) = J(\pi_x,y) - \mathbb{E}_{x' \in x_{1:k}}[J(\pi_{x'},y)]\); we then apply two detection criteria:

\[ \texttt{Mountain} \Longleftrightarrow \begin{cases} \operatorname{std}_{x} \overline{J}(\pi_x,x) < \mathbb{E}_{x}\big[\operatorname{std}_{y} \overline{J}(\pi_x,y)\big], \\ \text{sgn}(\theta_{\text{left}}^d) = \text{sgn}(\theta_{\text{right}}^d) \quad \forall d, \end{cases} \]

where \(\theta_{\text{left/right}}^d\) are slopes from a linear regression on the signed L1 context differences in dimension \(d\). Passing both tests means policy quality is almost constant and dissimilarity behaves like a distance metric, so the CMDP can be treated as Mountain. Otherwise training performance is heterogeneous and we rely on GP modeling.
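To make the two tests concrete, a minimal NumPy sketch of the detector is given below. The helper name detect_mountain, the nearest-target lookup used to read off the diagonal entries, and the per-dimension slope fit (regressing centred performance on the L1 distance separately for targets left and right of each source) are illustrative assumptions, not the released implementation.

```python
import numpy as np

def detect_mountain(transfer, src_ctx, tgt_ctx):
    """Decide whether the Mountain structure holds (illustrative sketch).

    transfer : (k, n) array, transfer[i, j] = J(pi_{x_i}, y_j)
    src_ctx  : (k, d) array of trained source contexts x_{1:k}
    tgt_ctx  : (n, d) array of target contexts Y (assumed to include the sources)
    """
    # Remove the target-difficulty estimate: subtract the empirical mean
    # over trained policies from every column.
    centred = transfer - transfer.mean(axis=0, keepdims=True)

    # Diagonal entries J-bar(pi_x, x): locate each source among the targets.
    src_idx = [int(np.abs(tgt_ctx - x).sum(axis=1).argmin()) for x in src_ctx]
    train_perf = centred[np.arange(len(src_ctx)), src_idx]

    # Test 1: training performance varies less across sources than
    # transfer performance varies across targets.
    variance_ok = train_perf.std() < centred.std(axis=1).mean()

    # Test 2: in every context dimension, the slope of performance against
    # context distance has the same sign on both sides of the source.
    slopes_ok = True
    for dim in range(src_ctx.shape[1]):
        diff = tgt_ctx[None, :, dim] - src_ctx[:, None, dim]   # signed difference
        left, right = diff < 0, diff > 0
        if left.any() and right.any():
            theta_l = np.polyfit(-diff[left], centred[left], 1)[0]
            theta_r = np.polyfit(diff[right], centred[right], 1)[0]
            slopes_ok &= np.sign(theta_l) == np.sign(theta_r)

    return bool(variance_ok and slopes_ok)
```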

Detect → Select Loop

Train: dedicate millions of PPO steps to one source context \(x_k\) to obtain a high-quality policy \(\pi_{x_k}\).

Evaluate: zero-shot transfer \(\pi_{x_k}\) to every target \(y\in Y\) to populate a new row of the transfer matrix.

Detect: update the statistics above, decide whether Mountain holds, and route training to the corresponding selector.

Each iteration of SD-MBTL therefore looks like:

  1. Update statistics. The fresh transfer row refines the estimates of \(f(x)\), \(g(y)\), and \(h(x,y)\), and we test the variance and slope criteria above.
  2. Decide the structure. Passing both tests tags the CMDP as Mountain; otherwise we assume heterogeneous policy quality and lean on GP modeling.
  3. Select the next task. We invoke the algorithm matched to the detected structure and add the chosen context to \(x_{1:k}\) (see the sketch after this list).
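Putting the three steps together, the outer loop can be sketched as follows. Here train_policy, evaluate_transfer, and the two selectors (sketched after the next equation) are hypothetical stand-ins for the components described on this page, not the authors' code.

```python
import numpy as np

def sd_mbtl(candidate_ctx, target_ctx, budget, train_policy, evaluate_transfer):
    """Outer Detect -> Select loop (simplified sketch).

    candidate_ctx : (m, d) array of source contexts X we may train on
    target_ctx    : (n, d) array of target contexts Y
    budget        : number of source tasks to train
    train_policy      : callable, context -> trained policy (e.g. PPO)
    evaluate_transfer : callable, (policy, targets) -> (n,) array of returns
    """
    trained_ctx, policies, rows = [], [], []
    next_ctx = candidate_ctx[len(candidate_ctx) // 2]   # arbitrary first pick

    for _ in range(budget):
        # Train: spend the full RL budget on one source context.
        policy = train_policy(next_ctx)
        # Evaluate: zero-shot transfer to every target -> one new matrix row.
        rows.append(evaluate_transfer(policy, target_ctx))
        trained_ctx.append(next_ctx)
        policies.append(policy)

        transfer = np.stack(rows)          # (k, n) transfer matrix
        src = np.stack(trained_ctx)        # (k, d) trained contexts
        # Detect: route the next selection to the matching branch.
        if detect_mountain(transfer, src, target_ctx):
            next_ctx = select_m_mbtl(candidate_ctx, src, target_ctx)
        else:
            next_ctx = select_gp_mbtl(candidate_ctx, src, transfer, target_ctx)

    return trained_ctx, policies
```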

When Mountain holds, greedy source-task selection reduces to a sequential clustering problem with L1 loss:

\[ x_k = \arg\min_{x \in X} \mathbb{E}_{y \sim \mathcal{U}(Y)}\Big[\min_{x' \in x_{1:k-1} \cup \{x\}} \|x' - y\|_1\Big], \]

and we solve it via a K-means style update plus random restarts (M-MBTL). Otherwise we choose GP-MBTL: a Gaussian process regresses \(J(\pi_x,x)\), the learned slopes \(\theta_{\text{left/right}}\) produce gap estimates, and an upper-confidence acquisition balances exploration and exploitation. Both branches reuse the same transfer data but embody different inductive biases, providing a principled Detect → Select loop.
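Both branches admit short sketches under the same assumptions as above. The select_m_mbtl step below evaluates the greedy L1 clustering objective directly (omitting the K-means refinement and random restarts), while select_gp_mbtl is only schematic: scikit-learn's GaussianProcessRegressor and a fixed beta weight stand in for the paper's acquisition, and the slope-based gap estimates are left out.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def select_m_mbtl(candidate_ctx, trained_ctx, target_ctx):
    """Greedy step of the L1 clustering objective (Mountain branch)."""
    def objective(new_ctx):
        centers = np.vstack([trained_ctx, new_ctx[None]])
        # L1 distance from every target to its nearest trained/candidate source.
        dists = np.abs(target_ctx[:, None, :] - centers[None, :, :]).sum(-1)
        return dists.min(axis=1).mean()

    scores = [objective(x) for x in candidate_ctx]
    return candidate_ctx[int(np.argmin(scores))]

def select_gp_mbtl(candidate_ctx, trained_ctx, transfer, target_ctx, beta=2.0):
    """Schematic upper-confidence rule for the heterogeneous branch."""
    # Training performance J(pi_x, x): read off each source's own column.
    diag = [transfer[i, int(np.abs(target_ctx - x).sum(1).argmin())]
            for i, x in enumerate(trained_ctx)]
    gp = GaussianProcessRegressor(normalize_y=True).fit(trained_ctx, diag)
    mu, sigma = gp.predict(candidate_ctx, return_std=True)
    # Prefer contexts with high predicted training return and high uncertainty.
    # The paper's acquisition additionally discounts by slope-based gap
    # estimates over the targets; that term is omitted in this sketch.
    return candidate_ctx[int(np.argmax(mu + beta * sigma))]
```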

Algorithm overview figures: M-MBTL, GP-MBTL, and M/GP-MBTL.

Benchmark Highlights


CartPole. Even on this easy CMDP, M/GP-MBTL keeps pace with the best multi-task policy while training only a few experts.

BipedalWalker. Mountain structure dominates, so M/GP-MBTL routes to M-MBTL and clearly outperforms both random and multi-task training.


IntersectionZoo. Target-task difficulty varies sharply, so the detector prefers GP-MBTL, yielding large gains over random or independent training while remaining competitive with the GP-only variant.


CyclesGym. Another Mountain CMDP where clustering excels; SD-MBTL mirrors the M-MBTL performance and surpasses independent as well as multi-task baselines.

After min-max normalizing each benchmark between the random policy and the myopic oracle, M/GP-MBTL improves the aggregated score by 12.49% over the best prior MBTL method.

Aggregated results

BibTeX

@inproceedings{zhou2026structure,
  title={Structure Detection for Contextual Reinforcement Learning},
  author={Zhou, Tianyue and Cho, Jung-Hoon and Wu, Cathy},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}