Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

Beiwen Zhang^*, Yongheng Liang^*, Guowei Zou, Haitao Wang, Hejun Wu^†

Sun Yat-sen University

^* Equal contribution. ^† Corresponding author.


Selected Rollout Examples

Representative Co-π-tree rollouts on the five benchmark layouts, followed by matching human-AI collaboration clips.

Co-π-tree Policy Rollouts

One representative video for each Overcooked-AI layout.


Cramped Room

Fast specialization and dense handoffs in a compact kitchen.

Coordination Ring

Partner-aware movement around a circular shared workspace.

Counter Circuit

Intent readability matters when counters become bottlenecks.

Asymmetric Advantages

Different zones reward stable role decomposition.

Forced Coordination

Separated work areas demand complementary timing.

Human-AI Collaboration Clips

The same five layouts with a human partner interacting with Co-π-tree.


Cramped Room

Human-AI handoffs in a high-contact layout.

Coordination Ring

Shared routing decisions with a human teammate.

Counter Circuit

Human-AI recovery around crowded counter flow.

Asymmetric Advantages

Role asymmetry carries over to mixed human-AI play.

Forced Coordination

Coordinated alternation across split workspaces.

TL;DR: Co-π-tree is a closed-loop method that learns an executable policy tree consisting of a partner-behavior prediction tree and an agent-action selection tree. It distills LLM reasoning into policy tree code, evaluates the policy through partner interaction, and uses natural language feedback to improve problematic branches.

The learned policy tree preserves coordination quality while reducing online LLM dependence at test time.
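To make the two-part structure concrete, here is a minimal sketch of what an executable policy tree of this shape could look like. It is an illustration under stated assumptions, not the tree learned by Co-π-tree: the `Observation` fields, the intent labels, and the action names are all hypothetical and chosen only to resemble an Overcooked-style kitchen.

```python
from dataclasses import dataclass
from typing import Optional

# All field names, intent labels, and actions below are hypothetical
# illustrations, not branches taken from the learned Co-π-tree policy.


@dataclass(frozen=True)
class Observation:
    partner_holding: Optional[str]  # e.g. "onion", "dish", "soup", or None
    agent_holding: Optional[str]
    pot_ready: bool                 # True when a soup has finished cooking
    onions_in_pot: int


def predict_partner(obs: Observation) -> str:
    """Partner-behavior prediction tree: map an observation to a guessed intent."""
    if obs.partner_holding == "onion":
        return "fill_pot"
    if obs.partner_holding == "dish":
        return "plate_soup"
    if obs.partner_holding == "soup":
        return "deliver_soup"
    return "idle"


def select_action(obs: Observation, partner_intent: str) -> str:
    """Agent-action selection tree: pick an action that complements the partner."""
    if obs.agent_holding == "soup":
        return "deliver_soup"
    if obs.pot_ready:
        # Avoid duplicating work the partner is predicted to do.
        if partner_intent == "plate_soup":
            return "fetch_onion"
        return "plate_soup" if obs.agent_holding == "dish" else "fetch_dish"
    if partner_intent == "fill_pot" and obs.onions_in_pot >= 2:
        # The partner is about to finish filling the pot, so prepare a dish.
        return "fetch_dish"
    return "add_onion" if obs.agent_holding == "onion" else "fetch_onion"


def policy_tree(obs: Observation) -> str:
    """Compose the two trees; no LLM is queried on this execution path."""
    return select_action(obs, predict_partner(obs))
```

Because every decision is an explicit branch, a reader can trace why an action was chosen, and a single problematic branch can be edited without touching the rest of the tree.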

Overview

Existing zero-shot coordination methods often fall into two camps: MARL policies execute quickly but are difficult to interpret, while LLM-based agents expose language reasoning but query the model at every decision step. Co-π-tree uses LLMs to construct and revise the policy, then executes the learned tree directly.

MARL methods learn black-box policies; online LLM agents reason at every step; Co-π-tree revises explicit branches during learning and deploys the final policy tree directly.

Co-π-tree follows a closed-loop pipeline: policy construction, environment grounding, and policy refinement.
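The three stages can be pictured as a simple loop. The sketch below is only a toy under heavy assumptions: the LLM calls and the Overcooked-AI rollouts are replaced by hard-coded stand-in functions, and the function names, the dictionary policy representation, and the feedback format are all hypothetical rather than taken from the paper.

```python
from typing import Dict, List

# Hypothetical policy representation: branch condition -> high-level action.
Policy = Dict[str, str]


def construct_policy() -> Policy:
    """Stand-in for LLM policy construction (returns a fixed, flawed draft)."""
    return {"pot_ready": "fetch_onion", "holding_soup": "deliver_soup"}


def ground_in_environment(policy: Policy) -> List[str]:
    """Stand-in for partner rollouts: return natural-language feedback that
    names branches which behaved badly. An empty list means no issue found."""
    feedback = []
    if policy.get("pot_ready") != "fetch_dish":
        feedback.append("pot_ready: soup was left in the pot; fetch a dish instead")
    return feedback


def refine_policy(policy: Policy, feedback: List[str]) -> Policy:
    """Stand-in for LLM refinement: patch only the branches named in feedback."""
    revised = dict(policy)
    for note in feedback:
        branch, _, _ = note.partition(":")
        if branch == "pot_ready":
            revised[branch] = "fetch_dish"
    return revised


def closed_loop(max_rounds: int = 3) -> Policy:
    """Construct once, then alternate grounding and refinement until clean."""
    policy = construct_policy()
    for _ in range(max_rounds):
        feedback = ground_in_environment(policy)
        if not feedback:
            break
        policy = refine_policy(policy, feedback)
    return policy
```

The point of the sketch is the control flow: feedback targets individual branches, so refinement edits only those branches, and the loop stops once evaluation raises no further complaints or the round budget is spent.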

Environment

We evaluate in Overcooked-AI, a standard zero-shot coordination benchmark where two agents cooperate to cook and deliver soups under different layout constraints. The five layouts stress different collaboration skills, including fast handoff, bottleneck avoidance, role specialization, and forced interdependence.

The benchmark layouts are Cramped Room, Coordination Ring, Counter Circuit, Asymmetric Advantages, and Forced Coordination.

Key Numbers

+35.4% average reward relative to the baseline average across AI-partner evaluations.

77.7% reduction in LLM queries relative to online LLM baselines.

97.1% reduction in test-time latency after the policy tree is learned.

Citation

@misc{zhang2026distillingllmreasoninginterpretable,
      title={Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration}, 
      author={Beiwen Zhang and Yongheng Liang and Guowei Zou and Haitao Wang and Hejun Wu},
      year={2026},
      eprint={2606.08596},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.08596}, 
}