Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

Beiwen Zhang*, Yongheng Liang*, Guowei Zou, Haitao Wang, Hejun Wu
Sun Yat-sen University
* Equal contribution. Corresponding author.

Selected Rollout Examples

Representative Co-π-tree rollouts on the five benchmark layouts, followed by matching human-AI collaboration clips.


TL;DR: Co-π-tree is a closed-loop method that learns an executable policy tree consisting of a partner-behavior prediction tree and an agent-action selection tree. It distills LLM reasoning into policy tree code, evaluates the policy through partner interaction, and uses natural language feedback to improve problematic branches.
The learned policy tree preserves coordination quality while reducing online LLM dependence at test time.
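
To make the two-tree structure concrete, here is a minimal Python sketch of how a partner-behavior prediction tree could feed an agent-action selection tree. It is not the authors' generated policy code: the state keys (`partner_holding`, `pot_full`), the intent and action labels, and the helper names `Branch`, `Leaf`, `evaluate`, and `act` are all hypothetical, chosen only to illustrate the idea in an Overcooked-like setting.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

State = Dict[str, object]  # hypothetical flat state description


@dataclass
class Leaf:
    label: str  # a predicted partner intent or a chosen agent action


@dataclass
class Branch:
    rationale: str                 # human-readable reason for the test
    test: Callable[[State], bool]  # condition checked on the state
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, Branch]


def evaluate(node: Node, state: State) -> str:
    """Walk from the root to a leaf and return the leaf label."""
    while isinstance(node, Branch):
        node = node.if_true if node.test(state) else node.if_false
    return node.label


# Hypothetical partner-behavior prediction tree.
partner_tree = Branch(
    "a partner holding an onion is probably heading to the pot",
    lambda s: s["partner_holding"] == "onion",
    Leaf("fill_pot"),
    Branch(
        "a partner holding a dish is probably going to serve",
        lambda s: s["partner_holding"] == "dish",
        Leaf("serve_soup"),
        Leaf("idle"),
    ),
)

# Hypothetical agent-action selection tree; it also reads the predicted intent.
action_tree = Branch(
    "the partner is filling the pot, so prepare a dish instead",
    lambda s: s["partner_intent"] == "fill_pot",
    Leaf("fetch_dish"),
    Branch(
        "nobody is filling the pot and it still has room",
        lambda s: not s["pot_full"],
        Leaf("fetch_onion"),
        Leaf("fetch_dish"),
    ),
)


def act(state: State) -> str:
    """Predict the partner first, then pick the agent action. No LLM call."""
    intent = evaluate(partner_tree, state)
    return evaluate(action_tree, {**state, "partner_intent": intent})
```

With these toy trees, `act({"partner_holding": "onion", "pot_full": False})` returns `"fetch_dish"`. Every decision can be traced to a branch and its rationale, which is the property that makes a policy of this kind inspectable.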

Overview

Existing zero-shot coordination methods often fall into two camps: MARL policies execute quickly but are difficult to interpret, while LLM-based agents expose language reasoning but query the model at every decision step. Co-π-tree uses LLMs to construct and revise the policy, then executes the learned tree directly.

Figure: Motivation for Co-π-tree. MARL methods learn black-box policies; online LLM agents reason at every step; Co-π-tree revises explicit branches during learning and deploys the final policy tree directly.

Figure: Co-π-tree framework. Co-π-tree follows a closed-loop pipeline: policy construction, environment grounding, and policy refinement.
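
The three stages of the closed loop can be sketched as a simple control flow. The Python below is an illustrative toy under stated assumptions, not the paper's implementation: the policy is a plain dictionary standing in for tree code, `construct_policy` and `refine_policy` are stubs where an LLM would be called, and `ground_in_environment` replaces real partner rollouts with a fixed lookup. In the actual method the feedback is natural language; here it is reduced to `(branch, suggested_action)` pairs so the example stays self-contained.

```python
from typing import Dict, List, Tuple

Policy = Dict[str, str]            # branch condition -> action (toy stand-in)
Feedback = List[Tuple[str, str]]   # (problematic branch, suggested action)


def construct_policy() -> Policy:
    """Stage 1, policy construction: stub for an LLM-drafted initial policy."""
    return {"pot_empty": "fetch_onion", "soup_ready": "fetch_onion"}


def ground_in_environment(policy: Policy) -> Tuple[int, Feedback]:
    """Stage 2, environment grounding: stub for rollouts with a partner.

    Returns a toy reward and feedback naming the branches that misbehaved.
    """
    # Hypothetical "what would have worked" table replacing a real simulator.
    needed = {"pot_empty": "fetch_onion", "soup_ready": "fetch_dish"}
    reward = 0
    feedback: Feedback = []
    for condition, good_action in needed.items():
        if policy.get(condition) == good_action:
            reward += 1
        else:
            feedback.append((condition, good_action))
    return reward, feedback


def refine_policy(policy: Policy, feedback: Feedback) -> Policy:
    """Stage 3, policy refinement: stub for an LLM revising flagged branches."""
    revised = dict(policy)
    for condition, suggestion in feedback:
        revised[condition] = suggestion  # only problematic branches change
    return revised


def closed_loop(max_rounds: int = 5) -> Tuple[Policy, int]:
    """Construct once, then alternate grounding and refinement."""
    policy = construct_policy()
    reward, feedback = ground_in_environment(policy)
    for _ in range(max_rounds):
        if not feedback:  # nothing left to fix
            break
        policy = refine_policy(policy, feedback)
        reward, feedback = ground_in_environment(policy)
    return policy, reward
```

Calling `closed_loop()` on this toy setup returns a policy whose `soup_ready` branch has been corrected to `"fetch_dish"`, with a reward of 2. The point of the sketch is the shape of the loop: refinement touches only the branches that feedback flags, and the returned policy runs without further model calls.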

Environment

We evaluate in Overcooked-AI, a standard zero-shot coordination benchmark where two agents cooperate to cook and deliver soups under different layout constraints. The five layouts stress different collaboration skills, including fast handoff, bottleneck avoidance, role specialization, and forced interdependence.

Figure: Five Overcooked-AI layouts. The benchmark layouts are Cramped Room, Coordination Ring, Counter Circuit, Asymmetric Advantages, and Forced Coordination.

Key Numbers

+35.4%: average reward over the baseline average across AI-partner evaluations.

-77.7%: LLM query reduction relative to online LLM baselines.

-97.1%: test-time latency reduction after the policy tree is learned.
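
Figures like these are conventionally read as relative changes against a baseline. The short Python helper below shows that arithmetic under the assumption that each percentage is computed as (value − baseline) / baseline; the page does not spell out the exact formula, and the input numbers in the example are made up for illustration, not taken from the paper.

```python
def relative_change_percent(baseline: float, value: float) -> float:
    """Signed percentage change of `value` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be nonzero")
    return (value - baseline) / baseline * 100.0


# Made-up example: 200 LLM queries per episode online vs. 50 with a learned tree.
change = relative_change_percent(200.0, 50.0)
print(f"{change:+.1f}%")  # prints "-75.0%"
```

A negative result means the quantity dropped relative to the baseline, which is the desired direction for query counts and latency; a positive result is the desired direction for reward.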


Citation

@misc{zhang2026distillingllmreasoninginterpretable,
      title={Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration}, 
      author={Beiwen Zhang and Yongheng Liang and Guowei Zou and Haitao Wang and Hejun Wu},
      year={2026},
      eprint={2606.08596},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.08596}, 
}