Experimental Setup

We evaluate whether an interpretable policy tree can collaborate with unseen AI partners and real human partners while avoiding online LLM calls at every decision step.

The benchmark contains five Overcooked-AI layouts: Cramped Room, Coordination Ring, Counter Circuit, Asymmetric Advantages, and Forced Coordination. Each episode lasts 400 environment steps, and team reward is determined by delivered soups.
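As a rough aid for reading the reward numbers, team reward can be converted to a soup count. The sketch below assumes 20 reward per delivered soup, a common Overcooked-AI default that this section does not state. `REWARD_PER_SOUP`, `soups_delivered`, and `steps_per_soup` are illustrative names.

```python
# Assumption: 20 reward per delivered soup (a common Overcooked-AI default,
# not stated in this section).
REWARD_PER_SOUP = 20
EPISODE_STEPS = 400  # episode length given in the setup above


def soups_delivered(team_reward: float) -> float:
    """Convert an episode's team reward into an equivalent number of soups."""
    return team_reward / REWARD_PER_SOUP


def steps_per_soup(team_reward: float) -> float:
    """Average environment steps spent per delivered soup in one episode."""
    return EPISODE_STEPS / soups_delivered(team_reward)


print(soups_delivered(180.0))  # 9.0
print(steps_per_soup(160.0))   # 50.0
```

Under this assumption, a mean reward of about 180 corresponds to roughly nine soups per 400-step episode.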

Figure: The five Overcooked-AI layouts used in the evaluation.

AI Partners

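Six held-out partner policies are used: self-play (SP), population-based training (PBT), Fictitious Co-Play (FCP), Maximum Entropy Population-based training (MEP), Cooperative Open-ended Learning (COLE), and behavior cloning (BC).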

LLM Baselines

ProAgent and CausalPlan represent online LLM-based collaboration methods.

Variants

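Co-π-tree is compared with three ablated variants: Co-π-tree-PI, which does not explicitly condition action selection on the predicted partner behavior; Co-π-tree-w/o P, which removes partner prediction; and Co-π-tree w/o R, which removes iterative refinement.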


AI Partner Results

Q1: How effectively does Co-π-tree collaborate with unseen AI partners?

Answer: Co-π-tree outperforms all baselines in 8 of 10 layout-role settings and improves average reward by 35.4% over the baseline average, while reducing the number of LLM queries by 77.7% and test-time latency by 97.1% relative to the online LLM baselines.

Table 1: Zero-Shot Coordination (ZSC) With AI Partners

Mean team reward ± standard deviation. P0/P1 indicate which player role is controlled by the evaluated policy.

| Layout | Role | SP | PBT | FCP | MEP | COLE | BC | ProAgent | CausalPlan | Co-π-tree |
|---|---|---|---|---|---|---|---|---|---|---|
| Cramped Room | P0 | 155.0 ± 43 | 161.3 ± 45 | 175.8 ± 30 | 161.3 ± 42 | 153.8 ± 29 | 131.3 ± 37 | 171.0 ± 26 | 174.5 ± 20 | 182.3 ± 18 |
| Cramped Room | P1 | 157.5 ± 32 | 158.8 ± 53 | 170.3 ± 33 | 168.8 ± 35 | 153.8 ± 37 | 130.0 ± 37 | 168.3 ± 18 | 170.2 ± 16 | 180.1 ± 16 |
| Coordination Ring | P0 | 103.8 ± 49 | 118.8 ± 39 | 136.3 ± 28 | 152.0 ± 21 | 150.3 ± 32 | 96.3 ± 37 | 151.0 ± 34 | 155.7 ± 24 | 165.9 ± 28 |
| Coordination Ring | P1 | 127.5 ± 49 | 127.5 ± 37 | 137.5 ± 24 | 140.0 ± 39 | 144.0 ± 31 | 96.3 ± 41 | 143.3 ± 28 | 152.0 ± 26 | 162.1 ± 24 |
| Counter Circuit | P0 | 30.0 ± 33 | 43.8 ± 49 | 42.5 ± 45 | 53.8 ± 35 | 85.0 ± 30 | 47.5 ± 35 | 108.3 ± 24 | 108.8 ± 21 | 117.8 ± 15 |
| Counter Circuit | P1 | 36.3 ± 26 | 35.0 ± 39 | 37.5 ± 45 | 65.0 ± 41 | 86.3 ± 34 | 40.0 ± 35 | 106.0 ± 19 | 104.9 ± 21 | 116.0 ± 17 |
| Asymmetric Advantages | P0 | 151.3 ± 60 | 147.5 ± 73 | 141.3 ± 68 | 116.3 ± 74 | 187.5 ± 46 | 150.0 ± 56 | 256.7 ± 31 | 250.4 ± 25 | 274.6 ± 31 |
| Asymmetric Advantages | P1 | 190.0 ± 32 | 136.3 ± 76 | 170.0 ± 46 | 177.5 ± 59 | 150.0 ± 70 | 82.5 ± 75 | 228.3 ± 15 | 228.6 ± 20 | 239.1 ± 18 |
| Forced Coordination | P0 | 12.5 ± 17 | 21.3 ± 22 | 56.3 ± 42 | 23.8 ± 25 | 40.0 ± 32 | 40.0 ± 23 | 56.7 ± 22 | 64.3 ± 25 | 62.7 ± 31 |
| Forced Coordination | P1 | 28.8 ± 25 | 61.3 ± 42 | 18.8 ± 26 | 35.0 ± 32 | 41.3 ± 24 | 21.3 ± 22 | 33.3 ± 34 | 34.5 ± 28 | 35.6 ± 22 |
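The win count and average improvement quoted in the answer to Q1 can be checked against the means in Table 1. The sketch below is one reading that reproduces both figures: a "win" means Co-π-tree strictly exceeds every baseline in a setting, and the "baseline average" is the pooled mean over all eight baselines and ten settings. Both definitions are assumptions, since the section does not spell them out, and the function names are illustrative.

```python
# Mean team rewards transcribed from Table 1 (standard deviations omitted).
# Row order: Cramped Room P0/P1, Coordination Ring P0/P1, Counter Circuit P0/P1,
# Asymmetric Advantages P0/P1, Forced Coordination P0/P1.
BASELINES = {
    "SP":         [155.0, 157.5, 103.8, 127.5, 30.0, 36.3, 151.3, 190.0, 12.5, 28.8],
    "PBT":        [161.3, 158.8, 118.8, 127.5, 43.8, 35.0, 147.5, 136.3, 21.3, 61.3],
    "FCP":        [175.8, 170.3, 136.3, 137.5, 42.5, 37.5, 141.3, 170.0, 56.3, 18.8],
    "MEP":        [161.3, 168.8, 152.0, 140.0, 53.8, 65.0, 116.3, 177.5, 23.8, 35.0],
    "COLE":       [153.8, 153.8, 150.3, 144.0, 85.0, 86.3, 187.5, 150.0, 40.0, 41.3],
    "BC":         [131.3, 130.0, 96.3, 96.3, 47.5, 40.0, 150.0, 82.5, 40.0, 21.3],
    "ProAgent":   [171.0, 168.3, 151.0, 143.3, 108.3, 106.0, 256.7, 228.3, 56.7, 33.3],
    "CausalPlan": [174.5, 170.2, 155.7, 152.0, 108.8, 104.9, 250.4, 228.6, 64.3, 34.5],
}
CO_PI_TREE = [182.3, 180.1, 165.9, 162.1, 117.8, 116.0, 274.6, 239.1, 62.7, 35.6]


def count_wins(ours, baselines):
    """Number of settings where `ours` strictly beats every baseline."""
    return sum(
        all(value > row[i] for row in baselines.values())
        for i, value in enumerate(ours)
    )


def relative_improvement(ours, baselines):
    """Percent gain of our mean over the pooled mean of all baseline entries."""
    ours_mean = sum(ours) / len(ours)
    pooled = [v for row in baselines.values() for v in row]
    base_mean = sum(pooled) / len(pooled)
    return 100.0 * (ours_mean - base_mean) / base_mean


wins = count_wins(CO_PI_TREE, BASELINES)
gain = relative_improvement(CO_PI_TREE, BASELINES)
print(f"wins: {wins} of {len(CO_PI_TREE)}")        # wins: 8 of 10
print(f"gain over baseline average: {gain:.1f}%")  # gain over baseline average: 35.4%
```

The two settings that are not wins are the Forced Coordination roles, where CausalPlan (P0) and PBT (P1) have higher means. The comparison uses means only and ignores the reported standard deviations.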
Figure: LLM query and latency comparison against ProAgent and CausalPlan. After learning, Co-π-tree executes the policy tree directly.

Human Partners

Q2: Does the policy tree preserve collaboration quality with real human partners?

Answer: With human partners, the full method achieves the highest mean reward on all five layouts when compared against its ablations. Each volunteer's rewards under the two player-role assignments are averaged before comparison.

Figure: Box plots of human-agent collaboration rewards by layout. Boxes aggregate per-volunteer averages over the two player-role assignments.
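The per-volunteer aggregation can be sketched as follows. The volunteer IDs and reward values are hypothetical and serve only to show how the two role assignments collapse into one number per volunteer before plotting. `per_volunteer_average` is an illustrative name, not part of any released code.

```python
from statistics import mean

# Hypothetical records: (volunteer_id, agent_role, team_reward).
# These values are illustrative only and are not data from the study.
episodes = [
    ("v1", "P0", 160.0), ("v1", "P1", 180.0),
    ("v2", "P0", 140.0), ("v2", "P1", 120.0),
    ("v3", "P0", 200.0), ("v3", "P1", 160.0),
]


def per_volunteer_average(rows):
    """Average each volunteer's rewards over both role assignments."""
    by_volunteer = {}
    for volunteer, _role, reward in rows:
        by_volunteer.setdefault(volunteer, []).append(reward)
    return {volunteer: mean(rewards) for volunteer, rewards in by_volunteer.items()}


averages = per_volunteer_average(episodes)
print(averages)  # {'v1': 170.0, 'v2': 130.0, 'v3': 180.0}
```

Each box in the plot would then summarize the distribution of these per-volunteer values for one method on one layout.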

Ablation Studies

Q3: Which components matter: partner-behavior prediction, partner-conditioned action selection, and iterative refinement?

Table 2: Partner Prediction Ablation

Co-π-tree-PI keeps partner prediction as intermediate reasoning but does not explicitly condition action selection on the predicted behavior. Co-π-tree-w/o P removes partner prediction.

| Layout | Role | Co-π-tree | Co-π-tree-PI | Co-π-tree-w/o P |
|---|---|---|---|---|
| Cramped Room | P0 | 182.3 ± 18 | 176.7 ± 24 | 168.0 ± 25 |
| Cramped Room | P1 | 180.1 ± 16 | 176.0 ± 19 | 165.7 ± 22 |
| Coordination Ring | P0 | 165.9 ± 28 | 168.0 ± 26 | 158.0 ± 29 |
| Coordination Ring | P1 | 162.1 ± 24 | 169.3 ± 23 | 152.0 ± 25 |
| Counter Circuit | P0 | 117.8 ± 15 | 110.7 ± 16 | 106.0 ± 18 |
| Counter Circuit | P1 | 116.0 ± 17 | 114.0 ± 18 | 105.2 ± 21 |
| Asymmetric Advantages | P0 | 274.6 ± 31 | 282.7 ± 35 | 266.7 ± 34 |
| Asymmetric Advantages | P1 | 239.1 ± 18 | 242.0 ± 15 | 234.7 ± 22 |
| Forced Coordination | P0 | 62.7 ± 31 | 66.0 ± 24 | 62.0 ± 28 |
| Forced Coordination | P1 | 35.6 ± 22 | 44.1 ± 26 | 32.8 ± 23 |
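A simple way to read Table 2 is to count, per pair of methods, how many settings one mean exceeds the other and by how much on average. The sketch below does this for the means only. It ignores the reported standard deviations, so it is not a significance test, and the helper names are illustrative.

```python
# Mean rewards transcribed from Table 2 (standard deviations omitted).
# Row order: Cramped Room P0/P1, Coordination Ring P0/P1, Counter Circuit P0/P1,
# Asymmetric Advantages P0/P1, Forced Coordination P0/P1.
FULL = [182.3, 180.1, 165.9, 162.1, 117.8, 116.0, 274.6, 239.1, 62.7, 35.6]
PI = [176.7, 176.0, 168.0, 169.3, 110.7, 114.0, 282.7, 242.0, 66.0, 44.1]
NO_P = [168.0, 165.7, 158.0, 152.0, 106.0, 105.2, 266.7, 234.7, 62.0, 32.8]


def wins(a, b):
    """Number of settings where method `a` has a strictly higher mean than `b`."""
    return sum(x > y for x, y in zip(a, b))


def mean_gap(a, b):
    """Average per-setting difference in mean reward (a minus b)."""
    return sum(x - y for x, y in zip(a, b)) / len(a)


print(wins(FULL, NO_P))  # 10
print(wins(FULL, PI))    # 4
print(f"{mean_gap(FULL, NO_P):.2f}")
print(f"{mean_gap(FULL, PI):.2f}")
```

On these means, removing partner prediction lowers reward in every setting, while the comparison between Co-π-tree and Co-π-tree-PI is mixed across layouts.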

Table 3: Iterative Refinement Ablation

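Co-π-tree w/o R removes refinement and relies only on the initial prompt construction. The full loop, which repairs weak branches, improves mean reward in 8 of 10 settings. The two exceptions are the Forced Coordination roles, where the variant without refinement scores slightly higher (66.8 vs. 62.7 for P0 and 35.8 vs. 35.6 for P1).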

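| Layout | Role | Co-π-tree | Co-π-tree w/o R |
|---|---|---|---|
| Cramped Room | P0 | 182.3 ± 18 | 163.1 ± 16 |
| Cramped Room | P1 | 180.1 ± 16 | 161.5 ± 22 |
| Coordination Ring | P0 | 165.9 ± 28 | 153.8 ± 29 |
| Coordination Ring | P1 | 162.1 ± 24 | 148.3 ± 27 |
| Counter Circuit | P0 | 117.8 ± 15 | 104.0 ± 18 |
| Counter Circuit | P1 | 116.0 ± 17 | 100.7 ± 16 |
| Asymmetric Advantages | P0 | 274.6 ± 31 | 260.3 ± 25 |
| Asymmetric Advantages | P1 | 239.1 ± 18 | 226.6 ± 22 |
| Forced Coordination | P0 | 62.7 ± 31 | 66.8 ± 29 |
| Forced Coordination | P1 | 35.6 ± 22 | 35.8 ± 23 |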
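The per-setting effect of refinement can be tabulated directly from the means in Table 3. The sketch below computes the gain (full method minus the variant without refinement) for each layout-role setting and lists the settings with no gain. It uses means only, and the variable names are illustrative.

```python
# Mean rewards transcribed from Table 3 (standard deviations omitted).
SETTINGS = [
    (layout, role)
    for layout in ("Cramped Room", "Coordination Ring", "Counter Circuit",
                   "Asymmetric Advantages", "Forced Coordination")
    for role in ("P0", "P1")
]
FULL = [182.3, 180.1, 165.9, 162.1, 117.8, 116.0, 274.6, 239.1, 62.7, 35.6]
NO_REFINE = [163.1, 161.5, 153.8, 148.3, 104.0, 100.7, 260.3, 226.6, 66.8, 35.8]

# Gain in mean reward with refinement enabled, rounded to the table's precision.
gains = {s: round(f - n, 1) for s, f, n in zip(SETTINGS, FULL, NO_REFINE)}
not_improved = [s for s, g in gains.items() if g <= 0]

for setting, gain in gains.items():
    print(setting, gain)
print("no gain from refinement:", not_improved)
```

The gains are largest in Cramped Room and fall to zero or below only in Forced Coordination.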

Refinement and Transfer

Additional experiments inspect how accepted rewards grow over refinement iterations and how policy trees transfer across layouts.

Figure: Reward growth across refinement rounds. Accepted reward trajectories over 10 refinement iterations across five Overcooked-AI layouts.
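One plausible reading of an "accepted reward" trajectory is an accept-if-better rule: a refined tree replaces the current one only when its evaluated reward is higher, so the accepted reward never decreases. The sketch below illustrates that reading. The acceptance rule, the function name, and all reward values are assumptions for illustration, not the paper's stated procedure or data.

```python
def accepted_trajectory(initial_reward, candidate_rewards):
    """Track the accepted reward after each refinement iteration.

    Assumption: a refined tree replaces the current one only if its
    evaluated reward is strictly higher (accept-if-better).
    """
    accepted = initial_reward
    trajectory = []
    for reward in candidate_rewards:
        if reward > accepted:
            accepted = reward
        trajectory.append(accepted)
    return trajectory


# Hypothetical rewards for 10 candidate trees; not data from the paper.
candidates = [120, 110, 140, 135, 150, 150, 145, 160, 155, 165]
trajectory = accepted_trajectory(100, candidates)
print(trajectory)
# [120, 120, 140, 140, 150, 150, 150, 160, 160, 165]
```

Under this rule, a flat segment in the curve marks iterations whose candidate was rejected.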
Figure: Cross-layout transfer heatmaps for SP, ProAgent, and Co-π-tree, reported as normalized transfer retention.
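The section does not define "normalized transfer retention". A common choice, assumed here, is to divide the reward a policy earns on a target layout after being learned on a source layout by the reward of a policy learned on that target layout itself. The sketch below computes such a matrix. The layout labels and reward values are hypothetical.

```python
def transfer_retention(reward):
    """Compute retention[src][tgt] = reward[src][tgt] / reward[tgt][tgt].

    `reward[src][tgt]` is the reward on layout `tgt` of a policy learned on
    layout `src`. This normalization is an assumption, not the paper's
    stated definition.
    """
    return {
        src: {tgt: row[tgt] / reward[tgt][tgt] for tgt in row}
        for src, row in reward.items()
    }


# Hypothetical rewards for two layouts "A" and "B"; not data from the paper.
reward = {
    "A": {"A": 200.0, "B": 80.0},
    "B": {"A": 150.0, "B": 100.0},
}
retention = transfer_retention(reward)
print(retention["A"]["B"])  # 0.8
print(retention["B"]["A"])  # 0.75
```

With this normalization the diagonal of the heatmap is 1.0, and off-diagonal cells show the fraction of in-layout reward that survives the transfer.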