Co-π-tree: Results

Experimental Setup

We evaluate whether an interpretable policy tree can collaborate with unseen AI partners and real human partners while avoiding online LLM calls at every decision step.

The benchmark contains five Overcooked-AI layouts: Cramped Room, Coordination Ring, Counter Circuit, Asymmetric Advantages, and Forced Coordination. Each episode lasts 400 environment steps, and team reward is determined by delivered soups.

Overcooked-AI layouts used in the evaluation.

AI Partners

SP, PBT, FCP, MEP, COLE, and BC are used as held-out partner policies.

LLM Baselines

ProAgent and CausalPlan represent online LLM-based collaboration methods.

Variants

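Co-π-tree is compared with three ablated variants: Co-π-tree-PI, which does not explicitly condition action selection on the predicted partner behavior; Co-π-tree-w/o P, which removes partner prediction; and Co-π-tree w/o R, which removes iterative refinement.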

AI Partner Results

Q1: How effectively does Co-π-tree collaborate with unseen AI partners?

Answer: Co-π-tree outperforms all baselines in 8 of 10 layout-role settings and improves average reward by 35.4% over the baseline average, while reducing the number of LLM queries by 77.7% and test-time latency by 97.1% relative to the online LLM baselines.

Video: zero-shot coordination (ZSC) with AI partners. Layout-by-layout qualitative comparisons across SP, COLE, ProAgent, and Co-π-tree: video.html

Table 1: ZSC With AI Partners

Mean team reward ± standard deviation. P0/P1 indicate which player role is controlled by the evaluated policy.

Layout / Role	SP	PBT	FCP	MEP	COLE	BC	ProAgent	CausalPlan	Co-π-tree
Cramped Rm. P0	155.0 ± 43	161.3 ± 45	175.8 ± 30	161.3 ± 42	153.8 ± 29	131.3 ± 37	171.0 ± 26	174.5 ± 20	182.3 ± 18
Cramped Rm. P1	157.5 ± 32	158.8 ± 53	170.3 ± 33	168.8 ± 35	153.8 ± 37	130.0 ± 37	168.3 ± 18	170.2 ± 16	180.1 ± 16
Coord. Ring P0	103.8 ± 49	118.8 ± 39	136.3 ± 28	152.0 ± 21	150.3 ± 32	96.3 ± 37	151.0 ± 34	155.7 ± 24	165.9 ± 28
Coord. Ring P1	127.5 ± 49	127.5 ± 37	137.5 ± 24	140.0 ± 39	144.0 ± 31	96.3 ± 41	143.3 ± 28	152.0 ± 26	162.1 ± 24
CT. Circuit P0	30.0 ± 33	43.8 ± 49	42.5 ± 45	53.8 ± 35	85.0 ± 30	47.5 ± 35	108.3 ± 24	108.8 ± 21	117.8 ± 15
CT. Circuit P1	36.3 ± 26	35.0 ± 39	37.5 ± 45	65.0 ± 41	86.3 ± 34	40.0 ± 35	106.0 ± 19	104.9 ± 21	116.0 ± 17
Asymm. Adv. P0	151.3 ± 60	147.5 ± 73	141.3 ± 68	116.3 ± 74	187.5 ± 46	150.0 ± 56	256.7 ± 31	250.4 ± 25	274.6 ± 31
Asymm. Adv. P1	190.0 ± 32	136.3 ± 76	170.0 ± 46	177.5 ± 59	150.0 ± 70	82.5 ± 75	228.3 ± 15	228.6 ± 20	239.1 ± 18
Forced Coord. P0	12.5 ± 17	21.3 ± 22	56.3 ± 42	23.8 ± 25	40.0 ± 32	40.0 ± 23	56.7 ± 22	64.3 ± 25	62.7 ± 31
Forced Coord. P1	28.8 ± 25	61.3 ± 42	18.8 ± 26	35.0 ± 32	41.3 ± 24	21.3 ± 22	33.3 ± 34	34.5 ± 28	35.6 ± 22
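
The two headline figures in the answer to Q1 can be re-derived from the mean rewards in Table 1. The sketch below is a minimal check, assuming that "baseline average" means the unweighted mean over all eight baseline columns and all ten layout-role rows (the page does not spell out the averaging). The variable names are illustrative.

```python
# Mean team rewards copied from Table 1 (standard deviations omitted).
# Row order: Cramped Rm. P0/P1, Coord. Ring P0/P1, CT. Circuit P0/P1,
# Asymm. Adv. P0/P1, Forced Coord. P0/P1.
baselines = {
    "SP":         [155.0, 157.5, 103.8, 127.5, 30.0, 36.3, 151.3, 190.0, 12.5, 28.8],
    "PBT":        [161.3, 158.8, 118.8, 127.5, 43.8, 35.0, 147.5, 136.3, 21.3, 61.3],
    "FCP":        [175.8, 170.3, 136.3, 137.5, 42.5, 37.5, 141.3, 170.0, 56.3, 18.8],
    "MEP":        [161.3, 168.8, 152.0, 140.0, 53.8, 65.0, 116.3, 177.5, 23.8, 35.0],
    "COLE":       [153.8, 153.8, 150.3, 144.0, 85.0, 86.3, 187.5, 150.0, 40.0, 41.3],
    "BC":         [131.3, 130.0, 96.3, 96.3, 47.5, 40.0, 150.0, 82.5, 40.0, 21.3],
    "ProAgent":   [171.0, 168.3, 151.0, 143.3, 108.3, 106.0, 256.7, 228.3, 56.7, 33.3],
    "CausalPlan": [174.5, 170.2, 155.7, 152.0, 108.8, 104.9, 250.4, 228.6, 64.3, 34.5],
}
co_pi_tree = [182.3, 180.1, 165.9, 162.1, 117.8, 116.0, 274.6, 239.1, 62.7, 35.6]

# Count the settings where Co-π-tree strictly beats every baseline.
wins = sum(
    ours > max(column[i] for column in baselines.values())
    for i, ours in enumerate(co_pi_tree)
)

# Assumption: the baseline average is the unweighted mean over all baseline cells.
all_cells = [value for column in baselines.values() for value in column]
baseline_avg = sum(all_cells) / len(all_cells)
ours_avg = sum(co_pi_tree) / len(co_pi_tree)
improvement_pct = 100 * (ours_avg - baseline_avg) / baseline_avg

print(f"wins: {wins} of {len(co_pi_tree)}")
print(f"improvement over baseline average: {improvement_pct:.1f}%")
```

Under this reading the script reports 8 of 10 wins and a 35.4% improvement. The two settings without a win are the Forced Coordination rows, where CausalPlan (P0) and PBT (P1) have the highest means.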

Query and latency comparison against ProAgent and CausalPlan. After learning, Co-π-tree executes the policy tree directly.

Human Partners

Q2: Does the policy tree preserve collaboration quality with real human partners?

Answer: In the human-agent experiments, the full Co-π-tree achieves the highest mean reward on all five layouts when compared with its ablated variants. Rewards from the two player-role assignments are merged into per-volunteer averages.

Video: ZSC with human partners. Qualitative comparisons across five MARL methods and three Co-π-tree variants: video-zsc-human.html

Human-agent collaboration rewards by layout. Boxes aggregate per-volunteer averages over the two player-role assignments.
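
The aggregation described in the caption can be sketched as follows. This is a minimal illustration, assuming one reward record per volunteer, layout, and role. The record format is hypothetical, and the reward values are placeholders, not data from the study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical placeholder records, NOT results from the study:
# (volunteer_id, layout, role, team_reward)
records = [
    ("v1", "Cramped Room", "P0", 160),
    ("v1", "Cramped Room", "P1", 180),
    ("v2", "Cramped Room", "P0", 140),
    ("v2", "Cramped Room", "P1", 120),
]

# Step 1: collect each volunteer's rewards per layout across both roles.
rewards_by_volunteer = defaultdict(list)
for volunteer, layout, _role, reward in records:
    rewards_by_volunteer[(layout, volunteer)].append(reward)

# Step 2: average over roles, so each volunteer contributes one value per layout.
per_layout = defaultdict(list)
for (layout, _volunteer), rewards in rewards_by_volunteer.items():
    per_layout[layout].append(mean(rewards))

for layout, values in per_layout.items():
    print(layout, values)
```

Each list in `per_layout` holds one value per volunteer, which is the kind of sample a box plot for that layout would summarize.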

Ablation Studies

Q3: How much do partner-behavior prediction, partner-conditioned action selection, and iterative refinement each contribute?

Table 2: Partner Prediction Ablation

Co-π-tree-PI keeps partner prediction as intermediate reasoning but does not explicitly condition action selection on the predicted behavior. Co-π-tree-w/o P removes partner prediction.

Layout / Role	Co-π-tree	Co-π-tree-PI	Co-π-tree-w/o P
Cramped Rm. P0	182.3 ± 18	176.7 ± 24	168.0 ± 25
Cramped Rm. P1	180.1 ± 16	176.0 ± 19	165.7 ± 22
Coord. Ring P0	165.9 ± 28	168.0 ± 26	158.0 ± 29
Coord. Ring P1	162.1 ± 24	169.3 ± 23	152.0 ± 25
CT. Circuit P0	117.8 ± 15	110.7 ± 16	106.0 ± 18
CT. Circuit P1	116.0 ± 17	114.0 ± 18	105.2 ± 21
Asymm. Adv. P0	274.6 ± 31	282.7 ± 35	266.7 ± 34
Asymm. Adv. P1	239.1 ± 18	242.0 ± 15	234.7 ± 22
Forced Coord. P0	62.7 ± 31	66.0 ± 24	62.0 ± 28
Forced Coord. P1	35.6 ± 22	44.1 ± 26	32.8 ± 23
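
To read Table 2 row by row, the following sketch tallies which variant has the highest mean in each layout-role setting and checks whether removing partner prediction is always the weakest option. It uses the means only; the variable names are illustrative.

```python
# Mean team rewards copied from Table 2 (standard deviations omitted).
settings = [
    "Cramped Rm. P0", "Cramped Rm. P1", "Coord. Ring P0", "Coord. Ring P1",
    "CT. Circuit P0", "CT. Circuit P1", "Asymm. Adv. P0", "Asymm. Adv. P1",
    "Forced Coord. P0", "Forced Coord. P1",
]
means = {
    "Co-π-tree":       [182.3, 180.1, 165.9, 162.1, 117.8, 116.0, 274.6, 239.1, 62.7, 35.6],
    "Co-π-tree-PI":    [176.7, 176.0, 168.0, 169.3, 110.7, 114.0, 282.7, 242.0, 66.0, 44.1],
    "Co-π-tree-w/o P": [168.0, 165.7, 158.0, 152.0, 106.0, 105.2, 266.7, 234.7, 62.0, 32.8],
}

# Tally the variant with the highest mean in each setting.
best_counts = {name: 0 for name in means}
for i in range(len(settings)):
    winner = max(means, key=lambda name: means[name][i])
    best_counts[winner] += 1

# Check whether the variant without partner prediction is lowest everywhere.
no_p_always_lowest = all(
    means["Co-π-tree-w/o P"][i]
    < min(means["Co-π-tree"][i], means["Co-π-tree-PI"][i])
    for i in range(len(settings))
)

print(best_counts)
print("w/o P lowest in every setting:", no_p_always_lowest)
```

On the table's means, the full method is highest in 4 settings (both Cramped Room and both Counter Circuit roles), Co-π-tree-PI is highest in the other 6, and Co-π-tree-w/o P is lowest in all 10. Several of these gaps are small relative to the reported standard deviations, so the tally is not a significance test.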

Table 3: Iterative Refinement Ablation

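The w/o R variant removes refinement and relies only on the initial prompt construction. The full loop repairs weak branches and improves mean reward in 8 of the 10 layout-role settings. In the two Forced Coordination settings, the variant without refinement scores slightly higher (66.8 vs. 62.7 for P0 and 35.8 vs. 35.6 for P1).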

Layout / Role	Co-π-tree	Co-π-tree w/o R
Cramped Rm. P0	182.3 ± 18	163.1 ± 16
Cramped Rm. P1	180.1 ± 16	161.5 ± 22
Coord. Ring P0	165.9 ± 28	153.8 ± 29
Coord. Ring P1	162.1 ± 24	148.3 ± 27
CT. Circuit P0	117.8 ± 15	104.0 ± 18
CT. Circuit P1	116.0 ± 17	100.7 ± 16
Asymm. Adv. P0	274.6 ± 31	260.3 ± 25
Asymm. Adv. P1	239.1 ± 18	226.6 ± 22
Forced Coord. P0	62.7 ± 31	66.8 ± 29
Forced Coord. P1	35.6 ± 22	35.8 ± 23
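
The per-setting effect of refinement can be read off Table 3 as a simple difference of means. The sketch below computes that gain for each row and its average. It ignores the standard deviations, and the variable names are illustrative.

```python
# Mean team rewards copied from Table 3 (standard deviations omitted).
settings = [
    "Cramped Rm. P0", "Cramped Rm. P1", "Coord. Ring P0", "Coord. Ring P1",
    "CT. Circuit P0", "CT. Circuit P1", "Asymm. Adv. P0", "Asymm. Adv. P1",
    "Forced Coord. P0", "Forced Coord. P1",
]
full = [182.3, 180.1, 165.9, 162.1, 117.8, 116.0, 274.6, 239.1, 62.7, 35.6]
without_refinement = [163.1, 161.5, 153.8, 148.3, 104.0, 100.7, 260.3, 226.6, 66.8, 35.8]

# Gain from refinement per setting, rounded to the table's precision.
gains = [round(f - r, 1) for f, r in zip(full, without_refinement)]
improved = sum(gain > 0 for gain in gains)
mean_gain = sum(gains) / len(gains)

for name, gain in zip(settings, gains):
    print(f"{name:<18}{gain:+.1f}")
print(f"improved in {improved} of {len(gains)} settings, mean gain {mean_gain:.2f}")
```

The gains are positive everywhere except the two Forced Coordination rows, and the mean gain across all ten settings is about 11.5 reward points.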

Refinement and Transfer

Additional experiments inspect how accepted rewards grow over refinement iterations and how policy trees transfer across layouts.

Accepted reward trajectories over 10 refinement iterations across five Overcooked-AI layouts.

Cross-layout transfer heatmaps for SP, ProAgent, and Co-π-tree, reported as normalized transfer retention.