Method Overview

Co-π-tree addresses two linked obstacles in human-AI collaboration: opaque MARL policies and expensive online LLM decision making.

Challenge 1: Opacity

MARL policies are fast but hard to inspect, audit, or locally revise.

Co-π-tree: represents behavior as explicit branch-level rules.

Challenge 2: Latency

Online LLM agents repeatedly query the model during execution.

Co-π-tree: distills reasoning into code and executes it directly.

Challenge 3: Feedback

Sparse rewards do not reveal which local decision caused failure.

Co-π-tree: uses execution traces to revise problematic branches.

Co-pi-tree framework
The pipeline learns an interpretable policy through construction, environment grounding, and refinement.

Policy Construction

The planner generates a two-component policy tree. The partner-behavior prediction tree estimates the teammate's likely behavior, and the agent-action selection tree chooses the controlled agent's action conditioned on that prediction. A coder translates the textual tree into executable Python while preserving explicit if/elif branches.

Partner Prediction Tree

Predicts likely partner behavior from a grounded scene state, recent interaction history, and task progress.

Action Selection Tree

Chooses a complementary action conditioned on the same scene state and the predicted partner behavior.

Environment Grounding

Tensor states are converted into readable semantic fields describing task progress, object status, agent locations, and partner behavior. The executor maps high-level tree actions such as picking up onions or delivering soup into feasible movement and interaction commands.

Policy Refinement

Candidate policies are evaluated through partner interaction. The resulting trace records scene states, execution feasibility, predicted partner behavior, and realized partner behavior. A summarizer turns this feedback into localized reflection, allowing the planner to revise weak branches while preserving effective behavior elsewhere.

The key design choice is branch-level credit assignment: feedback identifies which local prediction or action branch failed, instead of asking the LLM to rewrite the whole policy.

Contributions

  • Interpretable policy-tree structure: partner prediction and action selection are represented as explicit decision branches.
  • Closed-loop distillation: LLM reasoning is distilled into executable code and improved through interaction feedback.
  • Efficient deployment: once learned, the policy tree runs without querying an LLM at every decision step.