Inside the engine

The built-in agent runs on a small, strict loop: the model picks tool calls, the engine runs them, feeds the results back, and asks again — over and over until the session closes itself. Everything else on this page is a rule that makes that loop survive a long run on a real phone. It’s the most internal page in these docs; you don’t need any of it to use PhysiClaw, but if you’re the kind of builder who wants to know why the agent behaves the way it does, here it is.

The loop

One session is one wake. The engine builds a system prompt and an opening message (the trigger that woke it), then spins this loop:

   ┌──────────────────────────────────────────────────────┐
   │                                                      │
   ▼                                                      │
 MODEL ─────► TOOL_CALLS ─────► DISPATCH ─────► TOOL_RESULTS
 picks tools  validated +       run each tool   appended to
 for this     shape-checked     (MCP or local)  the transcript
 turn                                                  │
   ▲                                                   │
   └───────────────────── ask again ──────────────────┘

There is no hidden orchestration script. The model decides what to do; the engine’s job is to run those decisions safely and hand back honest results. A tool that fails comes back as an error result the model can read and react to — never a crash. The transcript is always kept legal for the provider’s API: every tool call gets exactly one matching result, in the very next message.

This is a native tool-call loop — the structure rides on the provider’s real tools=[…] API and the model’s real tool_calls, not on the model hand-writing JSON into prose for the engine to parse. That distinction is the spine of the whole design.

How a session ends: the sentinel

The loop doesn’t stop because it hit some external timer. It stops when the model decides the work is over and closes the session itself, by calling end_session(status, recap). The status is one of five sentinel words, and the choice carries real consequences:

Status	Meaning	What happens next
`DONE`	Task complete.	Session ends. Final.
`FAIL`	Task impossible — sold out, account locked, against the rules.	Session ends. Final.
`WAIT`	Needs a human reply; the agent has stopped waiting in-session.	Session ends, then a follow-up is scheduled (see below).
`IDLE`	Nothing to do — the wake was spurious, no new message.	Session ends. Final.
`STUCK`	Unrecoverable mid-task: the phone won’t unlock, an app crashed, a loop ran away.	The engine retries with a fresh session.

Four of these are final the first time they happen. STUCK is the one that gets a second chance.

STUCK and the retry

STUCK is the engine’s word for “this attempt fell over” — and it’s raised not only when the model says so, but automatically when an attempt runs out of turns without a clean close, the provider exhausts its own retries, or the session crashes outright. Because a stuck session is often just bad luck — a transient glare on the camera, a slow load — the engine doesn’t give up. It throws the whole attempt away and starts a fresh session from the same trigger, up to 3 attempts total. A clean DONE/FAIL/WAIT/IDLE ends things on the first occurrence; only STUCK loops back.

WAIT and the auto follow-up

A WAIT means the agent messaged you and needs your answer before it can continue, so holding the loop open would waste tokens. It closes instead — but a closed session can’t wake itself. So WAIT pairs with a scheduled job to re-check later. If the agent forgets to schedule one, the engine notices and auto-schedules a generic 15-minute follow-up. That fallback is deliberately dumb; the right move is for the agent to schedule a check at the right delay (minutes for a quick reply, hours for an order to confirm), which is what the Autonomous tasks flow does.

The turn shape: `[note, one-other]`

Here’s the rule that most shapes how the agent behaves. Every single turn must call exactly two tools: one note, plus exactly one other tool. Not zero. Not three. The engine inspects each turn and, if the shape is wrong, rejects it and asks the model to re-issue — naming the offending tool so the model retries the same action rather than swapping in whatever happens to fit the shape.

Two separate constraints are bundled in that rule, and each earns its keep:

Exactly one note

note(summary=…) is a one-line, ≤20-word record of what this turn is doing and why. It’s mandatory because it’s the only part of a turn that survives compaction — once an old turn ages out, its screenshot and its tap are gone, but the note remains. The note is the agent’s permanent memory of its own reasoning.

Exactly one action

Capping a turn at one other tool forces the agent to act one step at a time — tap, then look; never tap-tap-tap blind. On a real phone where a wrong tap can send a message or place an order, “one action, then observe the result” is the whole safety story, enforced structurally instead of hoped for.

So a healthy turn looks like [note("opening the cart to check the total"), peek()] or [note("tapping checkout"), tap([0.41,0.88,0.59,0.94])]. The note narrates; the single action does the one thing.

Staying in context: compaction

A long task — a multi-screen grocery run — can take dozens of turns, and each peek carries a full screen image. Left alone, the transcript would blow past the model’s context window. The engine keeps it bounded with two tricks, both designed to drop bulk while preserving the decision trail.

Drop stale screens

Only the latest screen image is live. As soon as a newer peek or screenshot arrives, the engine reaches back and strips the image off every older one, replacing it with a small (superseded peek) stub plus the text rows from that old listing (the labels — “Checkout”, “¥39” — that stay useful as re-targetable anchors; the numbered icon boxes are meaningless without their image, so they’re dropped). The agent’s decisions stay intact — every note and every tool call it made — but it isn’t carrying ten redundant photos of screens it already moved past. Only the current view is in full.

Collapse old turns

Beyond a threshold of turns, the engine folds the oldest ones into a few compact slots near the top of the conversation:

the summaries — every aged-out turn’s note line, concatenated into a running list of what happened;
the memory loads — anything the agent pulled from memory.md or its logs, kept verbatim;
the skill loads — any SKILL.md workflow it opened, kept verbatim.

Notes are summarized; memory and skill loads are not, because those are durable reference material the agent deliberately loaded — a one-line summary can’t stand in for the workflow it’s following. By default the first fold happens around turn 30, always keeping the 10 most recent turns intact, and re-folding every 20 turns after. The exact numbers are tuned per provider so the fold lands where it costs the least against each vendor’s prompt cache.

The combined effect: a session can run long enough to finish a real, multi-step task, and the model still sees a coherent story — recent turns in full, older turns boiled down to the one line that mattered, and durable loads carried through whole.