
How it works

You already know the brain half from OpenClaw: an agent reads a screen, decides what to do, and acts. PhysiClaw runs the same idea, but the action is a physical tap on a real phone. This page walks one full tap end to end so you can see how a camera frame becomes arm motion — and why it’s built as a closed, verified loop rather than blind teleoperation.

Every action PhysiClaw takes is one turn of the same four-phase loop. Keeping the loop fixed is what makes the system reliable: each phase has one job, and each ends by looking at the result. There is no open-loop “drive the arm and hope” — the system never trusts a move until it has seen the screen change.

  ┌────────────────────────────────────────────────────────────┐
  │                                                            │
  ▼                                                            │
LOOK ─────────────► DECIDE ─────────► MOVE & TOUCH ──────► CHECK
overhead camera     agent picks a     arm drives stylus    look again:
+ on-device vision  box + a gesture   to the box, taps     did it change?
boxes every         (tap / swipe /    then lifts away        │        │
element             long-press)                           yes│      no│
                                                             │        │
                                            next action ◄────┘        │
                                         retry / re-aim ◄─────────────┘

The whole system is just one camera and one arm. There is no second camera and no depth sensor — reliability comes from re-looking after every move, not from fancier hardware.
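
As a rough sketch of the loop's shape (not PhysiClaw's actual code), the four phases can be written as one function that takes each phase as a callable. The names `run_loop`, `look`, `decide`, `act`, and `changed` are all hypothetical stand-ins chosen for this illustration.

```python
def run_loop(look, decide, act, changed, max_turns=10):
    """Run LOOK -> DECIDE -> MOVE & TOUCH -> CHECK until decide() returns None.

    All four callables are hypothetical stand-ins:
      look()             -> current screen listing
      decide(listing)    -> an action, or None when the goal is reached
      act(action)        -> performs the physical gesture
      changed(old, new)  -> True if the screen changed
    """
    history = []
    before = look()                      # LOOK
    for _ in range(max_turns):
        action = decide(before)          # DECIDE
        if action is None:
            break
        act(action)                      # MOVE & TOUCH
        after = look()                   # CHECK: always look again
        outcome = "ok" if changed(before, after) else "retry"
        history.append((action, outcome))
        before = after                   # whatever is on screen is the new state
    return history
```

Note that nothing is special-cased for a missed tap: the next turn simply decides again from the latest listing, which is the point of keeping the loop fixed.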

In the LOOK phase, PhysiClaw takes a photo of the screen with the overhead USB camera and runs it through on-device vision: RapidOCR for text and OmniParser V2 (a YOLO11m ONNX icon detector) for buttons, icons, and images. The output isn’t raw pixels; it’s a tidy listing where every element already has a label and a box:

id  kind  label           bbox [left,top,right,bottom]  conf
12  icon  "Clock"         [0.41, 0.55, 0.49, 0.63]      0.97
13  icon  "Settings"      [0.51, 0.55, 0.59, 0.63]      0.96
14  text  "Wednesday 14"  [0.30, 0.08, 0.70, 0.13]      0.99

A bbox (“bounding box”) is just a rectangle around one element, given as four numbers from 0 to 1: fractions of the screen’s width and height, as [left, top, right, bottom]. [0.41, 0.55, 0.49, 0.63] means “start 41% across and 55% down, end 49% across and 63% down.” Using fractions instead of pixels is what keeps everything portable: the same listing makes sense on any phone size. That 0–1 fraction convention is load-bearing, because the MOVE phase turns it straight into millimetres.
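
To make the fraction convention concrete, here is a minimal sketch of two helpers: one finds the center of a bbox, the other scales a bbox to pixels for a given screen size. The function names and the 1000×2000 screen are illustrative assumptions, not part of PhysiClaw.

```python
def bbox_center(bbox):
    """Center of a [left, top, right, bottom] box, still in 0-1 fractions."""
    left, top, right, bottom = bbox
    return ((left + right) / 2, (top + bottom) / 2)


def bbox_to_pixels(bbox, width_px, height_px):
    """Scale a fractional bbox to integer pixels for one specific screen."""
    left, top, right, bottom = bbox
    return (
        round(left * width_px),
        round(top * height_px),
        round(right * width_px),
        round(bottom * height_px),
    )


clock = [0.41, 0.55, 0.49, 0.63]           # the "Clock" icon from the listing
center = bbox_center(clock)                # roughly (0.45, 0.59)
pixels = bbox_to_pixels(clock, 1000, 2000) # hypothetical 1000x2000 screen
```

The same `clock` box works unchanged for any other resolution; only the `width_px` and `height_px` arguments differ.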

The full vision pipeline is covered in How it sees.

In DECIDE, the agent (Claude, or any model you point PhysiClaw at; bring-your-own-model) reads that listing and chooses one box plus one gesture. To open the clock, it picks element 12 and calls tap. It never deals in motor coordinates or pixels; its entire vocabulary is “name a box, name an action.”

The real gesture set is small and physical: tap, double_tap, long_press, swipe, plus navigation moves like home_screen, go_back, and force_quit. There are 12 tools in all — see the MCP tools reference for the complete list.
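
The "name a box, name an action" vocabulary can be pictured as a tiny data structure. This is a sketch under assumptions: `Action`, `choose`, and the dictionary shape of a listing row are hypothetical, and the gesture set below contains only the element-targeted gestures named on this page, not the full tool list.

```python
from dataclasses import dataclass

# Element-targeted gestures named on this page (the real tool list is longer).
ELEMENT_GESTURES = {"tap", "double_tap", "long_press", "swipe"}


@dataclass(frozen=True)
class Action:
    element_id: int   # which box from the listing
    gesture: str      # what to do to it


def choose(listing, label, gesture="tap"):
    """Return an Action for the first element with this label, or None."""
    if gesture not in ELEMENT_GESTURES:
        raise ValueError(f"unknown gesture: {gesture}")
    for element in listing:
        if element["label"] == label:
            return Action(element["id"], gesture)
    return None


listing = [
    {"id": 12, "kind": "icon", "label": "Clock"},
    {"id": 13, "kind": "icon", "label": "Settings"},
]
action = choose(listing, "Clock")   # Action(element_id=12, gesture='tap')
```

Nothing in `Action` mentions pixels or millimetres; turning the chosen box into motion is left entirely to the next phase.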

MOVE & TOUCH is the half with no software analogue, so here’s the honest mechanical picture.

The orchestrator takes the center of the chosen bbox (still in 0–1 fractions) and converts it to real arm coordinates in millimetres using a calibrated affine map, a straight-line scale-plus-shift formula called pct_to_grbl. That map is exactly what calibration fits, by tapping a known grid and learning how screen fractions correspond to arm positions.
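
The sketch below shows one way such a map could be fitted and applied. It is an illustration under assumptions, not the real pct_to_grbl: it fits an independent scale and shift per axis by ordinary least squares, and the names `fit_axis`, `fit_map`, and `pct_to_mm_sketch` are hypothetical. A setup where the camera is rotated relative to the gantry would also need cross terms, which this sketch omits.

```python
def fit_axis(fracs, mms):
    """Least-squares fit of mm = scale * frac + shift for one axis."""
    n = len(fracs)
    mean_f = sum(fracs) / n
    mean_m = sum(mms) / n
    spread = sum((f - mean_f) ** 2 for f in fracs)
    if spread == 0:
        raise ValueError("calibration points must differ along this axis")
    scale = sum((f - mean_f) * (m - mean_m) for f, m in zip(fracs, mms)) / spread
    return scale, mean_m - scale * mean_f


def fit_map(samples):
    """samples: list of ((frac_x, frac_y), (x_mm, y_mm)) calibration taps."""
    sx, ox = fit_axis([s[0][0] for s in samples], [s[1][0] for s in samples])
    sy, oy = fit_axis([s[0][1] for s in samples], [s[1][1] for s in samples])
    return sx, ox, sy, oy


def pct_to_mm_sketch(frac_x, frac_y, params):
    """Apply the fitted scale-plus-shift map to one screen fraction."""
    sx, ox, sy, oy = params
    return sx * frac_x + ox, sy * frac_y + oy
```

Feeding it a grid of known taps yields four numbers, after which any bbox center can be converted with a single multiply and add per axis.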

The arm itself is two motions:

  • X/Y is a hand-built CoreXY gantry — a from-scratch frame where two NEMA 17 stepper motors share the load to move the stylus across the screen plane. (CoreXY just means both motors pull on belts together to produce X and Y.) This is not a ready-made consumer pen plotter; it’s a gantry you build.
  • Z (the touch) is a 12V push-pull solenoid — a fast electromagnet with a ~10mm stroke, not a Z servo and not a plotter pen. The capacitive stylus tip threads onto the solenoid core. A tap is a crisp solenoid down-and-up; a long_press just holds the tip down longer (~1.2s).

The solenoid uses a hit-and-hold scheme: it strikes hard to make firm contact, then drops to a softer current to hold the tip in place without buzzing. The exact strike/hold values live in Gestures.
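
A hit-and-hold drive can be described as a short list of (level, duration) phases. The sketch below only illustrates that shape: `hit_and_hold`, its parameter names, and the idea of expressing drive strength as a 0–1 level are assumptions, and it deliberately takes the strike and hold values as arguments rather than guessing the real ones.

```python
def hit_and_hold(contact_s, strike_level, strike_s, hold_level):
    """Split one touch of contact_s seconds into strike and hold phases.

    Returns a list of (level, seconds) pairs. The caller is expected to
    de-energize the solenoid once the last phase ends.
    """
    if contact_s <= 0:
        raise ValueError("contact time must be positive")
    if not 0 < hold_level <= strike_level:
        raise ValueError("hold level must be positive and no stronger than the strike")
    strike = min(strike_s, contact_s)        # a very short tap may be strike-only
    phases = [(strike_level, strike)]
    if contact_s > strike:
        phases.append((hold_level, contact_s - strike))
    return phases
```

With this shape, a tap and a long_press differ only in `contact_s`; the strike phase is identical.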

Two reliability behaviours are worth calling out:

  1. A busy lock. Every gesture holds a lock, so actions never overlap — the arm finishes one move before another can start.
  2. Auto-park. After each move the arm parks the stylus off-screen, so the next photo is unobstructed and the LOOK phase sees a clean frame.
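
Both behaviours fit in a few lines. This is a minimal sketch, assuming a blocking lock (a second gesture waits rather than being rejected) and a park step that runs even if the motion fails; `ArmSketch` and its `run` method are hypothetical names, and the log stands in for real motor commands.

```python
import threading


class ArmSketch:
    """Serializes gestures with a lock and parks after every one."""

    def __init__(self):
        self._busy = threading.Lock()
        self.log = []

    def run(self, name, motion=lambda: None):
        # Busy lock: a second caller blocks here until the first has parked.
        with self._busy:
            self.log.append(f"start {name}")
            try:
                motion()                  # stand-in for the real move and touch
            finally:
                # Auto-park: always clear the camera's view before releasing.
                self.log.append("park")
```

Because the park step sits inside the lock, the next LOOK can never run between a gesture ending and the stylus leaving the frame.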

In CHECK, PhysiClaw looks again and produces a fresh listing. The agent compares it to what it expected. There are three outcomes:

  • It changed as planned → move on to the next action.
  • Nothing changed → the tap missed or the box was wrong; re-aim and try again.
  • Something unexpected appeared (a popup, an ad, a permission prompt) → that’s simply the new state. The agent reads it and decides again — no brittle script to fall out of.
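
The three outcomes above can be sketched as a small classifier. This is a deliberately crude illustration: it compares only the sets of labels in the two listings, and `classify`, the outcome strings, and the idea of an `expected_label` are assumptions rather than PhysiClaw's actual comparison logic.

```python
def classify(before, after, expected_label):
    """Classify a CHECK result by comparing element labels.

    before / after: listings as lists of dicts with a "label" key.
    expected_label: a label the agent expects to see if the action worked.
    """
    before_labels = {element["label"] for element in before}
    after_labels = {element["label"] for element in after}
    if after_labels == before_labels:
        return "unchanged"      # tap missed or box was wrong: re-aim and retry
    if expected_label in after_labels:
        return "as_planned"     # move on to the next action
    return "unexpected"         # popup, ad, prompt: treat it as the new state
```

Only "unchanged" triggers a retry; an "unexpected" screen is not an error, just input for the next DECIDE.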

This “observe the result, then decide again” design is the whole reliability story. PhysiClaw recovers from surprises the same way a person would: by looking and trying once more. That is why one camera and one arm are enough; the verification comes from re-looking, not from more sensors.

LOOK actually has two settings, and the agent picks based on what it needs:

              peek                          screenshot
Source        the overhead camera           the phone’s own capture
Speed         ~4s                           ~12s
Sharpness     good enough for most targets  pixel-perfect
Side effects  none (non-mutating)           mutating: triggers the phone’s screenshot gesture, which apps can notice

peek is the default — fast and invisible. The agent only escalates to screenshot when a target is too small for the camera to resolve, or glare makes the frame unreadable.
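
The escalation rule can be written as a short decision function. This is a sketch, not the agent's real policy: `choose_look_mode`, the `frame_readable` flag, and the `min_side` threshold of 0.03 are all hypothetical, chosen only to show the shape of the rule.

```python
def choose_look_mode(target_bbox=None, frame_readable=True, min_side=0.03):
    """Pick "peek" unless the camera frame cannot do the job.

    target_bbox: optional [left, top, right, bottom] in 0-1 fractions.
    frame_readable: False when glare or blur makes the camera frame unusable.
    min_side: hypothetical smallest box side the camera can resolve.
    """
    if not frame_readable:
        return "screenshot"
    if target_bbox is not None:
        left, top, right, bottom = target_bbox
        if min(right - left, bottom - top) < min_side:
            return "screenshot"   # target too small for the overhead camera
    return "peek"                 # default: fast and non-mutating
```

Keeping peek as the fall-through return mirrors the default described above: the slower, mutating capture has to be justified by a specific condition.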