How it works
You already know the brain half from OpenClaw: an agent reads a screen, decides what to do, and acts. PhysiClaw runs the same idea, but the action is a physical tap on a real phone. This page walks one full tap end to end so you can see how a camera frame becomes arm motion — and why it’s built as a closed, verified loop rather than blind teleoperation.
Every action PhysiClaw takes is one turn of the same four-phase loop. Keeping the loop fixed is what makes the system reliable: each phase has one job, and each ends by looking at the result. There is no open-loop “drive the arm and hope” — the system never trusts a move until it has seen the screen change.
       ┌──────────────────────────────────────────────────────────┐
       │                                                          │
       ▼                                                          │
     LOOK ─────────────► DECIDE ─────────► MOVE & TOUCH ──────► CHECK
     overhead camera     agent picks a     arm drives stylus    look again:
     + on-device vision  box + a gesture   to the box, taps     did it change?
     boxes every         (tap / swipe /    then lifts away        │      │
     element             long-press)                           yes│    no│
                                                                  │      │
                                           next action ◄──────────┘      │
                                           retry / re-aim ◄──────────────┘

The whole system is just one camera and one arm. There is no second camera and no depth sensor; reliability comes from re-looking after every move, not from fancier hardware.
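As a rough sketch of that loop in code: the snippet below is illustrative only and is not PhysiClaw's orchestrator. The names `run_loop`, `look`, `decide`, and `act` are hypothetical stand-ins for the four phases, passed in as plain callables so the control flow is easy to see.

```python
from typing import Callable, Optional


def run_loop(look: Callable[[], list],
             decide: Callable[[list], Optional[dict]],
             act: Callable[[dict], None],
             max_turns: int = 20) -> int:
    """Run LOOK -> DECIDE -> MOVE & TOUCH -> CHECK until decide() returns None.

    All names here are hypothetical. Returns the number of actions performed.
    """
    listing = look()                  # LOOK: initial frame
    turns = 0
    while turns < max_turns:
        action = decide(listing)      # DECIDE: one box + one gesture, or None when done
        if action is None:
            break
        act(action)                   # MOVE & TOUCH: physical gesture
        listing = look()              # CHECK: the fresh look feeds the next decision
        turns += 1
    return turns
```

Note that CHECK is not a separate code path in this sketch: it is simply the next `look()`, whose result the agent compares against what it expected before deciding again.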
1. Look
PhysiClaw takes a photo of the screen with the overhead USB camera and runs it through on-device vision: RapidOCR for text and OmniParser V2 (a YOLO11m ONNX icon detector) for buttons, icons, and images. The output isn’t raw pixels; it’s a tidy listing where every element already has a label and a box:
| id | kind | label | bbox [left, top, right, bottom] | conf |
|---|---|---|---|---|
| 12 | icon | "Clock" | [0.41, 0.55, 0.49, 0.63] | 0.97 |
| 13 | icon | "Settings" | [0.51, 0.55, 0.59, 0.63] | 0.96 |
| 14 | text | "Wednesday 14" | [0.30, 0.08, 0.70, 0.13] | 0.99 |

A bbox (“bounding box”) is just a rectangle around one element, given as four numbers from
0 to 1 — fractions of the screen’s width and height, as [left, top, right, bottom].
[0.41, 0.55, 0.49, 0.63] means “start 41% across and 55% down, end 49% across and 63% down.”
Using fractions instead of pixels is what keeps everything portable: the same listing makes
sense on any phone size. That 0–1 fraction convention is load-bearing — the MOVE phase turns
it straight into millimetres.
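To make the fraction convention concrete, here is a small sketch that takes the "Clock" bbox from the listing and computes its center, plus the pixel rectangle it would cover on one particular screen. The helper names and the 1080×2400 resolution are assumptions chosen for illustration; they are not part of PhysiClaw.

```python
def bbox_center(bbox):
    """Center of a [left, top, right, bottom] bbox, still in 0-1 fractions."""
    left, top, right, bottom = bbox
    return ((left + right) / 2, (top + bottom) / 2)


def bbox_to_pixels(bbox, width_px, height_px):
    """Scale a fractional bbox to whole pixels for one specific screen size."""
    left, top, right, bottom = bbox
    return (round(left * width_px), round(top * height_px),
            round(right * width_px), round(bottom * height_px))


clock = [0.41, 0.55, 0.49, 0.63]          # element 12 from the listing above
print(bbox_center(clock))                  # roughly 45% across, 59% down
print(bbox_to_pixels(clock, 1080, 2400))   # hypothetical 1080x2400 screen
```

The same `clock` list works unchanged for any resolution; only the last step, which nothing downstream actually needs, is screen-specific.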
The full vision pipeline is covered in How it sees.
2. Decide
The agent, which can be Claude or any model you point PhysiClaw at (bring-your-own-model), reads that
listing and chooses one box plus one gesture. To open the clock, it picks element 12
and calls tap. It never deals in motor coordinates or pixels; its entire vocabulary is “name a
box, name an action.”
The real gesture set is small and physical: tap, double_tap, long_press, swipe, plus
navigation moves like home_screen, go_back, and force_quit. There are 12 tools in all —
see the MCP tools reference for the complete list.
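One way to picture the agent's output is as a tiny record: an element id plus a gesture name. The sketch below is a hypothetical data shape, not the MCP tool schema (see the tools reference for that); `Action`, `choose`, and the dict keys are invented for illustration, and only the gesture names come from this page.

```python
from dataclasses import dataclass

# Box-targeting gesture names mentioned on this page.
GESTURES = {"tap", "double_tap", "long_press", "swipe"}


@dataclass(frozen=True)
class Action:
    element_id: int   # which box from the listing
    gesture: str      # which physical gesture to perform on it


def choose(listing, label, gesture="tap"):
    """Pick the first element whose label matches; return None if absent."""
    if gesture not in GESTURES:
        raise ValueError(f"unknown gesture: {gesture}")
    for element in listing:
        if element["label"] == label:
            return Action(element["id"], gesture)
    return None


listing = [
    {"id": 12, "kind": "icon", "label": "Clock"},
    {"id": 13, "kind": "icon", "label": "Settings"},
]
print(choose(listing, "Clock"))
```

Nothing in `Action` mentions millimetres or pixels, which is the point: geometry is resolved later, in the MOVE phase.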
3. Move & touch
This is the half with no software analogue, so here’s the honest mechanical picture.
The orchestrator takes the center of the chosen bbox (still in 0–1 fractions) and
converts it to real arm coordinates in millimetres using a calibrated affine map — a
straight-line scale-plus-shift formula called pct_to_grbl. That map is exactly what
calibration fits, by tapping a known grid and learning how screen
fractions correspond to arm positions.
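A minimal sketch of what a scale-plus-shift map and its fit can look like, assuming the screen axes line up with the gantry axes so each axis is an independent straight line, `mm = scale * pct + offset`. This is not the actual `pct_to_grbl` code, whose signature isn't shown here; `fit_axis`, `make_map`, and the sample numbers are hypothetical.

```python
def fit_axis(pcts, mms):
    """Least-squares fit of mm = scale * pct + offset for one axis."""
    n = len(pcts)
    mean_p = sum(pcts) / n
    mean_m = sum(mms) / n
    var = sum((p - mean_p) ** 2 for p in pcts)
    if var == 0:
        raise ValueError("calibration points must differ along this axis")
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(pcts, mms))
    scale = cov / var
    return scale, mean_m - scale * mean_p


def make_map(samples):
    """samples: list of ((x_pct, y_pct), (x_mm, y_mm)) calibration taps."""
    sx, ox = fit_axis([s[0][0] for s in samples], [s[1][0] for s in samples])
    sy, oy = fit_axis([s[0][1] for s in samples], [s[1][1] for s in samples])

    def pct_to_mm(x_pct, y_pct):
        return (sx * x_pct + ox, sy * y_pct + oy)

    return pct_to_mm


# Made-up calibration grid: four screen corners and where the arm touched them.
samples = [((0.0, 0.0), (10.0, 20.0)), ((1.0, 0.0), (80.0, 20.0)),
           ((0.0, 1.0), (10.0, 170.0)), ((1.0, 1.0), (80.0, 170.0))]
to_mm = make_map(samples)
print(to_mm(0.45, 0.59))   # arm target for the center of the "Clock" bbox
```

A more general affine map would also carry cross terms to absorb a slightly rotated phone; this sketch leaves them out to keep the scale-plus-shift idea visible.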
The arm itself is two motions:
- X/Y is a hand-built CoreXY gantry — a from-scratch frame where two NEMA 17 stepper motors share the load to move the stylus across the screen plane. (CoreXY just means both motors pull on belts together to produce X and Y.) This is not a ready-made consumer pen plotter; it’s a gantry you build.
- Z (the touch) is a 12V push-pull solenoid — a fast electromagnet with a ~10mm stroke, not
a Z servo and not a plotter pen. The capacitive stylus tip threads onto the solenoid core. A
tap is a crisp solenoid down-and-up; a long_press just holds the tip down longer (~1.2s).
The solenoid uses a hit-and-hold scheme: it strikes hard to make firm contact, then drops to a softer current to hold the tip in place without buzzing. The exact strike/hold values live in Gestures.
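The hit-and-hold idea can be sketched as a short schedule of drive levels. Everything numeric below is a placeholder: the strike time and the two duty levels are invented for illustration (the real values are in Gestures), and `hit_and_hold` is a hypothetical helper that only builds the schedule, without touching any hardware.

```python
def hit_and_hold(press_s, strike_s=0.03, strike_duty=1.0, hold_duty=0.4):
    """Return [(duty, seconds), ...] drive segments for one press.

    strike_s, strike_duty and hold_duty are placeholder values, not
    PhysiClaw's real settings.
    """
    if press_s <= 0:
        raise ValueError("press duration must be positive")
    strike = min(strike_s, press_s)
    segments = [(strike_duty, strike)]            # hit: full drive for firm contact
    hold = press_s - strike
    if hold > 0:
        segments.append((hold_duty, hold))        # hold: softer drive, tip stays down
    return segments                               # drive is cut after the last segment


print(hit_and_hold(1.2))    # a long_press-length hold (~1.2s, from the text above)
print(hit_and_hold(0.02))   # a press shorter than the strike has no hold segment
```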
Two reliability behaviours are worth calling out:
- A busy lock. Every gesture holds a lock, so actions never overlap — the arm finishes one move before another can start.
- Auto-park. After each move the arm parks the stylus off-screen, so the next photo is unobstructed and the LOOK phase sees a clean frame.
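Both behaviours fit naturally around a single gesture call. The sketch below is a toy, assuming a mutex for the busy lock and a park step that always runs afterwards; the `Arm` class and its log are hypothetical and stand in for real motion commands.

```python
import threading


class Arm:
    """Toy arm that records what it would do instead of moving hardware."""

    def __init__(self):
        self._busy = threading.Lock()   # busy lock: one gesture at a time
        self.log = []

    def gesture(self, name, x_mm, y_mm):
        with self._busy:                # a second caller blocks here until we finish
            try:
                self.log.append(("move", x_mm, y_mm))
                self.log.append((name,))
            finally:
                # Auto-park: in this sketch the stylus leaves the screen even
                # if the gesture raised, so the next photo is unobstructed.
                self.log.append(("park",))


arm = Arm()
arm.gesture("tap", 41.5, 108.5)
print(arm.log)
```

Parking inside `finally` is a choice made for this sketch; the text above only says the arm parks after each move.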
4. Check
PhysiClaw looks again and produces a fresh listing. The agent compares it to what it expected. There are three outcomes:
- It changed as planned → move on to the next action.
- Nothing changed → the tap missed or the box was wrong; re-aim and try again.
- Something unexpected appeared (a popup, an ad, a permission prompt) → that’s simply the new state. The agent reads it and decides again — no brittle script to fall out of.
This “observe the result, then decide again” design is the whole reliability story. PhysiClaw recovers from surprises the same way a person would: by looking and trying once more. It’s the reason one camera and one arm are enough: the verification comes from re-looking, not from more sensors.
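The three outcomes can be sketched as a small classifier over two listings. This is a deliberate simplification: in PhysiClaw the agent itself makes this judgment, whereas the hypothetical `check` function below only compares sets of labels, and the outcome strings are invented for illustration.

```python
def check(before, after, expected_labels):
    """Classify a CHECK result by comparing element labels only."""
    before_labels = {el["label"] for el in before}
    after_labels = {el["label"] for el in after}
    if after_labels == before_labels:
        return "nothing_changed"        # tap missed or box was wrong: re-aim, retry
    if set(expected_labels) <= after_labels:
        return "changed_as_planned"     # move on to the next action
    return "unexpected"                 # popup, ad, prompt: read it and decide again


home = [{"label": "Clock"}, {"label": "Settings"}]
clock_app = [{"label": "Alarm"}, {"label": "Timer"}]
popup = [{"label": "Allow notifications?"}, {"label": "Allow"}]

print(check(home, clock_app, {"Alarm"}))
print(check(home, home, {"Alarm"}))
print(check(home, popup, {"Alarm"}))
```

None of the three branches is an error path; each one just hands a fresh listing back to DECIDE.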
Two ways to look: peek vs screenshot
LOOK actually has two settings, and the agent picks based on what it needs:
| | peek | screenshot |
|---|---|---|
| Source | the overhead camera | the phone’s own capture |
| Speed | ~4s | ~12s |
| Sharpness | good enough for most targets | pixel-perfect |
| Side effects | none (non-mutating) | mutating — triggers the phone’s screenshot gesture, which apps can notice |
peek is the default — fast and invisible. The agent only escalates to screenshot when a
target is too small for the camera to resolve, or glare makes the frame unreadable.
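The escalation rule can be sketched as a small function. The `min_side` threshold is a made-up number, since this page doesn't say how small is too small for the camera, and `pick_look_mode` and `frame_readable` are hypothetical names.

```python
def pick_look_mode(target_bbox=None, frame_readable=True, min_side=0.03):
    """Return "peek" by default; escalate to "screenshot" only when needed.

    min_side is a placeholder threshold, expressed as a 0-1 screen fraction.
    """
    if not frame_readable:                 # e.g. glare washed out the camera frame
        return "screenshot"
    if target_bbox is not None:
        left, top, right, bottom = target_bbox
        if min(right - left, bottom - top) < min_side:
            return "screenshot"            # target too small for the camera to resolve
    return "peek"


print(pick_look_mode([0.41, 0.55, 0.49, 0.63]))   # a normal icon-sized target
print(pick_look_mode([0.10, 0.10, 0.12, 0.11]))   # a tiny target
```

Keeping `peek` as the fall-through return reflects the trade-off in the table: the slower, mutating capture should be the exception.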