PhysiClaw is one Python server with hardware on one side and a brain on the other: the brain decides, the MCP server turns those decisions into camera frames and arm moves, and a small phone-side bridge lets the server reach into iOS for the few things a stylus can’t do. Everything below is what sits between “tap that button” and a stylus actually touching glass.
┌──────────────────────────────────────────────────────────────┐
│ BRAIN: sees a screen, picks a bbox + a gesture               │
│ external MCP client (Claude Desktop, …)  OR  built-in agent  │
└───────────────────────────────┬──────────────────────────────┘
                                │  MCP (12 tools: peek, tap, swipe, …)
                                ▼
┌──────────────────────────────────────────────────────────────┐
│ PhysiClaw MCP SERVER  (Python · core/)                       │
│ orchestrator · vision (OCR + icon detect) · calibration math │
└───────┬───────────────────────┬───────────────────┬──────────┘
        │ USB-UVC               │ USB-serial        │ HTTP over Wi-Fi (LAN)
        ▼                       ▼                   ▼
 ┌─────────────┐        ┌────────────────┐   ┌──────────────────────┐
 │ ONE overhead│        │ GRBL / FluidNC │   │ PHONE-SIDE BRIDGE    │
 │   camera    │        │   controller   │   │ • bridge web page    │
 └──────┬──────┘        │ X·Y gantry +   │   │   (Safari, /bridge)  │
        │ looks down    │  Z SOLENOID    │   │ • AssistiveTouch     │
        │               └───────┬────────┘   │ • 3 iOS Shortcuts    │
        │                       │ stylus tip └──────────┬───────────┘
        │                       ▼                       │ runs on
        └─────────────► ┌──────────────┐ ◄──────────────┘
                        │    PHONE     │
                        │  (unlocked)  │
                        └──────────────┘

The rest of this page walks each box, then the wires between them.
PhysiClaw exposes its hardware as a set of MCP tools — the Model Context Protocol, the same standard Claude Desktop and other clients already speak. Anything that talks MCP can drive the arm. There are two supported brains, and they use the identical tool surface:
External MCP client
Point Claude Desktop (or any MCP client) at the server, give it a goal, and it calls
peek / tap / swipe directly. You bring your own model and your own chat loop.
Built-in agent
PhysiClaw ships its own brain under agent/ — a native tool loop with memory, skills,
and cron/poll triggers, descended from OpenClaw. It calls the same 12 tools, just without
a desktop app in the loop.
Either way the brain’s whole job is the same: read an annotated screen, pick one bounding box and one gesture, and let the server do the rest. It never sees motor coordinates or pixels.
core/
A single Python process, structured as a few focused modules. The orchestrator
(core/orchestration/orchestrator.py) is the spine — it owns the arm, the camera, the
calibration, and one busy-lock so two tool calls can never move the arm at once. Every tool
acquires that lock, does its work, and parks the stylus off-screen on the way out.
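
The lock-then-park discipline can be pictured as a context manager. This is only a minimal sketch, not the code in core/orchestration/orchestrator.py: `Orchestrator`, `FakeArm`, `exclusive`, and `park` are hypothetical names, and failing fast when the arm is busy (rather than waiting) is an assumption made to keep the example short.

```python
import threading
from contextlib import contextmanager


class FakeArm:
    """Stand-in for the real arm driver; records moves instead of sending G-code."""

    def __init__(self):
        self.log = []

    def move_to(self, x_mm, y_mm):
        self.log.append(("move", x_mm, y_mm))

    def park(self):
        self.log.append(("park",))


class Orchestrator:
    """Hypothetical sketch: one lock guards the arm, and every tool parks on exit."""

    def __init__(self, arm):
        self.arm = arm
        self._busy = threading.Lock()

    @contextmanager
    def exclusive(self):
        # Assumption: refuse a second caller instead of queueing it.
        if not self._busy.acquire(blocking=False):
            raise RuntimeError("arm is busy")
        try:
            yield self.arm
        finally:
            # Park even if the tool raised, then release the lock.
            self.arm.park()
            self._busy.release()

    def tap(self, x_mm, y_mm):
        with self.exclusive() as arm:
            arm.move_to(x_mm, y_mm)
```

Putting the park call in `finally` means a tool that raises halfway through still leaves the stylus off-screen, so the next camera frame is not blocked by the arm.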
The pieces you’ll meet most often:
id · kind · label · bbox · conf. Covered in How it sees.

             overhead camera (ONE)
                       │  looks straight down, ~25 cm up
                       ▼
                 ┌───────────┐
  stylus tip ───►│   PHONE   │  X·Y gantry slides the tip across the glass;
  on a solenoid  │  screen   │  the solenoid drops it to register a touch
                 └───────────┘

There is one camera, mounted overhead and looking straight down at the screen: no second camera, no depth sensor. Reliability comes from re-looking after every move, not from more hardware. The arm is an X/Y gantry that positions the stylus over any point on the screen; the solenoid on the tip is the Z axis. A tap is a crisp down-and-up pulse (the firmware hits the coil hard, then holds at a lower duty so a long-press doesn’t drop); a long-press just holds the tip down longer. To the phone, that tip is indistinguishable from a fingertip, so any app works with no per-app setup.
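
To position the stylus, the server has to turn a bounding box in the camera frame into gantry millimetres. The calibration model PhysiClaw actually uses is not described on this page; the sketch below assumes a plain 2x3 affine map and taps the centre of the box. `BBox`, `Affine`, and the function names are hypothetical. A homography would be the natural alternative if the camera is not square to the screen.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Pixel-space box in the camera frame: left, top, right, bottom."""
    x0: float
    y0: float
    x1: float
    y1: float


# Hypothetical 2x3 affine matrix ((a, b, tx), (c, d, ty)) produced by calibration.
Affine = tuple[tuple[float, float, float], tuple[float, float, float]]


def bbox_center(box: BBox) -> tuple[float, float]:
    """Return the pixel at the middle of the box."""
    return ((box.x0 + box.x1) / 2.0, (box.y0 + box.y1) / 2.0)


def pixel_to_mm(px: float, py: float, m: Affine) -> tuple[float, float]:
    """Apply the affine map: scale/rotate/shear, then translate."""
    (a, b, tx), (c, d, ty) = m
    return (a * px + b * py + tx, c * px + d * py + ty)


def tap_target_mm(box: BBox, m: Affine) -> tuple[float, float]:
    """Gantry coordinates for a tap at the centre of the box."""
    cx, cy = bbox_center(box)
    return pixel_to_mm(cx, cy, m)
```

For example, with an assumed scale of 0.5 mm per pixel and an offset of (5, -2) mm, the box (100, 200, 300, 400) has its centre at pixel (200, 300) and maps to (105.0, 148.0) mm.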
A stylus can do almost everything a finger can — but not quite everything. It can’t ask iOS for a pixel-perfect screenshot, and it can’t type into the system clipboard. The bridge fills exactly those gaps, and nothing more. It has three parts, all set up once during Prepare the phone:
A bridge web page. The phone opens http://<server-ip>:8048/bridge in Safari (you scan
a QR code to pair — both devices just have to be on the same Wi-Fi). The page polls the
server four times a second and renders whatever the server tells it to: calibration targets
during setup, or queued clipboard text during a run.
AssistiveTouch. iOS’s floating accessibility button. The arm physically taps the AssistiveTouch button to fire system actions a stylus otherwise couldn’t reach — a single tap takes a screenshot, a double-tap and long-press run Shortcuts.
Three iOS Shortcuts. Bound to those AssistiveTouch gestures: take a screenshot, upload the latest screenshot to the server, and fetch queued text into the clipboard.
So when the brain calls screenshot, the server doesn’t capture pixels itself — it drives the
arm to tap AssistiveTouch, the iOS Shortcut grabs the phone’s own screenshot and POSTs the
bytes back over Wi-Fi, and the server runs vision on those pixels. The clipboard works the same
way: the server queues text, the arm long-presses AssistiveTouch, and the Shortcut pulls the text
in. The bridge is the only thing on the phone, and it only exists for screenshots, the clipboard,
and showing calibration markers.
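
The server side of that exchange amounts to a small piece of shared state: what the bridge page should render on its next poll, and which text is waiting for the Shortcut. The class below is a hypothetical in-memory sketch; `BridgeState`, its method names, and the dictionary shape returned by `poll` are all assumptions, and the real HTTP routes are not shown.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BridgeState:
    """Hypothetical state behind the bridge endpoints."""
    calibration_target: Optional[tuple[float, float]] = None
    _texts: deque = field(default_factory=deque)

    def queue_text(self, text: str) -> None:
        # Called by the server before the arm long-presses AssistiveTouch.
        self._texts.append(text)

    def poll(self) -> dict:
        # What the bridge page would receive on each of its polls.
        if self.calibration_target is not None:
            return {"mode": "calibrate", "target": self.calibration_target}
        if self._texts:
            return {"mode": "clipboard", "text": self._texts[0]}
        return {"mode": "idle"}

    def fetch_text(self) -> Optional[str]:
        # Called when the Shortcut pulls queued text; each item is handed out once.
        return self._texts.popleft() if self._texts else None
```

Polling only reads the state, while `fetch_text` consumes it, so a page refresh cannot lose queued text before the Shortcut has collected it.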
| Hop | Transport | Payload |
|---|---|---|
| Brain ↔ server | MCP | tool calls + image / listing results |
| Server ↔ camera | USB (UVC) | JPEG frames |
| Server ↔ controller | USB serial | G-code (GRBL dialect) |
| Server ↔ phone bridge | HTTP over Wi-Fi (LAN) | poll state, screenshot upload, clipboard text |
Three of these are obvious cables; the fourth is the one the old “untouched phone” story missed. The phone never connects to the server by USB; it reaches it over your local network, which is why both have to be on the same Wi-Fi network. With all four wired, one full action looks like:
brain: peek        → server crops camera frame, runs OCR+icons, returns listing
brain: tap [bbox]  → server maps bbox → arm mm, moves, fires solenoid, parks off-screen
brain: peek        → server re-photographs, returns a fresh listing to compare

That look → decide → move → check rhythm is the whole control loop, and because every step ends by looking, a surprise is just the next thing to react to.
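
From the brain's side, one turn of that loop might look like the sketch below. It runs against a stand-in rather than the real MCP tools: `FakeServer`, `Element`, and `step` are hypothetical, and only the field names (id, kind, label, bbox, conf) and the tool names peek and tap come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One row of a listing, using the fields named on this page."""
    id: int
    kind: str
    label: str
    bbox: tuple[int, int, int, int]
    conf: float


class FakeServer:
    """Stand-in for the MCP tools; tapping 'Next' swaps to a second screen."""

    def __init__(self):
        self.screens = [
            [Element(1, "text", "Next", (10, 10, 90, 40), 0.97)],
            [Element(1, "text", "Done", (10, 10, 90, 40), 0.95)],
        ]
        self.current = 0

    def peek(self):
        return list(self.screens[self.current])

    def tap(self, bbox):
        hit = [e for e in self.peek() if e.bbox == bbox]
        if hit and hit[0].label == "Next":
            self.current = 1


def step(server, wanted_label):
    """Look, decide, move, check. Returns True if the screen changed."""
    before = server.peek()  # look
    target = next((e for e in before if e.label == wanted_label), None)  # decide
    if target is None:
        return False
    server.tap(target.bbox)  # move
    after = server.peek()  # check
    return [e.label for e in after] != [e.label for e in before]
```

The second peek is what makes the loop tolerant of surprises: if the labels did not change, the brain simply sees the same listing again and can pick a different box or gesture.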