
System architecture

PhysiClaw is one Python server with hardware on one side and a brain on the other: the brain decides, the MCP server turns those decisions into camera frames and arm moves, and a small phone-side bridge lets the server reach into iOS for the few things a stylus can’t do. Everything below is what sits between “tap that button” and a stylus actually touching glass.

┌──────────────────────────────────────────────────────────────┐
│ BRAIN: sees a screen, picks a bbox + a gesture               │
│ external MCP client (Claude Desktop, …) OR built-in agent    │
└───────────────────────────────┬──────────────────────────────┘
                                │ MCP (12 tools: peek, tap, swipe, …)
┌──────────────────────────────────────────────────────────────┐
│ PhysiClaw MCP SERVER (Python · core/)                        │
│ orchestrator · vision (OCR + icon detect) · calibration math │
└───────┬───────────────────────┬───────────────────┬──────────┘
        │ USB-UVC               │ USB-serial        │ HTTP over Wi-Fi (LAN)
        ▼                       ▼                   ▼
 ┌─────────────┐        ┌────────────────┐   ┌──────────────────────┐
 │ ONE overhead│        │ GRBL / FluidNC │   │ PHONE-SIDE BRIDGE    │
 │ camera      │        │ controller     │   │ • bridge web page    │
 └──────┬──────┘        │ X·Y gantry +   │   │   (Safari, /bridge)  │
        │ looks down    │ Z SOLENOID     │   │ • AssistiveTouch     │
        │               └───────┬────────┘   │ • 3 iOS Shortcuts    │
        │                       │ stylus tip └──────────┬───────────┘
        │                       ▼                       │ runs on
        └─────────────► ┌──────────────┐ ◄──────────────┘
                        │    PHONE     │
                        │  (unlocked)  │
                        └──────────────┘

The rest of this page walks each box, then the wires between them.

PhysiClaw exposes its hardware as a set of MCP tools. MCP is the Model Context Protocol, the same standard Claude Desktop and other clients already speak, so anything that talks MCP can drive the arm. There are two supported brains, and they use the identical tool surface:

External MCP client

Point Claude Desktop (or any MCP client) at the server, give it a goal, and it calls peek / tap / swipe directly. You bring your own model and your own chat loop.

Built-in agent

PhysiClaw ships its own brain under agent/ — a native tool loop with memory, skills, and cron/poll triggers, descended from OpenClaw. It calls the same 12 tools, just without a desktop app in the loop.

Either way the brain’s whole job is the same: read an annotated screen, pick one bounding box and one gesture, and let the server do the rest. It never has to deal with motor coordinates or camera pixel coordinates.
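As a rough illustration of that job, the sketch below models one listing entry with the fields this page names (id, kind, label, bbox, conf) and picks a target by label. The `Element` class, the `pick` helper, the bbox layout (x0, y0, x1, y1 as screen fractions) and the shape of the returned decision are all assumptions made for the example, not PhysiClaw's actual types.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One entry of an element listing (hypothetical shape)."""
    id: int
    kind: str                                  # e.g. "text" or "icon"
    label: str
    bbox: tuple[float, float, float, float]    # assumed (x0, y0, x1, y1), screen fractions
    conf: float


def pick(listing: list[Element], label: str, min_conf: float = 0.5) -> dict | None:
    """Return a tap decision for the most confident element matching `label`."""
    matches = [
        e for e in listing
        if e.label.lower() == label.lower() and e.conf >= min_conf
    ]
    if not matches:
        return None  # nothing to act on; a brain would look again instead
    best = max(matches, key=lambda e: e.conf)
    return {"gesture": "tap", "bbox": best.bbox}


listing = [
    Element(1, "text", "Settings", (0.10, 0.20, 0.40, 0.25), 0.97),
    Element(2, "icon", "Back", (0.02, 0.05, 0.08, 0.09), 0.81),
]
decision = pick(listing, "settings")
```

The decision carries only a gesture and a bounding box; translating that box into arm motion is left entirely to the server.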

A single Python process, structured as a few focused modules. The orchestrator (core/orchestration/orchestrator.py) is the spine — it owns the arm, the camera, the calibration, and one busy-lock so two tool calls can never move the arm at once. Every tool acquires that lock, does its work, and parks the stylus off-screen on the way out.
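The lock-then-park pattern can be sketched as a context manager. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, and failing fast when the arm is busy is one possible policy (blocking until the lock frees would be another); the page does not say which one the real orchestrator uses.

```python
import threading
from contextlib import contextmanager


class Orchestrator:
    """Hypothetical stand-in: one lock guards every arm-moving tool call."""

    def __init__(self) -> None:
        self._busy = threading.Lock()
        self.log = []  # records actions in order, for illustration only

    def _park(self) -> None:
        # Placeholder for moving the stylus off-screen.
        self.log.append("park")

    @contextmanager
    def exclusive(self):
        # Assumption: refuse a second caller instead of queueing it.
        if not self._busy.acquire(blocking=False):
            raise RuntimeError("arm is busy")
        try:
            yield
        finally:
            self._park()            # always park, even if the tool raised
            self._busy.release()

    def tap(self, bbox) -> None:
        with self.exclusive():
            self.log.append(f"tap {bbox}")


orc = Orchestrator()
orc.tap((0.1, 0.2, 0.3, 0.4))
```

Putting the park step in `finally` means a tool that fails halfway still leaves the stylus out of the camera's view for the next look.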

  • src/physiclaw/
    • core/: the robot, with hardware exposed as MCP tools
      • server/: FastMCP instance + the 12 tool definitions
      • orchestration/: the orchestrator (lifecycle, lock, gestures)
      • vision/: OCR + ONNX icon detection → element listing
      • calibration/: the affine transforms learned during setup
      • hardware/: arm, camera, GRBL/FluidNC, solenoid
      • bridge/: LAN bridge (phone page, screenshot upload, clipboard)
    • agent/: the built-in brain (one MCP client among many)

The pieces you’ll meet most often:

  • vision — turns a camera frame (or a phone screenshot) into a tidy element listing: every button, icon, and line of text as id · kind · label · bbox · conf. Covered in How it sees.
  • calibration — holds the two affine maps that connect the three coordinate systems (screen fractions, camera pixels, arm millimetres). Covered in Calibration math.
  • hardware — wraps the GRBL/FluidNC controller. GRBL is the open-source G-code firmware that runs the X/Y gantry; FluidNC is the ESP32 variant PhysiClaw flashes. Z isn’t a motor — it’s a solenoid, a fast electromagnet that snaps the stylus tip down and lets it spring back up.
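
To make the calibration bullet concrete, here is a small sketch of chaining two affine maps to turn a bounding box into an arm position. The direction of each map (screen fractions to camera pixels, then camera pixels to arm millimetres), the six-number representation, and every coefficient below are assumptions chosen for illustration; the real values are learned during setup.

```python
# A 2D affine map stored as (a, b, tx, c, d, ty):
#   x' = a*x + b*y + tx
#   y' = c*x + d*y + ty
Affine = tuple[float, float, float, float, float, float]


def apply(m: Affine, x: float, y: float) -> tuple[float, float]:
    a, b, tx, c, d, ty = m
    return (a * x + b * y + tx, c * x + d * y + ty)


def bbox_center(bbox: tuple[float, float, float, float]) -> tuple[float, float]:
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)


# Made-up calibration results, for illustration only.
SCREEN_TO_CAMERA: Affine = (400.0, 0.0, 120.0, 0.0, 800.0, 60.0)   # fractions -> pixels
CAMERA_TO_ARM: Affine = (0.25, 0.0, -10.0, 0.0, 0.25, 5.0)         # pixels -> millimetres


def bbox_to_arm_mm(bbox: tuple[float, float, float, float]) -> tuple[float, float]:
    """Map the centre of a screen-fraction bbox to arm coordinates in mm."""
    u, v = bbox_center(bbox)
    px, py = apply(SCREEN_TO_CAMERA, u, v)
    return apply(CAMERA_TO_ARM, px, py)


target = bbox_to_arm_mm((0.0, 0.0, 1.0, 1.0))  # centre of the screen
```

With these example coefficients the screen centre lands at camera pixel (320, 460) and arm position (70 mm, 120 mm).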

           overhead camera (ONE)
                     │ looks straight down, ~25 cm up
               ┌───────────┐
stylus tip ───►│   PHONE   │  X·Y gantry slides the tip across the glass;
on a solenoid  │  screen   │  the solenoid drops it to register a touch
               └───────────┘

There is one camera, mounted overhead looking straight down at the screen: no second camera, no depth sensor. Reliability comes from re-looking after every move, not from more hardware. The arm is an X/Y gantry that positions the stylus over any point on the screen; the solenoid on the tip is the Z axis. A tap is a crisp down-and-up pulse: the firmware drives the coil hard to strike, then holds it at a lower duty cycle so the tip stays down for as long as the press lasts. A long-press just holds the tip down longer. To the phone, that tip is indistinguishable from a fingertip, so any app works with no per-app setup.
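
For a feel of what a tap might look like on the serial line, the sketch below builds a short G-code sequence. `G90`, `G0`, `G4 P` (dwell, in seconds on GRBL), `M3 S` and `M5` are standard GRBL commands, but driving the solenoid from the spindle output is purely an assumption made here; this page does not say how PhysiClaw wires or commands the solenoid, and the strike-then-hold duty change it describes happens in firmware, so it is not modelled. The function name and defaults are hypothetical.

```python
def tap_gcode(x_mm: float, y_mm: float, hold_s: float = 0.05) -> list[str]:
    """Build an illustrative G-code sequence for one tap at (x_mm, y_mm)."""
    return [
        "G90",                            # absolute positioning
        f"G0 X{x_mm:.2f} Y{y_mm:.2f}",    # rapid move over the target
        "M3 S1000",                       # assumption: solenoid on spindle output, energise
        f"G4 P{hold_s:.2f}",              # dwell while the tip is down (seconds)
        "M5",                             # de-energise; the tip springs back up
    ]


tap = tap_gcode(12.5, 40)
long_press = tap_gcode(12.5, 40, hold_s=0.8)  # same sequence, longer dwell
```

In this model a long-press differs from a tap only in the dwell time, which matches the description above.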

A stylus can do almost everything a finger can, but not quite everything. It can’t ask iOS for a pixel-perfect screenshot, and it can’t put text on the system clipboard. The bridge fills those two gaps and, during setup, displays calibration targets; it does nothing else. It has three parts, all set up once during Prepare the phone:

  1. A bridge web page. The phone opens http://<server-ip>:8048/bridge in Safari (you scan a QR code to pair — both devices just have to be on the same Wi-Fi). The page polls the server four times a second and renders whatever the server tells it to: calibration targets during setup, or queued clipboard text during a run.

  2. AssistiveTouch. iOS’s floating accessibility button. The arm physically taps the AssistiveTouch button to fire system actions a stylus otherwise couldn’t reach — a single tap takes a screenshot, a double-tap and long-press run Shortcuts.

  3. Three iOS Shortcuts. Bound to those AssistiveTouch gestures: take a screenshot, upload the latest screenshot to the server, and fetch queued text into the clipboard.

So when the brain calls screenshot, the server doesn’t capture pixels itself — it drives the arm to tap AssistiveTouch, the iOS Shortcut grabs the phone’s own screenshot and POSTs the bytes back over Wi-Fi, and the server runs vision on those pixels. The clipboard works the same way: the server queues text, the arm long-presses AssistiveTouch, and the Shortcut pulls the text in. The bridge is the only thing on the phone, and it only exists for screenshots, the clipboard, and showing calibration markers.
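
The server side of that exchange can be pictured as a small piece of shared state that the phone page polls and the Shortcuts read from or write to. The sketch below is an in-memory model only: the class, its method names, the `mode` values and the poll payload are hypothetical, and no real HTTP routes or iOS Shortcut behaviour are implied.

```python
from collections import deque
from typing import Optional


class BridgeState:
    """Hypothetical in-memory state behind the phone-side bridge."""

    def __init__(self) -> None:
        self.mode = "idle"                  # e.g. "idle" or "calibration"
        self._clipboard = deque()           # text waiting for the phone to fetch
        self.last_screenshot: Optional[bytes] = None

    def queue_clipboard(self, text: str) -> None:
        """Server side: queue text before the arm triggers the Shortcut."""
        self._clipboard.append(text)

    def poll(self) -> dict:
        """What the bridge page would see on each poll."""
        return {"mode": self.mode, "clipboard_pending": len(self._clipboard)}

    def fetch_clipboard(self) -> Optional[str]:
        """Phone side: pull the oldest queued text, if any."""
        return self._clipboard.popleft() if self._clipboard else None

    def upload_screenshot(self, data: bytes) -> int:
        """Phone side: hand over screenshot bytes; returns the size received."""
        self.last_screenshot = data
        return len(data)


state = BridgeState()
state.queue_clipboard("hello")
pending = state.poll()
text = state.fetch_clipboard()
```

Keeping the queue on the server means the phone only ever pulls; nothing has to be pushed to the device.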

| Hop | Transport | Payload |
| --- | --- | --- |
| Brain ↔ server | MCP | tool calls + image / listing results |
| Server ↔ camera | USB (UVC) | JPEG frames |
| Server ↔ controller | USB serial | G-code (GRBL dialect) |
| Server ↔ phone bridge | HTTP over Wi-Fi (LAN) | poll state, screenshot upload, clipboard text |

Three of these are the obvious links; the fourth is the one the old “untouched phone” story missed. The phone never connects to the server by USB. It reaches the server over your local network, which is why both have to be on the same Wi-Fi network. With all four wired, one full action looks like:

brain: peek → server crops camera frame, runs OCR+icons, returns listing
brain: tap [bbox] → server maps bbox → arm mm, moves, fires solenoid, parks off-screen
brain: peek → server re-photographs, returns a fresh listing to compare

That look → decide → move → check rhythm is the whole control loop — and because every step ends by looking, a surprise is just the next thing to react to.
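
The rhythm can be sketched as a short loop. Everything here is a stand-in: `run`, `FakePhone` and the `decide` callback are hypothetical names, the listing is reduced to a list of labels, and the fake phone simply advances one screen per tap so the loop has something to react to.

```python
from typing import Callable, Optional

BBox = tuple[float, float, float, float]


def run(peek: Callable[[], list],
        tap: Callable[[BBox], None],
        decide: Callable[[list], Optional[BBox]],
        max_steps: int = 10) -> int:
    """Look, decide, move, check; returns the number of moves made."""
    steps = 0
    listing = peek()                    # look
    while steps < max_steps:
        bbox = decide(listing)          # decide
        if bbox is None:
            break                       # nothing left to do
        tap(bbox)                       # move
        steps += 1
        listing = peek()                # check: always re-look after a move
    return steps


class FakePhone:
    """Stand-in for the real tools: each tap advances to the next screen."""

    def __init__(self) -> None:
        self.screens = [["Next"], ["Next"], ["Done"]]
        self.index = 0
        self.taps = []

    def peek(self) -> list:
        return list(self.screens[self.index])

    def tap(self, bbox: BBox) -> None:
        self.taps.append(bbox)
        self.index = min(self.index + 1, len(self.screens) - 1)


phone = FakePhone()
moves = run(
    phone.peek,
    phone.tap,
    lambda listing: (0.4, 0.8, 0.6, 0.9) if "Next" in listing else None,
)
```

Because the loop only ever acts on the most recent listing, an unexpected screen changes what `decide` returns rather than breaking the loop.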