MCP tools

PhysiClaw exposes 12 tools over the Model Context Protocol, and the agent works the same way through every one: take a photo, pick a bbox (a rectangle [left, top, right, bottom] in 0–1 screen fractions), and do something there. It never deals in pixels or motor coordinates; it names a box and an action, and PhysiClaw turns that into arm motion.

The tools are grouped here in wire order — the order the agent sees them — because the most-reached-for tools (peek, tap) sit first on purpose. Each tool’s first line is its one-job summary; the rest is parameters, timing, and the “what to do next” hint PhysiClaw appends to every result to keep the loop moving.

The agent calls one of the two observation tools (peek or screenshot) before an action to ground a target’s bbox, and again afterward to check that the screen changed. Both return the same shape, [Image, listing]: a JPEG with icon boxes drawn, plus a plain-text listing with one row per element:

id [kind] "label" [left,top,right,bottom] conf
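A minimal sketch of how a client might parse one listing row into a structured record. The row grammar follows the template above; the exact id format, the set of kinds, and the numeric precision are assumptions, and the sample row is made up for illustration. `Element` and `parse_row` are hypothetical names, not part of PhysiClaw.

```python
import re
from typing import NamedTuple, Tuple


class Element(NamedTuple):
    id: str
    kind: str
    label: str
    bbox: Tuple[float, float, float, float]  # left, top, right, bottom in 0-1
    conf: float


# Assumed grammar: id [kind] "label" [left,top,right,bottom] conf
ROW = re.compile(
    r'^(?P<id>\S+)\s+\[(?P<kind>[^\]]+)\]\s+"(?P<label>.*)"\s+'
    r'\[(?P<bbox>[^\]]+)\]\s+(?P<conf>\S+)$'
)


def parse_row(line: str) -> Element:
    """Parse one listing row; raise ValueError if it does not match."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized listing row: {line!r}")
    parts = [float(v) for v in m["bbox"].split(",")]
    if len(parts) != 4:
        raise ValueError(f"bbox needs four numbers: {line!r}")
    left, top, right, bottom = parts
    return Element(m["id"], m["kind"], m["label"],
                   (left, top, right, bottom), float(m["conf"]))


# Hypothetical sample row, not real PhysiClaw output.
row = parse_row('12 [icon] "Settings" [0.10,0.20,0.30,0.40] 0.93')
print(row.label, row.bbox)
```

Keeping the bbox as 0–1 fractions means the parsed value can be passed straight back to an action tool without conversion.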

The default view — a frame from the overhead camera, ~4s, non-mutating.

  • Parameters: none.
  • When: before any tap/swipe to find the target; after, to verify. An identical listing across two peeks means the screen didn’t change.

The phone’s own pixel-perfect capture, ~12s, mutating — the escalation tool.

  • Parameters: none.
  • Why mutating: it fires the iOS screenshot gesture, which apps can observe and react to (share sheets, “similar items” panels, watermarked frames). Always peek after a screenshot before tapping — the screen state may have shifted.
  • Use only when peek can’t do the job: the target is too small for the camera to resolve, glare or motion blur makes the frame unreadable, or there’s fine print to read.

peek is three times faster and leaves no trace, so it’s the default; screenshot is the fallback. Why the split exists is covered in How it sees.

Every action tool takes a bbox and operates at its center. The arm drives the stylus there; the solenoid drops the tip, and the gesture’s duration decides whether it reads as a tap, a double-tap, or a hold.
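To make "operates at its center" concrete, here is a small sketch of the arithmetic on a fractional bbox. PhysiClaw does this internally; `bbox_center` is a hypothetical helper for illustration, and the validation rules are assumptions.

```python
from typing import Sequence, Tuple


def bbox_center(bbox: Sequence[float]) -> Tuple[float, float]:
    """Return the (x, y) center of a [left, top, right, bottom] bbox in 0-1 fractions."""
    if len(bbox) != 4:
        raise ValueError("bbox must be [left, top, right, bottom]")
    left, top, right, bottom = bbox
    if not all(0.0 <= v <= 1.0 for v in bbox):
        raise ValueError("bbox values must be screen fractions in 0-1")
    if left >= right or top >= bottom:
        raise ValueError("bbox needs left < right and top < bottom")
    return ((left + right) / 2, (top + bottom) / 2)


print(bbox_center([0.25, 0.5, 0.75, 1.0]))  # (0.5, 0.75)
```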

Tap once at the bbox center — buttons, links, list items, dismissing dialogs.

  • Next: peek to verify and ground the next target. An identical listing means the tap missed (retry once) or the bbox was wrong (pick a different element from the new peek).
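The verify-and-retry rule above can be sketched as a small loop. The `peek` and `tap` callables are hypothetical stand-ins for the real MCP tool calls; here `peek` is assumed to return only the listing text, whereas the real tool also returns an image.

```python
from typing import Callable, Sequence

BBox = Sequence[float]


def tap_and_verify(
    peek: Callable[[], str],
    tap: Callable[[BBox], None],
    bbox: BBox,
    retries: int = 1,
) -> bool:
    """Tap bbox, then peek; retry while the listing is unchanged.

    Returns True once the listing differs from the one taken before the
    first tap, or False if it is still identical after all attempts
    (a hint that the bbox was wrong and a different element is needed).
    """
    before = peek()
    for _ in range(1 + retries):
        tap(bbox)
        if peek() != before:
            return True
    return False
```

A False result is the cue to pick a different element from a fresh peek rather than tapping the same box again.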

Two quick taps ~150ms apart at the bbox center — for zooming or selecting a word.

  • Use for zooming maps / photos / web pages, or selecting a word in editable text.
  • For buttons, use tap.

Press and hold ~1.2s at the bbox center — context menus, edit mode, the paste popover.

  • Use for opening context menus, entering icon-rearrange edit mode, or triggering the paste popover after send_to_clipboard.
  • Next: always peek — you’ll need a fresh bbox from the popover that just appeared (Paste / Copy / etc.) before you can tap it.

swipe(bbox, direction, size="m", speed="medium")

Slide the stylus across bbox in direction — for scrolling, paging, dismissing cards, and opening Control / Notification Center.

Parameter   Values                  Notes
bbox        [l,t,r,b] in 0–1        the gesture origin
direction   up, down, left, right   the stylus motion, not the page motion
size        s, m, l, xl, xxl        stroke length, default m
speed       slow, medium, fast      stylus velocity, default medium

The size scale maps to a real stroke length:

size   Length   Use
s      ≈ 1 cm   small nudge
m      ≈ 2 cm   most scrolls
l      ≈ 4 cm   long scroll
xl     ≈ 6 cm   page-sized scroll
xxl    ≈ 8 cm   full-screen (Control / Notification Center)

  • Next: peek to verify the page scrolled and plan the next move.
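As a sketch of how a client might assemble and sanity-check swipe arguments, the snippet below encodes the two tables above. The helper names (`swipe_args`, `smallest_size_for`, `REVEAL_TO_DIRECTION`) are hypothetical, and the returned dict layout is an assumption that mirrors the parameter names. The reveal mapping reflects the note that direction is the stylus motion: to bring content below into view, the stylus moves up.

```python
STROKE_CM = {"s": 1, "m": 2, "l": 4, "xl": 6, "xxl": 8}  # approximate lengths
DIRECTIONS = ("up", "down", "left", "right")
SPEEDS = ("slow", "medium", "fast")

# The stylus moves opposite to the content you want to bring into view.
REVEAL_TO_DIRECTION = {"below": "up", "above": "down",
                       "right": "left", "left": "right"}


def swipe_args(bbox, direction, size="m", speed="medium"):
    """Validate swipe parameters and return them as a plain dict."""
    if direction not in DIRECTIONS:
        raise ValueError(f"direction must be one of {DIRECTIONS}")
    if size not in STROKE_CM:
        raise ValueError(f"size must be one of {tuple(STROKE_CM)}")
    if speed not in SPEEDS:
        raise ValueError(f"speed must be one of {SPEEDS}")
    return {"bbox": list(bbox), "direction": direction,
            "size": size, "speed": speed}


def smallest_size_for(cm):
    """Pick the smallest size whose stroke covers cm, capped at xxl."""
    for name, length in STROKE_CM.items():  # dict keeps ascending order
        if length >= cm:
            return name
    return "xxl"


# Scroll further down a list: reveal content below, so the stylus goes up.
args = swipe_args([0.1, 0.4, 0.9, 0.8], REVEAL_TO_DIRECTION["below"],
                  size=smallest_size_for(3))
print(args["direction"], args["size"])  # up l
```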

These take no bbox — they issue a fixed iPhone system gesture.

Return to the home screen via the swipe-up-from-bottom gesture — a known launch pad.

  • Use to start a fresh task or recover from getting lost in app navigation.
  • Next: peek to plan your next tap on the home-screen icons.

Pop back one screen via the iPhone left-edge swipe.

  • Works in apps with a navigation stack (most do). The same screen after peek means either the gesture didn’t register (retry once) or this screen has no back action (modals, root tabs, lock screen) — try home_screen and re-enter, or tap an in-screen < / Back button.
  • Trap: full-screen image viewers (product images, Messages / WeChat photos) reclaim left/right swipes for image navigation, so the edge-swipe won’t pop. Close via the in-viewer X / Done button instead.

Force-quit the current app via the app-switcher gesture, ~7s — a hard reset.

  • Use after go_back hasn’t reached the right entry point: popups won’t dismiss, the back stack loops, or the wrong page keeps returning.
  • Lands on the home screen. Next: reopen the app fresh from there.

Unlock the phone with passcode 111111, ~12s.

It wakes the screen, swipes up, waits for Face ID to fail, OCRs the keypad, and taps each digit. The passcode is hardcoded to 111111 — a throwaway tool-phone code, so a real password never leaks through git or logs.

  • Next: peek to confirm you’re on the home screen and plan the next tap.

Copy text to the phone’s clipboard — pasting is far faster than tapping out each key.

The standard flow:

  1. send_to_clipboard(text)

  2. long_press(field_bbox) — opens the paste popover.

  3. tap the Paste (or 粘贴) button that appears.

Fall back to the on-screen keyboard only if the field rejects paste (passcode fields, some search bars).
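The paste flow can be sketched end to end as follows. All four tool callables are hypothetical stand-ins for the real MCP calls, `peek` is assumed to return only the listing text, and the row pattern assumes the listing format described earlier. A peek sits between the long press and the tap because the Paste button's bbox is only known once the popover is on screen.

```python
import re
from typing import Callable, List, Optional, Sequence

BBox = List[float]
PASTE_LABELS = ("Paste", "粘贴")

# Assumed row shape: ... "label" [left,top,right,bottom] ...
ROW = re.compile(r'"(?P<label>.*)"\s+\[(?P<bbox>[^\]]+)\]')


def find_bbox(listing: str, labels: Sequence[str]) -> Optional[BBox]:
    """Return the bbox of the first row whose label is in labels."""
    for line in listing.splitlines():
        m = ROW.search(line)
        if m and m["label"] in labels:
            return [float(v) for v in m["bbox"].split(",")]
    return None


def paste_text(
    text: str,
    field_bbox: BBox,
    *,
    send_to_clipboard: Callable[[str], None],
    long_press: Callable[[BBox], None],
    peek: Callable[[], str],
    tap: Callable[[BBox], None],
) -> bool:
    """Run clipboard -> long press -> peek -> tap Paste.

    Returns False if no Paste button shows up, which is the cue to
    fall back to the on-screen keyboard.
    """
    send_to_clipboard(text)
    long_press(field_bbox)
    target = find_bbox(peek(), PASTE_LABELS)
    if target is None:
        return False
    tap(target)
    return True
```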

sequence(step1, step2?, step3?, step4?, step5?)

Run up to 5 actions in one call — saves turns on deterministic flows where intermediate observations would add nothing (opening an app via tap → tap → tap, or paste + send a message).

Each step is a dict with two fields:

  • tool_name — one of tap / double_tap / long_press / swipe / send_to_clipboard.
  • arg — that tool’s argument: a bbox for tap/double_tap/long_press, {bbox, direction, size?, speed?} for swipe, a string for send_to_clipboard.
  • Next: peek to verify the final state and plan the next move.
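A minimal sketch of assembling the arguments for a sequence call. The step fields (tool_name, arg), the allowed tools, and the five-step limit come from the description above; `build_sequence` itself is a hypothetical helper, the keyword layout (step1 through step5) is inferred from the signature, and the bboxes are made-up values.

```python
from typing import Any, Dict, List, Tuple

ALLOWED_TOOLS = ("tap", "double_tap", "long_press", "swipe", "send_to_clipboard")
MAX_STEPS = 5


def build_sequence(steps: List[Tuple[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Turn (tool_name, arg) pairs into step1..stepN keyword arguments."""
    if not 1 <= len(steps) <= MAX_STEPS:
        raise ValueError(f"sequence takes 1 to {MAX_STEPS} steps")
    call: Dict[str, Dict[str, Any]] = {}
    for i, (tool_name, arg) in enumerate(steps, start=1):
        if tool_name not in ALLOWED_TOOLS:
            raise ValueError(f"step{i}: {tool_name!r} is not allowed in a sequence")
        call[f"step{i}"] = {"tool_name": tool_name, "arg": arg}
    return call


# Hypothetical tap -> tap -> swipe flow with made-up bboxes.
call = build_sequence([
    ("tap", [0.10, 0.20, 0.25, 0.30]),
    ("tap", [0.40, 0.85, 0.60, 0.95]),
    ("swipe", {"bbox": [0.1, 0.4, 0.9, 0.8], "direction": "up"}),
])
print(sorted(call))  # ['step1', 'step2', 'step3']
```

Observation tools are rejected by design: a sequence only fits flows where the bboxes are known up front, so any step that needs a fresh peek has to end the sequence.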