Interact with iOS simulator or Android emulator/device using snapshot-based coordinates. Uses accessibility tree snapshots for precise element targeting, with screenshot verification as fallback. Use when navigating the app on a simulator/emulator.
ALL simulator/emulator interaction MUST go through agent-device commands. No exceptions, no fallbacks.
Banned tools/commands (never use these for device interaction, even if they seem easier):
- adb for UI interaction: no input tap, input swipe, input text, screencap, etc. Allowed exceptions: adb devices, adb wait-for-device, adb install, adb reverse, adb shell getprop, adb shell am, adb shell pm, adb shell screenrecord + adb pull (see "Android Recording Workaround"), adb shell kill/adb shell pidof (for stopping screenrecord)
- mobile_click_on_screen_at_coordinates, mobile_take_screenshot, mobile_list_elements_on_screen, mobile_swipe_on_screen, mobile_type_keys, mobile_press_button, mobile_long_press_on_screen_at_coordinates, or any other mobile_* tool
- xcrun simctl: no simctl io screenshot, simctl openurl, etc. Allowed exceptions: simctl list devices, simctl get_app_container, simctl install, simctl launch
- osascript / AppleScript for simulator control

Why: agent-device manages sessions, coordinate translation, and daemon state. Mixing in other tools causes session conflicts, stale state, and unreliable behavior.
Avoid wait, sleep, and any other waiting mechanism. The app is fast: transitions and network responses complete quickly. The only exception is app launch (open), which may take a few seconds to fully load.

Control the iOS simulator or Android emulator using agent-device. The primary interaction method is snapshot-based: take an accessibility tree snapshot, find the target element's rect, compute its center, and press.
Do NOT use these as standalone subcommands:
click, find, fill, focus, get text, get attrs, scrollintoview, is, wait text, wait @ref, diff snapshot

Allowed agent-device subcommands: snapshot, screenshot, press, type, scroll, swipe, longpress, back, home, app-switcher, wait <ms>, open, close, keyboard dismiss, appstate, clipboard, alert, settings, record, devices, apps, batch, push, logs, network.
Always target a specific device by name using --device "<name>" to avoid launching the wrong simulator/emulator. At the start of each session:
1. Run agent-device devices to list available devices
2. Probe for an existing session before calling open (see below)
3. Pass --device "<name>" on open; subsequent commands in the same session inherit it

A previous conversation may have left an active session bound to the device. Calling open with a new session or the default session will then fail with a conflict error. Always probe first using appstate (lightweight, no file output):
# Probe: check if a session already owns this device
agent-device appstate --device "iPhone 16"
| Outcome | What it means | What to do |
|---|---|---|
| Succeeds | Default session already owns this device | Use it — no open needed |
| Device is already in use by session "X" | Session X owns this device | Use --session X for all commands (no open needed) |
| Session "default" is bound to device "Y" | Default session owns a different device | Use a new --session <name> and proceed with open |
| No active session / device not found | No session exists yet | Proceed with open --device "<name>" normally |
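The table above can be sketched as a small helper that maps a probe result to the next step. This is only an illustration: the function name next_step and its output labels are hypothetical, and the message patterns assume the error wording matches the table exactly, which may not hold for every agent-device version.

```shell
# Map an appstate probe result to the next step from the table above.
# Assumption: error messages are worded as in the table; adjust the patterns if not.
# Usage: next_step <exit_code> <output_text>
# Real use (not run here):
#   out=$(agent-device appstate --device "iPhone 16" 2>&1); next_step $? "$out"
next_step() {
  ns_code=$1
  ns_out=$2
  if [ "$ns_code" -eq 0 ]; then
    echo "use-default"              # default session already owns the device
    return 0
  fi
  case $ns_out in
    *'in use by session "'*)
      # Extract the session name from the quoted part of the message.
      ns_name=$(printf '%s\n' "$ns_out" | sed -n 's/.*in use by session "\([^"]*\)".*/\1/p')
      echo "use-session $ns_name" ;;
    *'is bound to device "'*)
      echo "open-new-session" ;;    # default session holds a different device
    *)
      echo "open" ;;                # no session exists yet
  esac
}

next_step 1 'Device is already in use by session "ios16"'   # prints: use-session ios16
```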
iOS:
# First: discover available devices
agent-device devices
# Probe for existing session
agent-device appstate --device "iPhone 16"
# If error says "in use by session ios16" → use --session ios16
# If succeeds → default session works, skip open
# Only if no session exists: open with explicit device targeting
agent-device open FlatListPro --device "iPhone 16"
# Subsequent commands — no --device needed (session remembers)
agent-device snapshot -i -c --json # primary: get elements with rects
agent-device press <x> <y>
agent-device screenshot /tmp/verify.png && sips --resampleHeight 852 /tmp/verify.png >/dev/null # verification
Android (also requires --session and --platform):
Use agent-device apps --platform android --user-installed to discover the installed package name.
# Probe for existing session
agent-device appstate --device "Android35" --session droid --platform android
# If succeeds → session already exists, skip open
# Only if no session exists: bind session to Android (replace <package> with actual package name)
agent-device open <package> --session droid --platform android \
--device "Android35" \
--activity <package>/.MainActivity
# All subsequent commands: just --session droid
agent-device snapshot -i -c --json --session droid # primary
agent-device press <x> <y> --session droid
agent-device screenshot /tmp/verify.png --session droid && sips --resampleHeight 852 /tmp/verify.png >/dev/null # verification
The primary method uses the accessibility tree snapshot for exact element coordinates. Screenshots are the fallback for visual verification.
agent-device snapshot -i -c --json
This returns interactive (-i) elements with their rect coordinates (-c) in JSON format. Each element looks like:
{
"@ref": "@e25",
"role": "button",
"label": "Settings",
"rect": {"x": 141, "y": 2032, "width": 154, "height": 154}
}
Search the snapshot output for your target by matching label, identifier, or value. Example: looking for "Settings" → find the element with "label": "Settings".
Calculate the center of the element's rect:
x = rect.x + rect.width / 2
y = rect.y + rect.height / 2

Then press at those coordinates.
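As a worked example of the center calculation, the sketch below applies it to the "Settings" rect shown earlier. The helper name rect_center is hypothetical, and it rounds halves down with integer arithmetic on the assumption that whole-number coordinates are acceptable to press.

```shell
# Usage: rect_center <x> <y> <width> <height>
# Prints "<center_x> <center_y>" using integer arithmetic (halves round down).
rect_center() {
  echo "$(( $1 + $3 / 2 )) $(( $2 + $4 / 2 ))"
}

# Rect from the sample element: x=141, y=2032, width=154, height=154
rect_center 141 2032 154 154   # prints: 218 2109
```

The two numbers printed are the arguments to pass to press.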
After pressing, take a screenshot to confirm the action worked:
agent-device screenshot /tmp/verify.png && sips --resampleHeight 852 /tmp/verify.png >/dev/null
Then Read /tmp/verify.png to view it.
When an element is not in the accessibility tree (e.g., canvas-rendered content, custom drawn views), fall back to screenshots with percentage-based coordinate estimation.
iOS and Android use different coordinate systems for press. The exact dimensions vary by device. You MUST discover them dynamically at the start of each session.
- Android: press takes raw pixel coordinates (same as screenshot dimensions)
- iOS: press takes logical point coordinates (screenshot pixels / scale factor)
Snapshot rect values are already in the correct press coordinate space — skip this if you're only using snapshots. This is only needed when estimating coordinates from screenshots. Run once per platform:
agent-device screenshot /tmp/screen.png # (add --session droid for Android)
sips -g pixelWidth -g pixelHeight /tmp/screen.png
Then compute the press dimensions:
# Android: press coords = screenshot pixels
PRESS_W = RAW_W
PRESS_H = RAW_H
# iOS: press coords = screenshot pixels / scale
PRESS_W = RAW_W / 3 # (use /2 for iPhone SE or iPad)
PRESS_H = RAW_H / 3
Remember these values for the rest of the session. All coordinate calculations use them.
x = PRESS_W * (x_percent / 100)
y = PRESS_H * (y_percent / 100)
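The two calculations above can be sketched as shell helpers. The names press_dims and pct_to_press are hypothetical, and the 1179x2556 screenshot size is only an illustrative value for a 3x iOS device; use the dimensions that sips actually reports.

```shell
# Usage: press_dims <raw_w> <raw_h> <scale>
# scale is 1 for Android, 3 (or 2) for iOS. Prints "<press_w> <press_h>".
press_dims() {
  echo "$(( $1 / $3 )) $(( $2 / $3 ))"
}

# Usage: pct_to_press <press_w> <press_h> <x_percent> <y_percent>
# Prints "<x> <y>" in press coordinates (integer arithmetic, rounds down).
pct_to_press() {
  echo "$(( $1 * $3 / 100 )) $(( $2 * $4 / 100 ))"
}

press_dims 1179 2556 3        # prints: 393 852
pct_to_press 393 852 50 35    # prints: 196 298
```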
open FlatListPro --device "iPhone 16" # Launch app (iOS — always specify --device on first open)
# Launch app (Android: discover package name with `apps --platform android --user-installed`)
open <package> \
--device "Android35" --session droid --platform android \
--activity <package>/.MainActivity
close FlatListPro # Kill app
back # Navigate back (Android: in-app; iOS: may go to previous app)
home # Device home screen
app-switcher # Open app switcher
snapshot -i -c --json # Interactive elements with rects (primary method)
press <x> <y> # Tap (iOS=logical points, Android=pixels)
press <x> <y> --double-tap # Double-tap
longpress <x> <y> [durationMs] # Long press (default 500ms)
type "text" # Type into focused field (tap input first)
scroll <up|down|left|right> [0-1] # Scroll in direction with amount
swipe <x1> <y1> <x2> <y2> [durationMs] # Precise swipe between coordinates
wait <ms> # Wait milliseconds (max 500ms — the app is fast)
screenshot /tmp/screen.png # Save screenshot
record start ./recording.mov # Start video recording (iOS only — see below for Android)
record stop # Stop recording (iOS only)
To view a screenshot, downsample and read:
agent-device screenshot /tmp/screen.png && sips --resampleHeight 852 /tmp/screen.png >/dev/null
Then Read /tmp/screen.png.
agent-device record is broken on Android emulators (API 35+) — it sends SIGINT to the local adb process instead of the on-device screenrecord, producing a corrupt MP4. Use adb directly for Android recording.
First, resolve the serial once per session (store in $SERIAL):
SERIAL=$(adb devices | grep -w device | head -1 | cut -f1)
Then use it for recording:
# Start (run in background)
adb -s $SERIAL shell screenrecord /sdcard/agent-rec.mp4 &
# Stop (SIGINT the on-device process, then pull)
adb -s $SERIAL shell kill -INT $(adb -s $SERIAL shell pidof screenrecord)
sleep 2
adb -s $SERIAL pull /sdcard/agent-rec.mp4 /tmp/recording.mp4
adb -s $SERIAL shell rm -f /sdcard/agent-rec.mp4
Note: screenrecord only encodes frames when the display changes — interact with the UI during recording or you'll get a single-frame file.
devices # List available devices
apps --platform ios --user-installed # List installed apps
appstate # Show foreground app/activity (useful on Android)
keyboard dismiss # Dismiss on-screen keyboard (Android)
clipboard read # Read clipboard (iOS only)
clipboard write "text" # Write to clipboard
settings appearance dark # Switch to dark mode
settings appearance light # Switch to light mode
settings wifi off # Toggle wifi
settings permission grant camera # Grant camera permission
- agent-device open can hang indefinitely on CI. Set reasonable timeouts and be prepared to fall back to code-only verification with unit tests.
- If the app name is not recognized, use the full bundle ID (org.reactjs.native.example.FlatListPro) instead of FlatListPro.
- back on iOS navigates to the previous app (not always within the current app); use press on the back arrow instead
- Focus the input field with press before using type
- swipe down near the top of the screen can trigger the notification shade; start swipes well within the content area
- Prefer snapshot over screenshots for finding elements: it gives exact coordinates

Not every press needs a screenshot afterward. Take one when the result of the press would not otherwise be confirmed.
Skip it when tapping obvious, large targets (tab bar items, prominent buttons) where the next snapshot or action will confirm success anyway.
A snapshot gives you rects for every interactive element on screen. If you need to tap multiple elements on the same screen (e.g., fill a form), compute all the centers from one snapshot and press them in sequence — don't re-snapshot between each tap unless the screen layout changes (navigation, modal dismiss, keyboard appearing).
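A minimal sketch of the "one snapshot, many presses" idea: the hypothetical helper below reads rects that have already been pulled out of a single snapshot (one "x y width height" line each) and prints the corresponding press commands. It only prints them as a dry run, and the rect values are made up for illustration.

```shell
# Reads one rect per line as "<x> <y> <width> <height>" and prints a press
# command for each center. Nothing is executed; the output is a dry run.
presses_from_rects() {
  while read -r pr_x pr_y pr_w pr_h; do
    echo "agent-device press $(( pr_x + pr_w / 2 )) $(( pr_y + pr_h / 2 ))"
  done
}

# Two form fields taken from the same snapshot (illustrative values).
printf '%s\n' "40 300 300 60" "40 400 300 60" | presses_from_rects
# prints:
# agent-device press 190 330
# agent-device press 190 430
```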
Screenshots are too slow (~300ms per capture) to catch brief loading spinners or animations. Use video recording + frame extraction instead. agent-device sometimes fails to record properly until at least one press has been performed with it, which can produce a very small file.

1. Start recording
2. Perform the action
3. Stop recording
4. Extract frames with ffmpeg at high FPS

record start needs ~3 seconds of lead time before performing the action. The recording daemon takes time to initialize; without this delay, the recording captures a static image and the action is missed entirely.
Similarly, wait at least 4-5 seconds after the action before calling record stop to capture the full animation and settle.
IMPORTANT: Do NOT put recording commands inside a bash script. When record start, sleep, action commands, and record stop are all in one script, the recording often captures only a fraction of a second. Instead, run each step as a separate Bash tool call:
iOS:
# Step 1: Start recording (separate Bash call)
agent-device record start /tmp/evidence.mov --session ios
# Step 2: Wait + perform action + wait (separate Bash call)
sleep 3 && agent-device swipe 197 340 197 680 800 --session ios && sleep 5
# Step 3: Stop recording (separate Bash call)
agent-device record stop --session ios
Android (uses adb workaround — see "Android Recording Workaround" above):
# Step 1: Start recording (separate Bash call)
adb -s $SERIAL shell screenrecord /sdcard/agent-rec.mp4 &
# Step 2: Wait + perform action + wait (separate Bash call)
sleep 3 && agent-device swipe 540 700 540 1400 800 --session droid && sleep 5
# Step 3: Stop + pull recording (separate Bash call)
adb -s $SERIAL shell kill -INT $(adb -s $SERIAL shell pidof screenrecord) && sleep 2 && adb -s $SERIAL pull /sdcard/agent-rec.mp4 /tmp/evidence.mp4 && adb -s $SERIAL shell rm -f /sdcard/agent-rec.mp4
Do NOT guess which frames show the action. Use MD5 hashes to find frames that actually differ:
# Find which frames are unique (not identical to previous)
prev_hash=""
for f in /tmp/frames/frame-*.png; do
hash=$(md5 -q "$f")
if [[ "$hash" != "$prev_hash" ]]; then
echo "$(basename "$f"): CHANGED"
prev_hash="$hash"
fi
done
If ALL frames have the same hash, the recording did not capture the action — re-record with more lead time.
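The full example that follows uses the variables X_MID, Y_35PCT, and Y_75PCT without defining them. One way to set them, assuming PRESS_W and PRESS_H hold the press dimensions discovered for the session (the 393x852 values here are placeholders):

```shell
# Placeholder press dimensions; replace with the values discovered for the session.
PRESS_W=393
PRESS_H=852

# Swipe endpoints as fractions of the press dimensions (integer arithmetic).
X_MID=$(( PRESS_W / 2 ))
Y_35PCT=$(( PRESS_H * 35 / 100 ))
Y_75PCT=$(( PRESS_H * 75 / 100 ))

echo "$X_MID $Y_35PCT $Y_75PCT"   # prints: 196 298 639
```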
iOS:
# Step 1 (separate Bash call): Start recording
agent-device record start /tmp/loading-evidence.mov
# Step 2 (separate Bash call): Wait for recording to initialize, perform action, wait for completion
sleep 3 && agent-device swipe $X_MID $Y_35PCT $X_MID $Y_75PCT 500 && sleep 5
# Step 3 (separate Bash call): Stop recording
agent-device record stop
Android:
# Step 1 (separate Bash call): Start recording
adb -s $SERIAL shell screenrecord /sdcard/agent-rec.mp4 &
# Step 2 (separate Bash call): Wait, perform action, wait
sleep 3 && agent-device swipe $X_MID $Y_35PCT $X_MID $Y_75PCT 500 --session droid && sleep 5
# Step 3 (separate Bash call): Stop + pull
adb -s $SERIAL shell kill -INT $(adb -s $SERIAL shell pidof screenrecord) && sleep 2 && adb -s $SERIAL pull /sdcard/agent-rec.mp4 /tmp/loading-evidence.mp4 && adb -s $SERIAL shell rm -f /sdcard/agent-rec.mp4
# Step 4 (same or separate call): Extract frames + find changes
rm -rf /tmp/loading-frames && mkdir -p /tmp/loading-frames
# (for the Android recording, use /tmp/loading-evidence.mp4 as the input)
ffmpeg -y -i /tmp/loading-evidence.mov -vf "fps=30" /tmp/loading-frames/frame-%04d.png 2>/dev/null
# Find changed frames via MD5
prev_hash=""
for f in /tmp/loading-frames/frame-*.png; do
hash=$(md5 -q "$f")
if [[ "$hash" != "$prev_hash" ]]; then
echo "$(basename "$f"): CHANGED"
prev_hash="$hash"
fi
done
Then downsample and read the changed frames:
# Downsample specific changed frames for LLM viewing
sips --resampleHeight 852 /tmp/loading-frames/frame-0090.png --out /tmp/loading-frames/view-0090.png >/dev/null
Read /tmp/loading-frames/view-0090.png
| Scenario | Approach |
|---|---|
| Navigating / tapping UI elements | snapshot -i -c --json + compute center + press |
| Verifying a loading spinner exists | Video + frame extraction |
| Visual verification after an action | screenshot + downsample + Read |
| Element not in accessibility tree | screenshot + percentage estimation |
| Evidence for PR / bug report | Video recording (share .mov file) |
- swipe from ~35% down to ~77% down (within the content area)
- Use scroll down when at the top of a list; swipe down can trigger the notification shade

For quickly performing a sequence of interactions (press, swipe, type), use a bash script. Manual step-by-step execution is too slow to catch fleeting UI states. Note: this is for interaction commands only. record start/record stop must still be separate Bash calls (see "Capturing Transient States" above).
bash /tmp/test-script.sh
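A sketch of how such a script might be written before it is run. The coordinates, the typed text, and the droid session name are placeholders, and passing --session to type is assumed to work the same way it does for press and swipe. The snippet only writes the file and syntax-checks it; it does not execute any agent-device command.

```shell
# Write a short interaction script, then syntax-check it without running it.
# Coordinates, text, and the "droid" session are placeholders for illustration.
cat > /tmp/test-script.sh <<'EOF'
#!/bin/bash
set -e
agent-device press 190 330 --session droid              # focus the input field
agent-device type "hello" --session droid               # type into the focused field
agent-device swipe 540 700 540 1400 800 --session droid # swipe within the content area
EOF

bash -n /tmp/test-script.sh && echo "syntax ok"   # prints: syntax ok
```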