General-purpose skill for navigating and interacting with an iOS app running in a Simulator using WebDriverAgent (WDA). Use when the user asks to tap buttons, swipe, scroll, type text, check what's on screen, go to a tab or screen, automate a flow, or verify UI state in a simulator app. Also use when the user wants to take screenshots, inspect the accessibility tree, explore screen hierarchy, or test a UI flow end-to-end on a simulator. Even if the user says something casual like "open settings in the app", "click that button", or "what's showing on the simulator" — this skill applies.
Clone and build WebDriverAgent:
mkdir -p .build
git clone https://github.com/appium/WebDriverAgent.git .build/WebDriverAgent
cd .build/WebDriverAgent
xcodebuild build-for-testing \
-project WebDriverAgent.xcodeproj \
-scheme WebDriverAgentRunner \
-destination "platform=iOS Simulator,name=iPhone 17" \
CODE_SIGNING_ALLOWED=NO
Start and stop WDA using the lifecycle scripts. WDA must be running before using any curl commands below.
# Start WDA (waits until ready, ~60s first time)
ruby scripts/wda-start.rb [--udid <UDID>] [--port <PORT>]
# Check if WDA is running
curl -s http://localhost:8100/status | head -c 200
# Stop WDA
ruby scripts/wda-stop.rb [--port <PORT>]
Both scripts auto-detect the first booted simulator. Use --udid to target a specific one.
Always prefer the accessibility tree over screenshots. The tree is text-based, faster to process, and doesn't require viewing an image.
WDA offers two tree formats via GET /source?format=<FORMAT>:
format=description -- compact plaintext (~25 KB)
curl -s 'http://localhost:8100/source?format=description' | jq -r .value
Returns a human-readable indented tree. Each line shows an element with its type, memory address, frame as {{x, y}, {width, height}}, and optional attributes (identifier, label, Selected, etc.):
NavigationBar, 0x105351660, {{0.0, 62.0}, {402.0, 54.0}}, identifier: 'my-site-navigation-bar'
Button, 0x105351a20, {{16.0, 62.0}, {44.0, 44.0}}, identifier: 'BackButton', label: 'Site Name'
StaticText, 0x105351b40, {{178.7, 73.7}, {44.7, 20.7}}, label: 'Posts'
Use this format by default. It's ~15x smaller than JSON, easy to reason about, and contains all the information needed for navigation (types, labels, identifiers, and coordinates).
format=json -- structured data (~375 KB)
curl -s 'http://localhost:8100/source?format=json' > /tmp/wda-tree.json
Returns deeply nested JSON. Use this when you need to programmatically extract coordinates or search for elements with jq. The response has the structure {"value": <root_node>, "sessionId": "..."}. Each node has:
| Field | Description |
|---|---|
| type | Element type (e.g., Button, StaticText, NavigationBar) |
| label | Accessibility label (user-visible text) |
| name | Accessibility identifier (developer-assigned ID) |
| value | Current value (e.g., text field contents, switch state) |
| rect | {"x": N, "y": N, "width": N, "height": N} -- structured, use for tap coordinates |
| frame | Same as rect but as a string: "{{x, y}, {w, h}}" |
| isEnabled | Whether the element is interactive |
| children | Array of child nodes |
Search example with jq:
cat /tmp/wda-tree.json | jq '.. | objects | select(.label == "Settings")'
From the description format, parse the frame {{x, y}, {width, height}} and compute:
tap_x = x + width / 2
tap_y = y + height / 2
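The computation above can be sketched as a small shell helper. This is a minimal sketch, assuming each tree line contains exactly one frame in the {{x, y}, {width, height}} form shown earlier; frame_center is an illustrative name, not part of WDA.

```shell
# frame_center LINE: print "tap_x tap_y" for one line of the description tree.
# Assumes the line contains a single frame written as {{x, y}, {w, h}}.
# Prints nothing if no frame is found.
frame_center() {
  printf '%s\n' "$1" |
    sed -n 's/.*{{\([0-9.-]*\), \([0-9.-]*\)}, {\([0-9.-]*\), \([0-9.-]*\)}}.*/\1 \2 \3 \4/p' |
    awk '{ printf "%.1f %.1f\n", $1 + $3 / 2, $2 + $4 / 2 }'
}

# Example with a line from the sample tree above
frame_center "Button, 0x105351a20, {{16.0, 62.0}, {44.0, 44.0}}, identifier: 'BackButton'"
# prints: 38.0 84.0
```

The two printed numbers can be used directly as the x and y of a pointerMove action.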
From the JSON format, use the rect object:
tap_x = rect.x + rect.width / 2
tap_y = rect.y + rect.height / 2
Use this priority order when locating elements in the tree:
1. identifier / name -- most stable; developer-assigned, unlikely to change across locales
2. label -- accessibility label; user-visible text, may change with localization
3. type + context -- e.g., "Button inside NavigationBar" or "Cell inside Table"

In the description format, search the text output for labels or identifiers. In the JSON format, use jq:
# Exact match by identifier
cat /tmp/wda-tree.json | jq '.. | objects | select(.name == "settings-button")'
# Exact match by label
cat /tmp/wda-tree.json | jq '.. | objects | select(.label == "Settings")'
# Partial match by label
cat /tmp/wda-tree.json | jq '.. | objects | select(.label? // "" | contains("Settings"))'
# Type + context: find Buttons inside NavigationBar
cat /tmp/wda-tree.json | jq '.. | objects | select(.type == "NavigationBar") | .. | objects | select(.type == "Button")'
The root node's rect gives the screen dimensions (e.g., width: 393, height: 852).
Most action endpoints require a session ID. Create one if /status doesn't return a sessionId:
# Create session
curl -s -X POST http://localhost:8100/session \
-H 'Content-Type: application/json' \
-d '{"capabilities":{"alwaysMatch":{}}}' | jq .
The session ID is at value.sessionId in the response. Use it in subsequent action URLs as SESSION_ID.
To check for an existing session, look at the sessionId field in the /status response.
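The check-then-create logic described above might be wrapped in one helper. This is a sketch, assuming /status exposes sessionId at the top level and the create response carries it at value.sessionId, as stated above; WDA_URL and ensure_session are illustrative names introduced here.

```shell
# Base URL of the WDA server (assumed default port from the commands above).
WDA_URL=${WDA_URL:-http://localhost:8100}

# ensure_session: print an existing session ID, or create a session and print its ID.
# Returns non-zero if no session ID could be obtained.
ensure_session() {
  local sid
  # Reuse the session reported by /status, if any (null becomes empty).
  sid=$(curl -s "$WDA_URL/status" | jq -r '.sessionId // empty')
  if [ -z "$sid" ]; then
    sid=$(curl -s -X POST "$WDA_URL/session" \
      -H 'Content-Type: application/json' \
      -d '{"capabilities":{"alwaysMatch":{}}}' | jq -r '.value.sessionId // empty')
  fi
  [ -n "$sid" ] && printf '%s\n' "$sid"
}

# Usage: SESSION_ID=$(ensure_session)
```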
Touch gestures (tap, long press, swipe) use POST /session/SESSION_ID/actions with W3C WebDriver pointer actions. To tap at (X, Y):
curl -s -X POST http://localhost:8100/session/SESSION_ID/actions \
-H 'Content-Type: application/json' \
-d '{
"actions": [{
"type": "pointer",
"id": "finger1",
"parameters": {"pointerType": "touch"},
"actions": [
{"type": "pointerMove", "duration": 0, "x": X, "y": Y},
{"type": "pointerDown"},
{"type": "pointerUp"}
]
}]
}'
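To avoid retyping the request body for every tap, the payload above could be generated by a helper. This is a sketch only; tap_payload and tap are illustrative names, and the URL assumes the default port used throughout this document.

```shell
# tap_payload X Y: print the W3C pointer-action body for a single tap at (X, Y).
tap_payload() {
  printf '{"actions":[{"type":"pointer","id":"finger1","parameters":{"pointerType":"touch"},"actions":[{"type":"pointerMove","duration":0,"x":%s,"y":%s},{"type":"pointerDown"},{"type":"pointerUp"}]}]}\n' "$1" "$2"
}

# tap SESSION_ID X Y: send the tap to WDA.
tap() {
  curl -s -X POST "http://localhost:8100/session/$1/actions" \
    -H 'Content-Type: application/json' \
    -d "$(tap_payload "$2" "$3")"
}

# Usage: tap "$SESSION_ID" 38 84
```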
WDA can find and tap elements directly without computing coordinates. This is useful when an element has a stable accessibility identifier:
# Find the element by accessibility identifier
curl -s -X POST http://localhost:8100/session/SESSION_ID/elements \
-H 'Content-Type: application/json' \
-d '{"using": "accessibility id", "value": "settings-button"}' | jq .
# Tap it (ELEMENT_ID comes from the response above, at value[0].ELEMENT)
curl -s -X POST http://localhost:8100/session/SESSION_ID/element/ELEMENT_ID/click
The coordinate approach above is preferred because it works directly with the tree data already being fetched. Use element-based tapping when coordinate parsing is awkward or when interacting with elements found by predicate.
For a long press, add a pause between pointerDown and pointerUp. The pause duration is in milliseconds.
curl -s -X POST http://localhost:8100/session/SESSION_ID/actions \
-H 'Content-Type: application/json' \
-d '{
"actions": [{
"type": "pointer",
"id": "finger1",
"parameters": {"pointerType": "touch"},
"actions": [
{"type": "pointerMove", "duration": 0, "x": X, "y": Y},
{"type": "pointerDown"},
{"type": "pause", "duration": 1000},
{"type": "pointerUp"}
]
}]
}'
To swipe, move from (x1, y1) to (x2, y2), setting a duration (in milliseconds) on the second pointerMove.
curl -s -X POST http://localhost:8100/session/SESSION_ID/actions \
-H 'Content-Type: application/json' \
-d '{
"actions": [{
"type": "pointer",
"id": "finger1",
"parameters": {"pointerType": "touch"},
"actions": [
{"type": "pointerMove", "duration": 0, "x": X1, "y": Y1},
{"type": "pointerDown"},
{"type": "pointerMove", "duration": 500, "x": X2, "y": Y2},
{"type": "pointerUp"}
]
}]
}'
Swipe direction guide (given screen size W x H):
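- Swipe up (scrolls content down): (W/2, H/2 + H/6) to (W/2, H/2 - H/6)
- Swipe down (scrolls content up): (W/2, H/2 - H/6) to (W/2, H/2 + H/6)
- Swipe left: (W/2 + W/4, H/2) to (W/2 - W/4, H/2)
- Swipe right: (W/2 - W/4, H/2) to (W/2 + W/4, H/2)
- Back (left-edge swipe): (5, H/2) to (W*2/3, H/2)

To go back to the previous screen:
- Tap the back button in the navigation bar
- Swipe from the left edge: (5, H/2) to (W*2/3, H/2) (see Swipe direction guide above)

The button approach is more reliable because edge swipes can be finicky depending on gesture recognizers.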
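The coordinate pairs in the swipe direction guide can be computed from the screen size. The following is a sketch; swipe_coords is an illustrative name, and the direction names (which describe finger movement) are this sketch's own convention rather than anything defined by WDA.

```shell
# swipe_coords DIRECTION W H: print "x1 y1 x2 y2" for a swipe on a W x H screen.
# DIRECTION is one of: up, down, left, right, back (left-edge swipe).
# Returns non-zero for an unknown direction.
swipe_coords() {
  awk -v dir="$1" -v w="$2" -v h="$3" 'BEGIN {
    cx = w / 2; cy = h / 2
    if      (dir == "up")    { x1 = cx;         y1 = cy + h / 6; x2 = cx;         y2 = cy - h / 6 }
    else if (dir == "down")  { x1 = cx;         y1 = cy - h / 6; x2 = cx;         y2 = cy + h / 6 }
    else if (dir == "left")  { x1 = cx + w / 4; y1 = cy;         x2 = cx - w / 4; y2 = cy }
    else if (dir == "right") { x1 = cx - w / 4; y1 = cy;         x2 = cx + w / 4; y2 = cy }
    else if (dir == "back")  { x1 = 5;          y1 = cy;         x2 = w * 2 / 3;  y2 = cy }
    else { exit 1 }
    printf "%.1f %.1f %.1f %.1f\n", x1, y1, x2, y2
  }'
}

# Usage: read -r X1 Y1 X2 Y2 <<EOF
# $(swipe_coords up 393 852)
# EOF
```

The four values map onto X1, Y1, X2, Y2 in the swipe request shown earlier.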
curl -s -X POST http://localhost:8100/session/SESSION_ID/wda/keys \
-H 'Content-Type: application/json' \
-d '{"value": ["h","e","l","l","o"]}'
The value array contains individual characters. An element must be focused first (tap a text field before typing).
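Splitting a string into the per-character array by hand is error-prone, so the body could be generated with jq, which this document already relies on. This is a sketch; keys_payload is an illustrative name.

```shell
# keys_payload TEXT: print {"value": [...]} with one array entry per character.
# jq handles JSON escaping of quotes and backslashes.
keys_payload() {
  jq -cn --arg text "$1" '{value: ($text | explode | map([.] | implode))}'
}

# Usage:
# curl -s -X POST "http://localhost:8100/session/$SESSION_ID/wda/keys" \
#   -H 'Content-Type: application/json' \
#   -d "$(keys_payload 'hello')"
```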
Select all text and delete it:
# Select all (Ctrl+A) then delete
curl -s -X POST http://localhost:8100/session/SESSION_ID/wda/keys \
-H 'Content-Type: application/json' \
-d '{"value": ["\u0001"]}'
curl -s -X POST http://localhost:8100/session/SESSION_ID/wda/keys \
-H 'Content-Type: application/json' \
-d '{"value": ["\u007F"]}'
Alternatively, if you have an element ID:
curl -s -X POST http://localhost:8100/session/SESSION_ID/element/ELEMENT_ID/clear
After performing an action (tap, swipe, type), the UI may be animating or loading. Instead of using a fixed sleep, poll for the expected state: re-fetch the tree at short intervals until the expected element appears or a timeout is reached.
This approach is more reliable than fixed delays because it adapts to variable animation durations and network load times.
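One way to poll is to re-fetch the description tree until an expected label or identifier shows up. This is a minimal sketch; fetch_tree and wait_for_text are illustrative names, and the default of 20 attempts at 1-second intervals is an arbitrary assumption, not a WDA recommendation.

```shell
# fetch_tree: the only WDA-specific piece; prints the description-format tree.
fetch_tree() {
  curl -s 'http://localhost:8100/source?format=description' | jq -r .value
}

# wait_for_text EXPECTED [ATTEMPTS] [INTERVAL_SECONDS]
# Returns 0 as soon as the tree contains EXPECTED, 1 if it never appears.
wait_for_text() {
  local expected="$1" attempts="${2:-20}" interval="${3:-1}" i=0 tree
  while [ "$i" -lt "$attempts" ]; do
    tree=$(fetch_tree)
    case $tree in
      *"$expected"*) return 0 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}

# Usage: wait_for_text "label: 'Posts'" || echo "screen did not appear" >&2
```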
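To find an element in a long scrollable list, fetch the tree and check for the element; if it is missing, swipe up and check again. Swipe near the right edge of the screen (x = screen_width - 30) to avoid tapping interactive elements.
Use the same pattern for horizontal scroll views, adjusting swipe direction accordingly.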
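The scroll-and-search loop could look like the sketch below. All function names are illustrative, SESSION_ID, SCREEN_W, and SCREEN_H are variables this sketch expects to be set (integers for the screen size), and stopping when the tree is unchanged after a swipe is this sketch's own assumption for detecting the end of the list.

```shell
# fetch_tree: print the description-format tree.
fetch_tree() {
  curl -s 'http://localhost:8100/source?format=description' | jq -r .value
}

# swipe_up: one upward swipe near the right edge (x = SCREEN_W - 30).
swipe_up() {
  local x=$((SCREEN_W - 30)) y1=$((SCREEN_H / 2 + SCREEN_H / 6)) y2=$((SCREEN_H / 2 - SCREEN_H / 6))
  curl -s -X POST "http://localhost:8100/session/$SESSION_ID/actions" \
    -H 'Content-Type: application/json' \
    -d "{\"actions\":[{\"type\":\"pointer\",\"id\":\"finger1\",\"parameters\":{\"pointerType\":\"touch\"},\"actions\":[{\"type\":\"pointerMove\",\"duration\":0,\"x\":$x,\"y\":$y1},{\"type\":\"pointerDown\"},{\"type\":\"pointerMove\",\"duration\":500,\"x\":$x,\"y\":$y2},{\"type\":\"pointerUp\"}]}]}" >/dev/null
}

# scroll_until_found TARGET [MAX_SWIPES]
# Returns 0 once TARGET appears in the tree; 1 after MAX_SWIPES swipes
# or when a swipe leaves the tree unchanged (assumed end of list).
scroll_until_found() {
  local target="$1" max_swipes="${2:-10}" i=0 tree previous=''
  while :; do
    tree=$(fetch_tree)
    case $tree in
      *"$target"*) return 0 ;;
    esac
    if [ "$i" -ge "$max_swipes" ]; then return 1; fi
    if [ "$i" -gt 0 ] && [ "$tree" = "$previous" ]; then return 1; fi
    previous=$tree
    swipe_up
    i=$((i + 1))
  done
}
```

After the function returns 0, re-fetch the tree and compute the element's tap coordinates from its current frame, since scrolling moves it.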
Use simctl for screenshots -- more reliable than WDA's base64 approach:
xcrun simctl io <UDID> screenshot /tmp/screenshot.png
To get the booted simulator's UDID:
xcrun simctl list devices booted -j | jq -r '.devices | to_entries[].value[] | select(.state == "Booted") | .udid'
- Swipe near the right edge (x = screen_width - 30) to avoid accidentally tapping interactive elements in the center. Use center only when needed.
- Use duration: 1000 (1 second) for more reliable swipes.
- If WDA stops responding, run wda-start.rb again -- it will reconnect.
- To switch tabs, look for TabBar in the tree. Its children are the individual tabs.

WDA sessions can expire after inactivity. If action requests return HTTP 4xx errors, re-create the session:
curl -s -X POST http://localhost:8100/session \
-H 'Content-Type: application/json' \
-d '{"capabilities":{"alwaysMatch":{}}}' | jq .
After animations or screen transitions, previously fetched coordinates may be wrong. Always re-fetch the tree and recompute coordinates before tapping after any navigation action.
System alerts (location permissions, notification permissions, tracking prompts) can block interactions with the app. Before retrying a failed tap:
Check the tree for an element of type Alert or Sheet, and dismiss it before retrying.

If actions consistently fail or the tree looks unexpected, the app may have crashed. Check and re-launch:
# Confirm the simulator is still booted (this lists devices, not app processes)
xcrun simctl list devices booted
# Re-launch the app
xcrun simctl launch <UDID> <BUNDLE_ID>
After re-launching, create a new WDA session before continuing.