macOS GUI automation CLI. Use steer to see the screen, click elements, type text, send hotkeys, scroll, drag, manage windows and apps, run OCR on Electron apps, and wait for UI conditions.
Binary: apps/steer/.build/release/steer
Run steer --help and steer help <command> to learn each command's flags before using it.
| Command | Purpose |
|---|---|
see | Takes a screenshot (PNG) and walks the accessibility tree. Screenshot always succeeds. Elements are best-effort — may be empty for Electron apps. Pass --ocr to fall back to OCR when the tree is empty. |
click | Click by element ID, label, or coordinates |
type | Type text into focused element or a target |
hotkey | Keyboard shortcuts (cmd+s, return, escape, etc.) |
scroll | Scroll up/down/left/right |
drag |
| Drag between elements or coordinates |
apps | List, launch, or activate apps |
screens | List displays with resolution and scale |
window | Move, resize, minimize, fullscreen, close windows |
ocr | Takes a screenshot and runs Vision OCR on it. Returns text with x/y positions. Use --store to make results clickable (O1, O2, etc.). Use when see returns no elements. |
focus | Show currently focused element |
find | Search elements by text in latest snapshot |
clipboard | Read/write system clipboard |
wait | Wait for app launch or element to appear |
Always pass --json for structured output.
You are controlling a real macOS desktop. You cannot see anything unless you explicitly look. You cannot assume anything worked unless you verify.
Before doing anything, understand the display setup, what's running, and capture the current state of every screen:
steer screens --json → which monitors exist, their resolution
steer apps --json → what apps are running
steer see --screen 0 --json → screenshot of screen 0 (primary)
steer see --screen 1 --json → screenshot of screen 1 (if exists)
Take a screenshot of each screen returned by steer screens. Read them. This gives you a baseline of the desktop before you start making changes.
Before interacting with any app, make sure it's the active window on the right screen:
steer apps activate Safari --json
steer see --app Safari --json → verify it's in front, read the state
NEVER chain multiple steer commands in one bash call. The screen changes after every action. You must look after every action.
The loop is:
steer see — look at the screensteer see — look again to confirm it workedBefore clicking anything:
steer see --app <app> --json to get a fresh snapshotsteer see again to confirm the click landedsteer focus --json to confirm the right element is targeted.
steer apps activate <AppName> --json to switch apps or steer click to focus the correct field.steer focus --jsonsteer type "text" --into T1 --json to click-then-type in one step when targeting a specific fieldsteer type "text" --jsonsteer see that the text appeared correctlysteer hotkey return --json, then steer see to confirm navigationBoth see and ocr save a screenshot PNG. The path is in the JSON output under "screenshot". You can read this image file to visually inspect what's on screen.
Native apps (Safari, Finder, Terminal): steer see --app <name> --json gives you the accessibility tree with labeled elements. If the element list is empty, the screenshot still exists — read it or try --ocr to fall back.
Electron apps (VS Code, Slack, Notion): Accessibility trees are empty. Use OCR instead:
steer ocr --app "VS Code" --store --json
With --store, OCR results become clickable elements (O1, O2, etc.).
Web pages: Accessibility tree may be shallow. Use steer ocr --app Safari --json to read all visible text with positions. Use the positions to click precisely.
Don't assume the UI is ready after an action. Use steer wait or run steer see in a loop:
steer wait --app Safari --json → wait for app to launch
steer wait --for "Submit" --app Safari --json → wait for element to appear
If there are multiple screens, coordinates and screenshots are per-screen. Always check steer screens --json first and use --screen <index> when needed.
Elements from steer see get role-based IDs: B (button), T (text field), S (static text), I (image), C (checkbox), L (link), M (menu item), O (OCR element), etc.
IDs regenerate with each snapshot. Always use IDs from the most recent steer see or steer ocr --store.
steer see after every actionsteer screens before clicking--json always — structured output is reliablesteer help <cmd> if you're unsure about a command's flags