Control macOS GUI apps visually — take screenshots, click, scroll, type. Use when the user asks to interact with any Mac desktop application's graphical interface.
Control any macOS GUI application through a screenshot → pick element → click → verify loop.
Platform: macOS only (requires Apple Vision framework for OCR)
System binaries:
- python3: install via Homebrew (brew install python)
- screencapture: built-in macOS utility

Python packages (install from the skill directory):
pip3 install --break-system-packages -r {baseDir}/requirements.txt
The screenshot command captures a window, uses Apple Vision OCR to detect all text elements, draws numbered annotations on the image, and returns both:
- Annotated image at /tmp/mac_use.png: numbered green boxes around each detected text element
- Element list: [{num: 1, text: "Submit", at: [500, 200]}, {num: 2, text: "Cancel", at: [600, 200]}, ...] where at is the center point [x, y] on the 1000x1000 canvas (origin at top-left)

You receive both by calling Bash (gets JSON with element list) and then Read on /tmp/mac_use.png (gets the visual). Always do both so you can cross-reference the numbers with what you see.
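As a rough illustration of working with the element list, the sketch below looks up an element number by its OCR text. It assumes the real output is valid JSON with quoted keys (num, text, at); the helper name find_element and the substring matching rule are hypothetical and not part of mac_use.py.

```python
import json


def find_element(elements, label):
    """Return the first element whose text contains `label` (case-insensitive), or None."""
    needle = label.casefold()
    for element in elements:
        if needle in element["text"].casefold():
            return element
    return None


# Sample data shaped like the element list described above (assumed valid JSON).
sample = json.loads(
    '[{"num": 1, "text": "Submit", "at": [500, 200]},'
    ' {"num": 2, "text": "Cancel", "at": [600, 200]}]'
)

match = find_element(sample, "cancel")
print(match["num"])  # the number to pass to clicknum
```

A text lookup like this does not replace reading the annotated image: OCR can split or misread labels, so the number should still be checked against /tmp/mac_use.png.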
# List all visible windows
python3 {baseDir}/scripts/mac_use.py list
# Screenshot + annotate (returns image + numbered element list)
python3 {baseDir}/scripts/mac_use.py screenshot <app> [--id N]
# Click element by number (primary click method)
python3 {baseDir}/scripts/mac_use.py clicknum <N>
# Click at canvas coordinates (fallback for unlabeled icons)
python3 {baseDir}/scripts/mac_use.py click --app <app> [--id N] <x> <y>
# Scroll inside a window
python3 {baseDir}/scripts/mac_use.py scroll --app <app> [--id N] <direction> <amount>
# Type text (uses clipboard paste — supports all languages)
python3 {baseDir}/scripts/mac_use.py type [--app <app>] "text here"
# Press key or combo
python3 {baseDir}/scripts/mac_use.py key [--app <app>] <combo>
1. Open the app: open -a "App Name" (optionally with a URL or file path)
2. Wait for it to load: sleep 2
3. Take a screenshot: python3 {baseDir}/scripts/mac_use.py screenshot <app> [--id N]
   This returns JSON with file (image path) and elements (numbered text list).
4. Read /tmp/mac_use.png to see the numbered elements visually
5. Click:
   - clicknum N: pick the number of a detected text element
   - click --app <app> x y: only for unlabeled icons (arrows, close buttons, cart icons) that have no text and therefore no number
6. Verify: take a new screenshot after every clicknum, type, key, or scroll

Show all visible app windows.
python3 {baseDir}/scripts/mac_use.py list
Returns JSON array: [{"app":"Google Chrome","title":"Wikipedia","id":4527,"x":120,"y":80,"w":1200,"h":800}, ...]
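The sketch below shows one way to decide whether --id is needed, using the list output shown above. The function name window_ids_for_app is hypothetical, and the case-insensitive substring match is only an assumption about how the fuzzy <app> match behaves.

```python
import json


def window_ids_for_app(windows, query):
    """Return the IDs of all windows whose app name contains `query` (case-insensitive)."""
    needle = query.casefold()
    return [w["id"] for w in windows if needle in w["app"].casefold()]


# Sample data shaped like the output of the list command; the second and third
# entries are made up for illustration.
windows = json.loads(
    '[{"app": "Google Chrome", "title": "Wikipedia", "id": 4527,'
    '  "x": 120, "y": 80, "w": 1200, "h": 800},'
    ' {"app": "Google Chrome", "title": "Downloads", "id": 4531,'
    '  "x": 200, "y": 120, "w": 900, "h": 600},'
    ' {"app": "Notes", "title": "Notes", "id": 4602,'
    '  "x": 50, "y": 50, "w": 800, "h": 600}]'
)

ids = window_ids_for_app(windows, "chrome")
if len(ids) > 1:
    print("Several windows match; pass --id with one of:", ids)
```

When exactly one ID comes back, the app name alone is enough; with several, choose by title and pass that ID to screenshot, click, or scroll.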
Capture a window, detect text elements via OCR, annotate with numbered markers, and return the element list. The target window is automatically raised to the top before capture, so overlapping windows are handled.
python3 {baseDir}/scripts/mac_use.py screenshot chrome
python3 {baseDir}/scripts/mac_use.py screenshot chrome --id 4527
Arguments:
- <app>: fuzzy, case-insensitive match (e.g. "chrome" matches "Google Chrome")
- --id N: target a specific window ID (required when multiple windows of the same app exist)

Returns JSON with:
- file: path to annotated screenshot (/tmp/mac_use.png)
- id, app, title, scale: window metadata
- elements: array of {num, text, at}, the numbered clickable text elements, where at is [x, y] center coordinates on the 1000x1000 canvas (origin at top-left)

If more than one window matches, rerun with --id. The element list is also saved to /tmp/mac_use_elements.json for clicknum.

Click on a numbered element from the last screenshot. This is the primary click method.
python3 {baseDir}/scripts/mac_use.py clicknum 5
python3 {baseDir}/scripts/mac_use.py clicknum 12
- N: the element number from the last screenshot output
- Returns clicked_num, text, canvas coords, and absolute screen coords

Click at a position using canvas coordinates. Fallback only: use for unlabeled icons.
python3 {baseDir}/scripts/mac_use.py click --app chrome 500 300
python3 {baseDir}/scripts/mac_use.py click --app chrome --id 4527 500 300
Scroll inside an app window.
python3 {baseDir}/scripts/mac_use.py scroll --app chrome down 5
python3 {baseDir}/scripts/mac_use.py scroll --app notes up 10
- Directions: up, down, left, right

Type text into the currently focused input field.
python3 {baseDir}/scripts/mac_use.py type --app chrome "hello world"
python3 {baseDir}/scripts/mac_use.py type --app chrome "你好世界"
- --app: activates the app first to ensure keystrokes go to the right window

Press a single key or key combination.
python3 {baseDir}/scripts/mac_use.py key --app chrome return
python3 {baseDir}/scripts/mac_use.py key --app chrome cmd+a
python3 {baseDir}/scripts/mac_use.py key --app chrome cmd+shift+s
- --app: activates the app first
- Keys: return, tab, escape, space, delete, backspace, up, down, left, right
- Modifiers: cmd, ctrl, alt/opt, shift

Tips:
- Prefer clicknum over click; only use direct coordinates for unlabeled icons
- If you get a multiple_windows error, use list to see all windows, then pass --id
- Some apps open extra windows (for example, mini program windows): use list to find them and --id to target them
- Run sleep 2-3 after open -a before taking a screenshot
- Run osascript -e 'tell application "AppName" to activate' && sleep 1 when the target app may be behind other windows

Coordinate system (click only)

Screenshots are rendered onto a 1000x1000 canvas with the origin at the top-left.

Example workflow:
# 1. Open WeChat
open -a "WeChat"
sleep 3
# 2. List windows to find the mini program window
python3 {baseDir}/scripts/mac_use.py list
# → note the mini program window ID (41266 in this example)
# 3. Screenshot the mini program (annotated + element list)
python3 {baseDir}/scripts/mac_use.py screenshot 微信 --id 41266
# → returns: {"file": "/tmp/mac_use.png", "elements": [{num: 1, text: "搜索", at: [500, 200]}, ...]}
# → Read /tmp/mac_use.png to see annotated image
# 4. Click "搜索" ("Search", element #1)
python3 {baseDir}/scripts/mac_use.py clicknum 1
# 5. Type search query ("炸鸡" means "fried chicken")
python3 {baseDir}/scripts/mac_use.py type --app 微信 "炸鸡"
# 6. Press Enter
python3 {baseDir}/scripts/mac_use.py key --app 微信 return
sleep 2
# 7. Screenshot to see results
python3 {baseDir}/scripts/mac_use.py screenshot 微信 --id 41266
# → Read /tmp/mac_use.png, pick a restaurant by number
# 8. Click on a restaurant (e.g. element #5)
python3 {baseDir}/scripts/mac_use.py clicknum 5
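
The steps above can also be scripted. The sketch below is a minimal Python wrapper around the same commands, under these assumptions: the script prints only JSON on stdout for both screenshot and clicknum, and the output has the elements key described earlier. The helper names (build_command, run_mac_use, click_text) are hypothetical and not part of mac_use.py; base_dir stands in for {baseDir}.

```python
import json
import subprocess
from pathlib import Path


def build_command(base_dir, *args):
    """Build the argv list for one mac_use.py invocation."""
    script = Path(base_dir) / "scripts" / "mac_use.py"
    return ["python3", str(script), *map(str, args)]


def run_mac_use(base_dir, *args):
    """Run one command and parse its stdout as JSON (assumes stdout is JSON only)."""
    result = subprocess.run(
        build_command(base_dir, *args),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)


def click_text(base_dir, app, label, window_id=None):
    """Screenshot the app, then clicknum the first element whose text contains `label`.

    Returns the clicked element, or None if no element matched.
    """
    shot_args = ["screenshot", app]
    if window_id is not None:
        shot_args += ["--id", window_id]
    shot = run_mac_use(base_dir, *shot_args)
    for element in shot["elements"]:
        if label in element["text"]:
            run_mac_use(base_dir, "clicknum", element["num"])
            return element
    return None
```

For instance, click_text(base_dir, "微信", "搜索", window_id=41266) would cover steps 3 and 4 of the example. The wrapper skips the visual check, so a fresh screenshot and a Read of /tmp/mac_use.png are still needed to verify each action.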