GUI automation via visual detection: clicking, typing, reading content, navigating menus, and filling forms, all through a screenshot → detect → act workflow. Supports macOS and Linux.
Before any GUI operation, run:

`python3 {baseDir}/scripts/activate.py`
This detects your OS, sets up the correct action commands, and outputs platform context.
After running, {baseDir}/actions/_actions.yaml contains your platform's commands.
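A minimal sketch of consuming that file, assuming a flat `action: command` layout (the example entries and the layout itself are illustrative assumptions, not the actual `_actions.yaml` schema):

```python
# Hypothetical sketch: load the platform command map that activate.py
# writes to actions/_actions.yaml. The flat "action: command" layout
# and the sample entries below are assumptions, not the real format.

def load_actions(yaml_text: str) -> dict[str, str]:
    """Parse a flat 'key: value' YAML subset into a dict."""
    actions = {}
    for line in yaml_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition(":")  # split at first colon only
        actions[key.strip()] = value.strip()
    return actions

# Example file contents (illustrative only; macOS-style commands)
sample = """
# _actions.yaml (example)
screenshot: screencapture -x /tmp/screen.png
click: cliclick c:{x},{y}
type: cliclick t:{text}
"""

actions = load_actions(sample)
print(actions["click"].format(x=100, y=200))  # cliclick c:100,200
```

In practice a full YAML parser would be used; the point is only that after activation the agent reads one mapping of abstract actions to concrete platform commands.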
OBSERVE → LEARN → ACT → VERIFY → SAVE
OBSERVE — Take screenshot → run OCR + detector → understand current state
→ read {baseDir}/skills/gui-observe/SKILL.md
LEARN — First time with an app? Save components to memory
→ read {baseDir}/skills/gui-learn/SKILL.md
→ learn_from_screenshot() auto-outputs app tips if available
ACT — Pick target → execute using _actions.yaml commands → verify
→ read {baseDir}/skills/gui-act/SKILL.md
→ read {baseDir}/actions/_actions.yaml for available commands
VERIFY — Screenshot again → confirm action succeeded
SAVE — Record state transitions to memory
→ read {baseDir}/skills/gui-memory/SKILL.md for memory structure
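The loop above can be sketched as a retry-until-verified step, where `observe()`, `act()`, and `verify()` are hypothetical stand-ins for "screenshot + OCR/detector", "execute an `_actions.yaml` command", and "screenshot again + confirm":

```python
# Minimal sketch of the OBSERVE → ACT → VERIFY loop. The three
# callables are hypothetical placeholders, not real APIs from this
# skill; only the retry shape of the workflow is shown.

def run_step(observe, act, verify, max_attempts=3):
    """Retry an action until verification succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        state = observe()   # screenshot → OCR + detector → current state
        act(state)          # click/type based on the detected state
        if verify():        # screenshot again → confirm the action landed
            return attempt
    raise RuntimeError(f"action did not verify after {max_attempts} attempts")

# Toy usage: verification succeeds on the second attempt.
calls = {"n": 0}

def fake_verify():
    calls["n"] += 1
    return calls["n"] >= 2

attempts = run_step(lambda: {}, lambda s: None, fake_verify)
print(attempts)  # 2
```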
| Sub-Skill | When to read |
|---|---|
| skills/gui-observe/SKILL.md | Before screenshots or detection |
| skills/gui-learn/SKILL.md | Before learning a new app |
| skills/gui-act/SKILL.md | Before any click/type action |
| skills/gui-memory/SKILL.md | For memory structure details |
| skills/gui-workflow/SKILL.md | For multi-step navigation |
| skills/gui-setup/SKILL.md | For first-time machine setup |
| skills/gui-report/SKILL.md | For task performance reporting |
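The table above is a static lookup, so it can be kept as data. A hypothetical helper (the function name and phase keys are assumptions; the paths come from the table):

```python
# Hypothetical helper: resolve which sub-skill doc to read for a
# given workflow phase. Paths are taken from the table above; the
# phase names and the helper itself are illustrative assumptions.
SKILL_DOCS = {
    "observe": "skills/gui-observe/SKILL.md",
    "learn": "skills/gui-learn/SKILL.md",
    "act": "skills/gui-act/SKILL.md",
    "memory": "skills/gui-memory/SKILL.md",
    "workflow": "skills/gui-workflow/SKILL.md",
    "setup": "skills/gui-setup/SKILL.md",
    "report": "skills/gui-report/SKILL.md",
}

def skill_doc(base_dir: str, phase: str) -> str:
    """Return the full path of the SKILL.md to read for a phase."""
    return f"{base_dir}/{SKILL_DOCS[phase]}"

print(skill_doc("/opt/gui", "act"))  # /opt/gui/skills/gui-act/SKILL.md
```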