Guide for implementing Google Gemini API image understanding - analyze images with captioning, classification, visual QA, object detection, segmentation, and multi-image comparison. Use when analyzing images, answering visual questions, detecting objects, or processing documents with vision.
This skill enables Claude to use Google's Gemini API for advanced image understanding tasks including captioning, classification, visual question answering, object detection, segmentation, and multi-image analysis.
pip install google-genai (Python 3.9+)
The skill supports both Google AI Studio and Vertex AI endpoints.
The skill checks for GEMINI_API_KEY in this order:
1. Process environment: export GEMINI_API_KEY="your-key"
2. .env
3. .claude/.env
4. .claude/skills/.env
5. .claude/skills/gemini-vision/.env
Get your API key: Visit Google AI Studio
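The lookup order above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: it assumes the process environment is checked first and the .env files are then read in the order listed. The names find_api_key and ENV_FILES and the simple KEY=value parsing are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Assumed search order for .env files, relative to the project root.
ENV_FILES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/gemini-vision/.env",
]


def find_api_key(base_dir: str = ".",
                 environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first GEMINI_API_KEY found, or None if none is set."""
    env = os.environ if environ is None else environ

    # 1. The process environment takes priority.
    key = env.get("GEMINI_API_KEY")
    if key:
        return key

    # 2. Fall back to the .env files, in the assumed order.
    for rel in ENV_FILES:
        path = Path(base_dir) / rel
        if not path.is_file():
            continue
        for raw in path.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            if line.startswith("export "):
                line = line[len("export "):]
            if line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            if name.strip() == "GEMINI_API_KEY":
                value = value.strip().strip("'\"")
                if value:
                    return value
    return None
```

Calling find_api_key() from the project root returns the key from whichever location supplies it first, so an exported variable overrides any .env file.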
To use Vertex AI instead:
# Enable Vertex AI
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1 # Optional, defaults to us-central1
Or in .env file:
GEMINI_USE_VERTEX=true
VERTEX_PROJECT_ID=your-gcp-project-id
VERTEX_LOCATION=us-central1
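One way the settings above could be turned into client arguments is sketched below. The helper name client_kwargs is hypothetical, and treating only the value "true" (case-insensitive) as enabling Vertex AI is an assumption about how the scripts interpret GEMINI_USE_VERTEX. The resulting dictionary uses the vertexai, project, location, and api_key keyword arguments of genai.Client from the google-genai package.

```python
import os
from typing import Mapping, Optional


def client_kwargs(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for genai.Client from environment settings."""
    env = os.environ if environ is None else environ

    # Assumption: only "true" (any case) switches to Vertex AI.
    use_vertex = env.get("GEMINI_USE_VERTEX", "").strip().lower() == "true"

    if use_vertex:
        project = env.get("VERTEX_PROJECT_ID")
        if not project:
            raise ValueError("VERTEX_PROJECT_ID is required when GEMINI_USE_VERTEX=true")
        return {
            "vertexai": True,
            "project": project,
            # The location is optional and defaults to us-central1.
            "location": env.get("VERTEX_LOCATION", "us-central1"),
        }

    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        raise ValueError("GEMINI_API_KEY is not set")
    return {"api_key": api_key}


# Usage (requires `pip install google-genai`):
#   from google import genai
#   client = genai.Client(**client_kwargs())
```

Keeping the environment parsing separate from client construction makes the endpoint selection easy to check without network access.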
Security: Never commit API keys to version control. Add .env to .gitignore.
# Analyze a local image
python scripts/analyze-image.py path/to/image.jpg "What's in this image?"
# Analyze from URL
python scripts/analyze-image.py https://example.com/image.jpg "Describe this"
# Specify model
python scripts/analyze-image.py image.jpg "Caption this" --model gemini-2.5-pro
python scripts/analyze-image.py image.jpg "Detect all objects" --model gemini-2.0-flash
# Compare multiple images
python scripts/analyze-image.py img1.jpg img2.jpg "What's different between these?"
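For readers who want to call the API directly rather than through the scripts, the sketch below shows roughly what a single-image request could look like with the google-genai SDK. It is not the source of scripts/analyze-image.py. The function names guess_image_mime and analyze_image are hypothetical, and the default model is simply one of the names used in the commands above.

```python
import mimetypes
import os
from pathlib import Path


def guess_image_mime(path: str) -> str:
    """Guess an image MIME type from the file extension."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    return mime


def analyze_image(path: str, prompt: str, model: str = "gemini-2.0-flash") -> str:
    """Send one local image plus a text prompt and return the text reply."""
    # Imported lazily so guess_image_mime works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    image_part = types.Part.from_bytes(
        data=Path(path).read_bytes(),
        mime_type=guess_image_mime(path),
    )
    response = client.models.generate_content(
        model=model,
        contents=[image_part, prompt],
    )
    # response.text can be None when no text is returned.
    return response.text or ""
```

With GEMINI_API_KEY exported, analyze_image("path/to/image.jpg", "What's in this image?") would mirror the first command above. Passing several image parts in contents is the natural extension for multi-image comparison.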
# Upload file
python scripts/upload-file.py path/to/large-image.jpg
# Use uploaded file
python scripts/analyze-image.py file://file-id "Caption this"
# List uploaded files
python scripts/manage-files.py list
# Get file info
python scripts/manage-files.py get file-id
# Delete file
python scripts/manage-files.py delete file-id
Images consume tokens based on size:
Token Formula:
crop_unit = floor(min(width, height) / 1.5)
tiles = ceil(width / crop_unit) × ceil(height / crop_unit)
total_tokens = tiles × 258
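Example: a 960×540 image has crop_unit = floor(540 / 1.5) = 360, so it is split into 3 × 2 = 6 tiles (each dimension rounded up), which costs 6 × 258 = 1,548 tokens.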
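The formula can be checked with a short Python sketch. The function name estimate_image_tokens is hypothetical, and two points are assumptions rather than statements from this guide. First, images with both dimensions at or below 384 pixels are assumed to cost a flat 258 tokens, as in Google's published token rules. Second, crop_unit is clamped to at least 1 to avoid division by zero on extremely narrow images. Treat the result as an estimate.

```python
import math

TOKENS_PER_TILE = 258
SMALL_IMAGE_MAX = 384  # assumed threshold for the single-tile case


def estimate_image_tokens(width: int, height: int) -> int:
    """Estimate the token cost of an image from its pixel dimensions."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    # Assumption: small images are billed as a single tile.
    if width <= SMALL_IMAGE_MAX and height <= SMALL_IMAGE_MAX:
        return TOKENS_PER_TILE

    # crop_unit = floor(min(width, height) / 1.5), clamped to avoid zero.
    crop_unit = max(1, math.floor(min(width, height) / 1.5))

    # Each dimension is divided by crop_unit and rounded up to whole tiles.
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * TOKENS_PER_TILE


print(estimate_image_tokens(960, 540))  # 1548, matching the example above
```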
Limits vary by tier (Free, Tier 1, 2, 3):
Common errors:
See the references/ directory for:
When implementing Gemini vision features:
All scripts support the 3-step API key lookup:
Run any script with --help for detailed usage instructions.
Official Documentation: https://ai.google.dev/gemini-api/docs/image-understanding