Process and generate multimedia content with Google Gemini when advanced multimodal processing beyond basic file reading is needed — Gemini-powered analysis, generation, or transformation of audio, images, video, and documents. Use when working with screenshots requiring OCR, PDF extraction, audio transcription, visual analysis, structured extraction, image generation, video generation, media pre-processing, or SDK-level multimodal integrations. Not needed when Claude can handle the task natively (e.g., viewing an image or reading a short PDF). Includes bundled CLI scripts for analyze/transcribe/extract/generate/generate-video plus deeper references for advanced features like TTS, image editing, YouTube analysis, structured JSON schemas, and live music generation.
Process audio, images, videos, documents, and generated assets using Google Gemini's multimodal API.
export GEMINI_API_KEY="your-key" # Get from https://aistudio.google.com/apikey
pip install -r scripts/requirements.txt
Optional for media optimization workflows:
brew install ffmpeg # or your platform equivalent
For high-volume usage or when hitting rate limits, configure multiple API keys:
# Primary key (required)
export GEMINI_API_KEY="key1"
# Additional keys for rotation (optional)
export GEMINI_API_KEY_2="key2"
export GEMINI_API_KEY_3="key3"
Or in your .env file:
GEMINI_API_KEY=key1
GEMINI_API_KEY_2=key2
GEMINI_API_KEY_3=key3
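The rotation logic inside the bundled scripts is not shown here, so the following is only a minimal sketch of how the numbered variables above could be collected and cycled. The helper names `collect_gemini_keys` and `key_cycle` are hypothetical, and the sketch assumes additional keys are numbered contiguously starting at 2.

```python
import itertools
import os


def collect_gemini_keys(env=None):
    """Return the primary key followed by any numbered keys, in order.

    Hypothetical helper: assumes GEMINI_API_KEY_2, _3, ... are contiguous.
    """
    env = os.environ if env is None else env
    keys = []
    primary = env.get("GEMINI_API_KEY")
    if primary:
        keys.append(primary)
    index = 2
    while True:
        extra = env.get(f"GEMINI_API_KEY_{index}")
        if not extra:
            break  # stop at the first missing number
        keys.append(extra)
        index += 1
    return keys


def key_cycle(keys):
    """Return an endless round-robin iterator over the configured keys."""
    if not keys:
        raise ValueError("GEMINI_API_KEY is not set")
    return itertools.cycle(keys)
```

A caller could take `next(cycle)` whenever a request is rate limited and retry with the new key; whether the bundled CLI behaves this way is an assumption.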
Features:
- --verbose flag

Verify setup: python scripts/check_setup.py
Analyze media: python scripts/gemini_batch_process.py --files <file> --task <analyze|transcribe|extract>
Generate content: python scripts/gemini_batch_process.py --task <generate|generate-video> --prompt "description"
Convert docs to Markdown: python scripts/document_converter.py --input <file>
Preflight media: python scripts/media_optimizer.py --input <file> --output <optimized-file>
Stdin support: The standalone CLI supports piping supported binary files through stdin (PNG/JPG/PDF/WAV/MP3/WEBP).
cat image.png | python scripts/gemini_batch_process.py --task analyze --prompt "Describe this"
python scripts/gemini_batch_process.py --files image.png --task analyze # traditional
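Piped input carries no filename, so a stdin-aware tool has to infer the media type from the bytes themselves. How the bundled CLI does this is not documented here; the sketch below shows one common approach, checking well-known magic numbers for the six listed formats. The function name `sniff_mime_type` is hypothetical.

```python
from typing import Optional


def sniff_mime_type(data: bytes) -> Optional[str]:
    """Guess a MIME type from leading magic bytes; return None if unknown."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    # WAV and WEBP share the RIFF container; bytes 8-11 name the payload.
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "audio/wav"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    # MP3 starts with an ID3 tag or an MPEG frame sync (11 set bits).
    if data.startswith(b"ID3"):
        return "audio/mpeg"
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "audio/mpeg"
    return None
```

In a script, the input would typically come from `sys.stdin.buffer.read()`, which returns raw bytes rather than decoded text.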
- imagen-4.0-generate-001 (standard), imagen-4.0-ultra-generate-001 (quality), imagen-4.0-fast-generate-001 (speed)
- veo-3.1-generate-preview (8s clips with audio)
- gemini-2.5-flash (recommended), gemini-2.5-pro (advanced)

- gemini_batch_process.py: CLI orchestrator for transcribe|analyze|extract|generate|generate-video that auto-resolves API keys, picks sensible default models per task, streams files inline vs File API, and saves structured outputs (text/JSON/CSV/markdown plus generated assets) for Imagen 4 + Veo workflows.
- media_optimizer.py: ffmpeg/Pillow-based preflight tool that compresses/resizes/converts audio, image, and video inputs, enforces target sizes/bitrates, splits long clips into hour chunks, and batch-processes directories so media stays within Gemini limits.
- document_converter.py: Gemini-powered converter that uploads PDFs/images/Office docs, applies a markdown-preserving prompt, batches multiple files, auto-names outputs under docs/assets, and exposes CLI flags for model, prompt, auto-file naming, and verbose logging.
- check_setup.py: Interactive readiness checker that verifies directory layout, centralized env resolver, required Python deps, and GEMINI_API_KEY availability/format, then performs a live Gemini API call and prints remediation instructions if anything fails.

Use --help for options.
Keep these surfaces distinct:
- gemini_batch_process.py
- document_converter.py
- media_optimizer.py
- check_setup.py

Do not assume every capability documented in references/ is already exposed by the CLI.
Load for detailed guidance:
| Topic | File | Description |
|---|---|---|
| Music | references/music-generation.md | Lyria RealTime API for background music generation, style prompts, real-time control, integration with video production. |
| Audio | references/audio-processing.md | Audio formats and limits, transcription (timestamps, speakers, segments), non-speech analysis, File API vs inline input, TTS models, best practices, cost and token math, and concrete meeting/podcast/interview recipes. |
| Images | references/vision-understanding.md | Vision capabilities overview, supported formats and models, captioning/classification/VQA, detection and segmentation, OCR and document reading, multi-image workflows, structured JSON output, token costs, best practices, and common product/screenshot/chart/scene use cases. |
| Image Gen | references/image-generation.md | Imagen 4 and Gemini image model overview, generate_images vs generate_content APIs, aspect ratios and costs, text/image/both modalities, editing and composition, style and quality control, safety settings, best practices, troubleshooting, and common marketing/concept-art/UI scenarios. |
| Video | references/video-analysis.md | Video analysis capabilities and supported formats, model/context choices, local/inline/YouTube inputs, clipping and FPS control, multi-video comparison, temporal Q&A and scene detection, transcription with visual context, token and cost guidance, and optimization/best-practice patterns. |
| Video Gen | references/video-generation.md | Veo model matrix, text-to-video and image-to-video quick start, multi-reference and extension flows, camera and timing control, configuration (resolution, aspect, audio, safety), prompt design patterns, performance tips, limitations, troubleshooting, and cost estimates. |
Formats: Audio (WAV/MP3/AAC, 9.5h), Images (PNG/JPEG/WEBP, 3.6k), Video (MP4/MOV, 6h), PDF (1k pages)
Size: 20MB inline, 2GB File API
Important:
- --num-images is meaningful on the Imagen path; do not assume the same behavior for every Gemini image model.
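The size limits listed above (20MB inline, 2GB File API) suggest a simple routing rule by file size. The sketch below is illustrative only: `choose_upload_path` and its return labels are hypothetical, MB and GB are assumed to be binary units, and the inline limit may in practice cover the whole request rather than the file alone, so a real threshold may need headroom.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024          # 20MB inline (assumed binary MB)
FILE_API_LIMIT_BYTES = 2 * 1024 * 1024 * 1024  # 2GB File API (assumed binary GB)


def choose_upload_path(size_bytes: int) -> str:
    """Pick a transport for a file of the given size.

    Returns "inline", "file_api", or "too_large" (hypothetical labels).
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= INLINE_LIMIT_BYTES:
        return "inline"
    if size_bytes <= FILE_API_LIMIT_BYTES:
        return "file_api"
    return "too_large"
```

For a file on disk, `os.path.getsize(path)` supplies the size. A file reported as "too_large" would need to be compressed or split first, for example with media_optimizer.py.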
Transcription Output Requirements:
[HH:MM:SS -> HH:MM:SS] transcript content
[HH:MM:SS -> HH:MM:SS] transcript content
...
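When transcript segments arrive as start and end offsets in seconds, they can be rendered into the required line format with a small helper. This is a sketch under that assumption; `format_timestamp` and `format_segment` are hypothetical names, and fractional seconds are truncated rather than rounded.

```python
def format_timestamp(seconds: float) -> str:
    """Render a non-negative offset in seconds as HH:MM:SS (truncating)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segment(start: float, end: float, text: str) -> str:
    """Render one segment as '[HH:MM:SS -> HH:MM:SS] transcript content'."""
    return f"[{format_timestamp(start)} -> {format_timestamp(end)}] {text.strip()}"
```

For example, `format_segment(0, 65.9, " Hello ")` yields `[00:00:00 -> 00:01:05] Hello`.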
During pulse:executing, invoke this skill when beads require multimodal asset processing, media analysis, or content generation. Feed outputs back into the bead's canonical verification evidence at history/<feature>/verification/<bead-id>.md.
Write/Edit tools are available for extending or creating processing scripts, not for modifying reference documentation.