Skill File

AI Multimodal

Name: AI Multimodal
Author: quanpersie2001

Process and generate multimedia content with Google Gemini when advanced multimodal processing beyond basic file reading is needed — Gemini-powered analysis, generation, or transformation of audio, images, video, and documents. Use when working with screenshots requiring OCR, PDF extraction, audio transcription, visual analysis, structured extraction, image generation, video generation, media pre-processing, or SDK-level multimodal integrations. Not needed when Claude can handle the task natively (e.g., viewing an image or reading a short PDF). Includes bundled CLI scripts for analyze/transcribe/extract/generate/generate-video plus deeper references for advanced features like TTS, image editing, YouTube analysis, structured JSON schemas, and live music generation.

quanpersie2001 | 0 stars | 2026-04-16

Occupation
Category: Media

Skill Content

Process audio, images, videos, documents, and generated assets using Google Gemini's multimodal API.

Setup

export GEMINI_API_KEY="your-key"  # Get from https://aistudio.google.com/apikey
pip install -r scripts/requirements.txt

Optional for media optimization workflows:

brew install ffmpeg  # or your platform equivalent

API Key Rotation (Optional)

For high-volume usage or when hitting rate limits, configure multiple API keys:

# Primary key (required)
export GEMINI_API_KEY="key1"

# Additional keys for rotation (optional)
export GEMINI_API_KEY_2="key2"
export GEMINI_API_KEY_3="key3"

Or in your .env file:

GEMINI_API_KEY=key1
GEMINI_API_KEY_2=key2
GEMINI_API_KEY_3=key3
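
The rotation setup above can be pictured as collecting every configured key and cycling through them. The following is a minimal sketch, not the bundled scripts' actual logic (which is not shown here): it assumes the numbered suffixes are contiguous (`_2`, `_3`, ...) and that simple round-robin is acceptable. The names `load_gemini_keys` and `key_cycle` are hypothetical.

```python
import itertools
import os


def load_gemini_keys(env=None):
    """Collect GEMINI_API_KEY, GEMINI_API_KEY_2, GEMINI_API_KEY_3, ... in order."""
    env = os.environ if env is None else env
    primary = env.get("GEMINI_API_KEY")
    if not primary:
        raise RuntimeError("GEMINI_API_KEY is required")
    keys = [primary]
    # Assumption: numbered keys are contiguous, so stop at the first gap.
    for n in itertools.count(2):
        extra = env.get(f"GEMINI_API_KEY_{n}")
        if not extra:
            break
        keys.append(extra)
    return keys


def key_cycle(keys):
    """Return an endless round-robin iterator over the given keys."""
    return itertools.cycle(keys)
```

A caller might take `next(cycle)` after a rate-limit error and retry the request with the next key.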

Topic	File	Description
Music	`references/music-generation.md`	Lyria RealTime API for background music generation, style prompts, real-time control, integration with video production.
Audio	`references/audio-processing.md`	Audio formats and limits, transcription (timestamps, speakers, segments), non-speech analysis, File API vs inline input, TTS models, best practices, cost and token math, and concrete meeting/podcast/interview recipes.
Images	`references/vision-understanding.md`	Vision capabilities overview, supported formats and models, captioning/classification/VQA, detection and segmentation, OCR and document reading, multi-image workflows, structured JSON output, token costs, best practices, and common product/screenshot/chart/scene use cases.
Image Gen	`references/image-generation.md`	Imagen 4 and Gemini image model overview, generate_images vs generate_content APIs, aspect ratios and costs, text/image/both modalities, editing and composition, style and quality control, safety settings, best practices, troubleshooting, and common marketing/concept-art/UI scenarios.
Video	`references/video-analysis.md`	Video analysis capabilities and supported formats, model/context choices, local/inline/YouTube inputs, clipping and FPS control, multi-video comparison, temporal Q&A and scene detection, transcription with visual context, token and cost guidance, and optimization/best-practice patterns.
Video Gen	`references/video-generation.md`	Veo model matrix, text-to-video and image-to-video quick start, multi-reference and extension flows, camera and timing control, configuration (resolution, aspect, audio, safety), prompt design patterns, performance tips, limitations, troubleshooting, and cost estimates.

When transcribing audio longer than 15 minutes, the transcript is often truncated by the output token limit of the Gemini API response. To get the full transcript, split the audio into chunks of at most 15 minutes each, transcribe every chunk, and combine the results.
When transcribing video longer than 15 minutes, use ffmpeg to extract the audio from the video, split the audio into chunks of at most 15 minutes each, transcribe all audio segments, and then combine the transcripts into a single transcript.
Image and video generation require billing; free tier has zero quota for those models.
The current video generation CLI uses the first reference image as the opening frame and the second as the closing frame.
--num-images is meaningful on the Imagen path; do not assume the same behavior for every Gemini image model.

Transcription Output Requirements:
Format: Markdown
Metadata: Duration, file size, generated date, description, file name, topics covered, etc.
Parts: from-to (e.g., 00:00-00:15), audio chunk name, transcript, status, etc.
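
The 15-minute chunking described in the notes above can be planned before any audio is cut. This is a rough sketch under stated assumptions: the total duration is already known in seconds (for example from ffprobe), and `plan_chunks` and `ffmpeg_chunk_command` are hypothetical helper names, not part of the bundled scripts. The command is only built, not executed.

```python
import math

CHUNK_SECONDS = 15 * 60  # 15-minute ceiling per chunk, from the notes above


def plan_chunks(duration_seconds, chunk_seconds=CHUNK_SECONDS):
    """Return (start, length) pairs in seconds that cover the whole recording."""
    if duration_seconds <= 0:
        return []
    count = math.ceil(duration_seconds / chunk_seconds)
    return [
        (i * chunk_seconds, min(chunk_seconds, duration_seconds - i * chunk_seconds))
        for i in range(count)
    ]


def ffmpeg_chunk_command(source, start, length, output):
    """Build (but do not run) an ffmpeg command that extracts one audio-only chunk."""
    return [
        "ffmpeg", "-y",
        "-ss", str(start),  # seek to the chunk start
        "-t", str(length),  # read only this many seconds
        "-i", source,
        "-vn",              # drop video so video inputs yield audio only
        output,
    ]
```

Each command could then be passed to `subprocess.run(cmd, check=True)`, and the resulting chunk files transcribed one at a time.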

Transcript format:

[HH:MM:SS -> HH:MM:SS] transcript content
[HH:MM:SS -> HH:MM:SS] transcript content
...
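
When transcripts from several chunks are combined, timestamps that are relative to each chunk need shifting by the chunk's start time so the merged file uses one timeline. Below is a small sketch of producing lines in the format above; `format_timestamp` and `format_segment` are hypothetical helpers, and it assumes segment times arrive as seconds relative to the chunk.

```python
def format_timestamp(total_seconds):
    """Render a second count as HH:MM:SS (fractions of a second are dropped)."""
    total = int(total_seconds)
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def format_segment(start, end, text, chunk_offset=0):
    """Render one transcript line, shifting chunk-relative times by chunk_offset."""
    begin = format_timestamp(start + chunk_offset)
    finish = format_timestamp(end + chunk_offset)
    return f"[{begin} -> {finish}] {text.strip()}"
```

For a segment spanning seconds 5 to 12 in the second 15-minute chunk, `format_segment(5, 12, "hello", chunk_offset=900)` yields `[00:15:05 -> 00:15:12] hello`.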
