Use the Gemini API (Nano Banana image generation, Veo video generation, Gemini TTS speech generation, and audio understanding) to deliver end-to-end multimodal media workflows and code templates for "generation + understanding".
This Skill consolidates six Gemini API capabilities into reusable workflows and implementation templates: image generation and editing (Nano Banana), image understanding, video generation (Veo), video understanding, speech generation (TTS), and audio understanding.
Convention: this Skill follows the official Google Gen AI SDK (Node.js/REST) as its primary reference; currently only Node.js and REST examples are provided. If your project already wraps another language or framework, map this Skill's request structure, model selection, and I/O spec onto your wrapper layer.
npm install @google/genai
REST: curl; if you need to parse image Base64, install jq (optional).
Auth: set GEMINI_API_KEY and send it as the x-goog-api-key: $GEMINI_API_KEY header.
Two ways to pass media input:
- Inline (embedded bytes/Base64)
- Files API (upload then reference): files.upload(...) (SDK) or POST /upload/v1beta/files (REST resumable), then reference via file_data / file_uri in generateContent.
Engineering suggestion: implement ensure_file_uri() so that when a file exceeds a threshold (for example a 10-15 MB warning) or is reused, you automatically route through the Files API; see the sketch below.
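A minimal sketch of that helper, in the same Node.js style as the templates below; the function name, threshold constant, and mime-type handling are illustrative, not part of the SDK:
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const INLINE_LIMIT_BYTES = 10 * 1024 * 1024; // illustrative 10 MB threshold
// Return either an inline Base64 part or a Files API reference part.
async function ensureFileUri(path, mimeType) {
  if (fs.statSync(path).size < INLINE_LIMIT_BYTES) {
    return { inlineData: { mimeType, data: fs.readFileSync(path).toString("base64") } };
  }
  // Large (or reused) files: upload once, then reference by URI.
  const uploaded = await ai.files.upload({ file: path });
  return { fileData: { mimeType: uploaded.mimeType ?? mimeType, fileUri: uploaded.uri } };
}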
Output handling:
- Images come back as inline_data (Base64) in response parts; in the SDK use part.as_image() or decode the Base64 and save as PNG/JPG.
- TTS audio comes back as raw PCM: save as .pcm or wrap it into .wav (commonly 24 kHz, 16-bit, mono).
Important: model names, versions, limits, and quotas can change over time. Verify against official docs before use. Last updated: 2026-01-22.
Default model choices:
- gemini-2.5-flash-image (Nano Banana) for image generation and editing.
- gemini-3-flash-preview for image, video, and audio understanding (choose stronger models as needed for quality/cost).
- veo-3.1-generate-preview (generates 8-second videos and can natively generate audio).
- gemini-2.5-flash-preview-tts (native TTS, currently in preview).
Image generation (Nano Banana): SDK (Node.js) minimal template
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
model: "gemini-2.5-flash-image",
contents:
"Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme",
});
const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
if (part.text) console.log(part.text);
if (part.inlineData?.data) {
fs.writeFileSync("out.png", Buffer.from(part.inlineData.data, "base64"));
}
}
REST (with imageConfig) minimal template
curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" -H "x-goog-api-key: $GEMINI_API_KEY" -H "Content-Type: application/json" -d '{
"contents":[{"parts":[{"text":"Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme"}]}],
"generationConfig": {"imageConfig": {"aspectRatio":"16:9"}}
}'
REST image parsing (Base64 decode)
curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-H "Content-Type: application/json" \
-d '{"contents":[{"parts":[{"text":"A minimal studio product shot of a nano banana"}]}]}' \
| jq -r '.candidates[0].content.parts[] | select(.inlineData) | .inlineData.data' \
| base64 --decode > out.png
# macOS can use: base64 -D > out.png
Use case: given an input image, add/remove/modify elements, change the style, adjust color grading, and so on.
SDK (Node.js) minimal template
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const prompt =
"Add a nano banana on the table, keep lighting consistent, cinematic tone.";
const imageBase64 = fs.readFileSync("input.png").toString("base64");
const response = await ai.models.generateContent({
model: "gemini-2.5-flash-image",
contents: [
{ text: prompt },
{ inlineData: { mimeType: "image/png", data: imageBase64 } },
],
});
const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
if (part.inlineData?.data) {
fs.writeFileSync("edited.png", Buffer.from(part.inlineData.data, "base64"));
}
}
Best practice: use chat for continuous iteration (for example: generate first, then "only edit a specific region/element", then "make variants in the same style").
To output mixed "text + image" results, set responseModalities to ["TEXT", "IMAGE"] (in generationConfig for REST, or config in the SDK).
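A minimal sketch of that chat-based iteration, assuming the SDK's ai.chats.create / chat.sendMessage interface; the prompts are illustrative:
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const chat = ai.chats.create({
  model: "gemini-2.5-flash-image",
  config: { responseModalities: ["TEXT", "IMAGE"] },
});
// Turn 1: generate a base image.
await chat.sendMessage({ message: "A nano banana dessert on a marble table, soft light." });
// Turn 2: edit one element only; the chat history keeps the rest consistent.
const second = await chat.sendMessage({ message: "Keep the composition, but make the plate ceramic blue." });
for (const part of second.candidates?.[0]?.content?.parts ?? []) {
  if (part.inlineData?.data) fs.writeFileSync("variant.png", Buffer.from(part.inlineData.data, "base64"));
}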
You can set these in generationConfig.imageConfig (REST) or the SDK config:
- aspectRatio: e.g. 16:9, 1:1.
- imageSize: e.g. 2K, 4K (higher resolution is usually slower/more expensive, and model support can vary).
Image understanding: SDK (Node.js) minimal template (inline bytes)
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const imageBase64 = fs.readFileSync("image.jpg").toString("base64");
const response = await ai.models.generateContent({
model: "gemini-3-flash-preview",
contents: [
{ inlineData: { mimeType: "image/jpeg", data: imageBase64 } },
{ text: "Caption this image, and list any visible brands." },
],
});
console.log(response.text);
Files API variant (upload then reference)
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "image.jpg" });
const response = await ai.models.generateContent({
model: "gemini-3-flash-preview",
contents: createUserContent([
createPartFromUri(uploaded.uri, uploaded.mimeType),
"Caption this image.",
]),
});
console.log(response.text);
Append multiple images as multiple Part entries in the same contents; you can mix uploaded references and inline bytes.
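For example, a sketch that mixes one uploaded reference with one inline image in a single request (file names and prompt are illustrative):
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "reference.jpg" }); // larger or reused image via Files API
const inlineBase64 = fs.readFileSync("candidate.png").toString("base64"); // small image inline
const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    createPartFromUri(uploaded.uri, uploaded.mimeType),
    { inlineData: { mimeType: "image/png", data: inlineBase64 } },
    "Compare these two images and describe the differences.",
  ]),
});
console.log(response.text);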
Video generation (Veo): SDK (Node.js) minimal template
import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const prompt =
"A cinematic shot of a cat astronaut walking on the moon. Include subtle wind ambience.";
let operation = await ai.models.generateVideos({
model: "veo-3.1-generate-preview",
prompt,
config: { resolution: "1080p" },
});
while (!operation.done) {
await new Promise((resolve) => setTimeout(resolve, 10_000));
operation = await ai.operations.getVideosOperation({ operation });
}
const video = operation.response?.generatedVideos?.[0]?.video;
if (!video) throw new Error("No video returned");
await ai.files.download({ file: video, downloadPath: "out.mp4" });
Key point: Veo REST uses :predictLongRunning to return an operation name, then poll GET /v1beta/{operation_name}; once done, download from the video URI in the response.
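A rough fetch-based sketch of that REST flow (Node 18+ global fetch; the instances-style request body follows the predictLongRunning pattern, and the exact field names inside the finished operation's response should be checked against the docs rather than taken from this sketch):
const BASE = "https://generativelanguage.googleapis.com/v1beta";
const headers = {
  "x-goog-api-key": process.env.GEMINI_API_KEY,
  "Content-Type": "application/json",
};
// Kick off generation; the response carries an operation name to poll.
const start = await fetch(`${BASE}/models/veo-3.1-generate-preview:predictLongRunning`, {
  method: "POST",
  headers,
  body: JSON.stringify({ instances: [{ prompt: "A cat astronaut walking on the moon" }] }),
});
const { name } = await start.json();
// Poll the operation until done, then inspect its response for the video URI.
let op = { done: false };
while (!op.done) {
  await new Promise((resolve) => setTimeout(resolve, 10_000));
  op = await (await fetch(`${BASE}/${name}`, { headers })).json();
}
console.log(JSON.stringify(op.response, null, 2)); // contains the generated video URI(s)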
- aspectRatio: "16:9" or "9:16"
- resolution: "720p" | "1080p" | "4k" (higher resolutions are usually slower/more expensive)
Polling fallback (with timeout/backoff) pseudocode
const deadline = Date.now() + 300_000; // 5 min
let sleepMs = 2000;
while (!operation.done && Date.now() < deadline) {
await new Promise((resolve) => setTimeout(resolve, sleepMs));
sleepMs = Math.min(Math.floor(sleepMs * 1.5), 15_000);
operation = await ai.operations.getVideosOperation({ operation });
}
if (!operation.done) throw new Error("video generation timed out");
Video understanding: SDK (Node.js) minimal template
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "sample.mp4" });
const response = await ai.models.generateContent({
model: "gemini-3-flash-preview",
contents: createUserContent([
createPartFromUri(uploaded.uri, uploaded.mimeType),
"Summarize this video. Provide timestamps for key events.",
]),
});
console.log(response.text);
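Uploaded video usually needs a short server-side processing step before it can be referenced; a small sketch of waiting for the file to become ACTIVE, assuming the Files API exposes the file's name and state via ai.files.get (the poll interval is illustrative):
import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
// Poll the Files API until the uploaded file finishes processing.
async function waitUntilActive(name) {
  let file = await ai.files.get({ name });
  while (file.state === "PROCESSING") {
    await new Promise((resolve) => setTimeout(resolve, 5_000));
    file = await ai.files.get({ name });
  }
  if (file.state !== "ACTIVE") throw new Error(`File not usable: ${file.state}`);
  return file;
}
// Usage: const uploaded = await ai.files.upload({ file: "sample.mp4" });
//        await waitUntilActive(uploaded.name);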
Speech generation (TTS): SDK (Node.js) minimal template
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
model: "gemini-2.5-flash-preview-tts",
contents: [{ parts: [{ text: "Say cheerfully: Have a wonderful day!" }] }],
config: {
responseModalities: ["AUDIO"],
speechConfig: {
voiceConfig: {
prebuiltVoiceConfig: { voiceName: "Kore" },
},
},
},
});
const data =
response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data ?? "";
if (!data) throw new Error("No audio returned");
fs.writeFileSync("out.pcm", Buffer.from(data, "base64"));
Requirements:
- For multi-speaker output, use multiSpeakerVoiceConfig (see the sketch below).
- voiceName supports 30 prebuilt voices (for example Zephyr, Puck, Charon, Kore).
- Provide controllable directions for style, pace, accent, etc., but avoid over-constraining the delivery.
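A sketch of a two-speaker request using multiSpeakerVoiceConfig; the speaker names, voices, and script are illustrative, so verify the exact config fields against the current docs:
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-tts",
  contents: [{ parts: [{ text: "TTS the following conversation:\nHost: Welcome back!\nGuest: Glad to be here." }] }],
  config: {
    responseModalities: ["AUDIO"],
    speechConfig: {
      multiSpeakerVoiceConfig: {
        speakerVoiceConfigs: [
          { speaker: "Host", voiceConfig: { prebuiltVoiceConfig: { voiceName: "Kore" } } },
          { speaker: "Guest", voiceConfig: { prebuiltVoiceConfig: { voiceName: "Puck" } } },
        ],
      },
    },
  },
});
const data = response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data ?? "";
if (data) fs.writeFileSync("dialog.pcm", Buffer.from(data, "base64"));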
Audio understanding: SDK (Node.js) minimal template
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "sample.mp3" });
const response = await ai.models.generateContent({
model: "gemini-3-flash-preview",
contents: createUserContent([
"Describe this audio clip.",
createPartFromUri(uploaded.uri, uploaded.mimeType),
]),
});
console.log(response.text);