Create an ad-ready product video from product images, with or without character/subject images. The workflow combines AI-powered image composition, scene understanding, and video generation. Video prompts should follow commercial shot language: visual hooks, product presence, hero shots, detail showcase, function expression, and dynamic visuals.
Input Requirements:
Process:
media_comprehension skill
Trigger Condition: No character/subject image provided
Process:
image_generator with detailed prompt:
Output: Character image ready for composition
Objective: Create a realistic advertisement scene combining product + character + environment
Key Requirements:
Process:
image_generator with composition directive:
{
"content": "Compose [character description] with [product description] in [environment setting].
Requirements:
- Only ONE character in the scene
- Realistic home environment (floor, walls, natural lighting, plants, furniture)
- Natural interaction between character and product
- Professional product photography style",
"info": {
"image_urls": ["product.jpg", "character.jpg"],
"size": "1328x1328",
"guidance_scale": 4.5-5.0,
"num_inference_steps": 30-35,
"watermark": false,
"output_path": "./composed_ad_image.png"
}
}
Output: High-quality composed advertisement image with environment
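The template above gives ranges for guidance_scale and num_inference_steps, and these have to be resolved to single numbers before the payload can be serialized as JSON. A minimal Python sketch of that step, assuming the tool accepts the same `content`/`info` keys shown in the template. The helper name `build_composition_request` and its default values (chosen from inside the stated ranges) are illustrative and not part of any tool API:

```python
import json


def build_composition_request(character, product, environment, image_urls,
                              guidance_scale=4.5, num_inference_steps=30):
    """Build a payload mirroring the composition template (hypothetical helper)."""
    # The template recommends 4.5-5.0 and 30-35; reject values outside them.
    if not 4.5 <= guidance_scale <= 5.0:
        raise ValueError("guidance_scale should be within 4.5-5.0")
    if not 30 <= num_inference_steps <= 35:
        raise ValueError("num_inference_steps should be within 30-35")

    content = "\n".join([
        f"Compose {character} with {product} in {environment}.",
        "Requirements:",
        "- Only ONE character in the scene",
        "- Realistic home environment (floor, walls, natural lighting, plants, furniture)",
        "- Natural interaction between character and product",
        "- Professional product photography style",
    ])
    return {
        "content": content,
        "info": {
            "image_urls": list(image_urls),
            "size": "1328x1328",
            "guidance_scale": guidance_scale,
            "num_inference_steps": num_inference_steps,
            "watermark": False,
            "output_path": "./composed_ad_image.png",
        },
    }


request = build_composition_request(
    "a calico cat", "a cat tower", "a cozy living room",
    ["cat_tower.jpg", "calico_cat.jpg"],
)
# json.dumps escapes the newlines in "content", so the result is valid JSON.
print(json.dumps(request, indent=2))
```

Building the prompt as a list of lines and serializing with `json.dumps` also sidesteps the raw line breaks inside the template's `content` string, which a strict JSON parser would reject.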
Objective: Transform static composition into dynamic advertisement video
Shot & visual language (required): Across the ~10s runtime, the motion and camera work should cover these elements where applicable (not necessarily every second, but the final cut should feel like a mini commercial, not a single static pan):
| Element | Meaning |
|---|---|
| Visual hooks (视觉因子) | Strong focal points, contrast, color, light, or composition that hold attention |
| Product presence (产品出现) | Clear establishment of the product in frame—viewer knows what is being advertised |
| Product / hero shots (产品镜头) | Dedicated beats where the product is the clear subject (center framing, readable silhouette) |
| Detail showcase (细节展示) | Close-ups or slow emphasis on materials, texture, craftsmanship, or key parts |
| Function / benefit expression (功能表达) | Motion that implies use, outcome, or core selling point (interaction, before/after feel, problem–solution rhythm) |
| Dynamic visuals (动态视觉) | Varied motion: camera (push, pan, subtle orbit), parallax, light shifts, or subject micro-movement—avoid one flat move for the whole clip |
When writing video_diffusion prompts, spell out which of the above appear in sequence (e.g. establish product → detail → function beat → dynamic wrap). If the source image is character-heavy, still reserve beats for product-first shots.
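One way to keep the beat sequence explicit is to assemble the prompt from an ordered list of (element, description) pairs. The sketch below is only an illustration: `build_video_prompt`, the element names, and the sample descriptions are assumptions made for this example, and the check for a product-first beat is one possible reading of the guidance above.

```python
# Element names follow the shot-language table; they are labels for this sketch only.
SHOT_ELEMENTS = (
    "visual hook", "product presence", "hero shot",
    "detail showcase", "function expression", "dynamic visuals",
)
# Beats in which the product, not the character, is the subject.
PRODUCT_FIRST = {"product presence", "hero shot"}


def build_video_prompt(beats, duration_s=10):
    """Turn ordered (element, description) pairs into a sequenced prompt string."""
    unknown = [element for element, _ in beats if element not in SHOT_ELEMENTS]
    if unknown:
        raise ValueError(f"unknown shot elements: {unknown}")
    if not any(element in PRODUCT_FIRST for element, _ in beats):
        raise ValueError("reserve at least one product-first beat")

    lines = [
        "Create dynamic advertisement video "
        f"(mini-commercial pacing, ~{duration_s}s), in sequence:"
    ]
    for index, (element, description) in enumerate(beats, start=1):
        lines.append(f"{index}. {element.capitalize()}: {description}")
    lines.append("Professional commercial quality")
    return "\n".join(lines)


prompt = build_video_prompt([
    ("product presence", "cat tower established in a bright living room"),
    ("detail showcase", "slow push-in on the rope texture"),
    ("function expression", "cat climbs to the top platform"),
    ("dynamic visuals", "subtle orbit with a warm light shift"),
])
print(prompt)
```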
Audio Handling Strategy:
video_diffusion:
{
"content": "Create dynamic advertisement video (mini-commercial pacing, ~10s):
- Visual hooks: strong focal points, light/color contrast where fitting
- Product presence: early establishment of the product in frame
- Product hero shots: beats where the product is clearly the subject
- Detail showcase: close-up or emphasis on texture/material/key parts
- Function expression: motion suggesting use, benefit, or core value
- Dynamic visuals: varied motion (camera push/pan/subtle orbit, parallax, light shifts, optional character micro-movements)
- Professional commercial quality",
"info": {
"image_url": "./composed_ad_image.png",
"resolution": "720p",
"duration": 10,
"fps": 24,
"output_dir": "./",
"sound": "off"
}
}
ffmpeg -i generated_video.mp4 -i user_audio.mp3 -t 10 \
-c:v copy -c:a aac -b:a 192k \
-map 0:v:0 -map 1:a:0 -shortest \
final_ad_video.mp4 -y
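User-supplied audio names can contain spaces (for example "Cat Republic.mp3"), which makes the shell form of this command easy to get wrong. A small sketch that builds the same arguments as a list, so no shell quoting is needed; `build_mux_command` is a hypothetical helper name, and the flags are the ones from the command above:

```python
import shlex


def build_mux_command(video, audio, output, duration_s=10):
    """Return the ffmpeg argument list for muxing user audio onto the video."""
    return [
        "ffmpeg",
        "-i", video,                      # input 0: generated (silent) video
        "-i", audio,                      # input 1: user-provided audio
        "-t", str(duration_s),            # cap output at the video duration
        "-c:v", "copy",                   # keep the video stream as-is
        "-c:a", "aac", "-b:a", "192k",    # encode audio to AAC at 192 kbit/s
        "-map", "0:v:0", "-map", "1:a:0", # video from input 0, audio from input 1
        "-shortest",                      # stop at the shortest stream
        "-y", output,                     # overwrite the output without prompting
    ]


cmd = build_mux_command("generated_video.mp4", "Cat Republic.mp3",
                        "final_ad_video.mp4")
# shlex.join shows a correctly quoted shell equivalent for logging.
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```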
video_diffusion with audio generation enabled:
{
"content": "Create dynamic advertisement video with suitable background music (mini-commercial pacing, ~10s):
- Visual hooks; product presence; hero product shots; detail showcase; function/benefit expression; dynamic visuals (varied camera and motion)
- AI-generated background music matching product mood
- Professional commercial quality",
"info": {
"image_url": "./composed_ad_image.png",
"resolution": "720p",
"duration": 10,
"fps": 24,
"output_dir": "./",
"sound": "on"
}
}
Output: Final advertisement video (10 seconds, 720p, with audio)
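The choice between the two audio paths can be reduced to one check: if an MP3 is present, generate the video with sound off and mux the user's audio afterwards; otherwise let video generation produce the music. A minimal sketch under that reading of the workflow, with `choose_audio_strategy` and its return shape being assumptions for illustration:

```python
import os


def choose_audio_strategy(filenames):
    """Pick the video_diffusion "sound" setting from the files available.

    Hypothetical helper: returns the sound flag and the user audio file, if any.
    """
    mp3s = sorted(name for name in filenames if name.lower().endswith(".mp3"))
    if mp3s:
        # User audio exists: render a silent video, then mux the audio in.
        return {"sound": "off", "user_audio": mp3s[0]}
    # No user audio: ask the video model to generate background music.
    return {"sound": "on", "user_audio": None}


print(choose_audio_strategy(os.listdir(".")))
```

If several MP3 files are present, this sketch simply takes the first one alphabetically; a real run would more likely ask the user which track to use.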
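media_comprehension
from PIL import Image

img = Image.open(path)
# JPEG cannot store alpha or palette data, so convert those modes to RGB first
if img.mode in ('RGBA', 'LA', 'P'):
    img = img.convert('RGB')
# quality=85 with optimize=True reduces the file size before analysis
img.save(output_path, 'JPEG', quality=85, optimize=True)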
ls *.mp3 to detect existing audio files
-t 10 flag to match video duration

| Issue | Solution |
|---|---|
| Multiple characters appear in composition | Add explicit constraint in prompt: "ONLY ONE [character], no other characters" |
| Plain white background | Specify environment details: "in a modern living room with wooden floor, beige walls, natural window light" |
| Image file too large | Compress before analysis using provided Python script |
| Audio sync issues | Pass the -shortest flag to FFmpeg so the output ends when the shortest stream ends |
| Video generation timeout | Spawn long-running generation as a background task instead of waiting on it inline |
This workflow is product-agnostic and can be applied to:
Input: cat_tower.jpg, calico_cat.jpg
→ Compose: Cat on tower in cozy living room
→ Video: 10s with gentle camera pan + user's "Cat Republic.mp3"
Output: final_ad_video.mp4
Input: modern_sofa.jpg
→ Generate: Lifestyle character reading on sofa
→ Compose: Character + sofa in bright apartment
→ Video: 10s with AI-generated ambient music
Output: final_ad_video.mp4
Input: wireless_earbuds.jpg
→ Generate: Hands holding earbuds
→ Compose: Hands + earbuds on minimalist desk
→ Video: 10s with AI-generated tech music
Output: final_ad_video.mp4
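The three examples differ only in which inputs are present, so the step sequence can be derived from the inputs. The sketch below is an illustrative planner, not part of the skill: `plan_workflow` and the step wording are assumptions, and the tool names are the ones used in this document.

```python
def plan_workflow(product_image, character_image=None, user_audio=None):
    """List the workflow steps implied by the available inputs (hypothetical helper)."""
    steps = [f"analyze {product_image} (media_comprehension)"]
    if character_image is None:
        # No character/subject image provided: generate one first.
        steps.append("generate character (image_generator)")
    steps.append("compose product + character + environment (image_generator)")
    if user_audio:
        steps.append("generate silent video (video_diffusion, sound off)")
        steps.append(f"mux {user_audio} (ffmpeg)")
    else:
        steps.append("generate video with AI music (video_diffusion, sound on)")
    return steps


# First example: both images and a user track are provided.
for step in plan_workflow("cat_tower.jpg", "calico_cat.jpg", "Cat Republic.mp3"):
    print(step)
```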
- media_comprehension skill for image analysis
- image_generator for composition and character generation
- video_diffusion for video creation

- product_*.jpg/png
- character_*.jpg/png or descriptive names
- *.mp3
- composed_ad_image.png
- final_ad_video.mp4 or [product_name]_ad.mp4

This workflow provides a systematic, generalizable approach to advertisement video creation that:
By following these guidelines, future users can efficiently create compelling advertisement videos for diverse product categories without overfitting to specific examples.