Ingest YouTube content into a vector database with embeddings for semantic search and RAG. Combines video download, transcription, chunking, embedding generation, and Supabase storage for intelligent content retrieval.
Store YouTube video content in a vector database with semantic embeddings for intelligent search and retrieval-augmented generation (RAG).
This skill provides end-to-end RAG storage for YouTube videos:
```bash
# Ingest a single video
python scripts/rag/ingest_video.py https://youtu.be/VIDEO_ID

# Search across all ingested content
python scripts/rag/semantic_search.py "RAG implementation best practices"

# Search within a specific video
python scripts/rag/semantic_search.py "vector database" --video-id VIDEO_ID
```
Complete workflow from URL to searchable content:
```
YouTube URL
    ↓
[Download + Transcribe]  (youtube-video-analysis)
    ↓
[Intelligent Chunking]   (dockling_chunker)
    ↓
[Generate Embeddings]    (OpenAI API)
    ↓
[Store in Supabase]      (db_client)
    ↓
Searchable Vector Database
```
Features:
OpenAI text-embedding-3-small specifications:
Batch processing:
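Chunks are embedded in batches rather than one request per chunk, since the embeddings endpoint accepts a list of inputs. A minimal batching helper (stdlib-only sketch; the batch size of 100 is illustrative, not the pipeline's actual setting) might look like:

```python
def batch(items, size=100):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Example: 250 chunk texts split into request-sized batches
texts = [f"chunk {i}" for i in range(250)]
sizes = [len(b) for b in batch(texts, 100)]
print(sizes)  # → [100, 100, 50]
```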
Vector similarity using cosine distance:
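For reference, cosine similarity between two embedding vectors (the quantity reported as `similarity` in search results) can be sketched in pure Python:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 3))  # → 0.707
```

In production the database computes this via pgvector's cosine-distance operator; this snippet only illustrates the metric.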
```python
# Search by natural language query
results = semantic_search(
    query="How to implement RAG with Claude",
    limit=10,
    min_similarity=0.7
)

# Filter by video or chunk type
results = semantic_search(
    query="Python code examples",
    video_id="dQw4w9WgXcQ",
    chunk_type="code",
    limit=5
)
```
Search capabilities:
Handles diverse content types:
Metadata preserved:
### Example 1: Ingest a Video

**User**: "Ingest this FastAPI tutorial into the knowledge base: https://youtu.be/example"

**Skill Actions**:

**Output**:
```
================================================================================
YouTube RAG Ingestion Pipeline
================================================================================

STEP 1: Download & Transcribe
[OK] Downloaded: FastAPI Complete Tutorial by TechWithTim
[OK] Duration: 45:30 (2730 seconds)
[OK] Transcribed: 18,500 characters

STEP 2: Intelligent Chunking
[OK] Created 48 chunks using Dockling
     - 32 transcript chunks
     - 12 code chunks
     - 4 heading chunks

STEP 3: Generate Embeddings
[OK] Generated 48 embeddings (1536 dimensions each)
     - Processed: 6,240 tokens
     - Cost: $0.00012

STEP 4: Store in Database
[OK] Inserted video metadata
[OK] Stored 48 chunks with embeddings
     - Video UUID: a1b2c3d4-e5f6-7890-abcd-ef1234567890

================================================================================
INGESTION COMPLETE
================================================================================
Video ID: dQw4w9WgXcQ
Chunks stored: 48
Total cost: $0.00012
Processing time: 2m 15s

Ready for semantic search!
```
### Example 2: Semantic Search

**User**: "Search my knowledge base for: implementing RAG with vector databases"

**Skill Actions**:

**Output**:
```
📺 Search Results for: "implementing RAG with vector databases"
================================================================================

1. Building Production RAG Systems by AI Jason (0.892 similarity)
   Timestamp: 12:45
   Type: transcript

   "When implementing RAG, you need three core components: a vector database
   like Supabase with pgvector, an embedding model like OpenAI's
   text-embedding-3-small, and a chunking strategy that preserves context..."

2. Vector Databases Explained by Coding with Cole (0.874 similarity)
   Timestamp: 08:20
   Type: code
```

```python
def semantic_search(query: str, limit: int = 10):
    # Generate query embedding
    embedding = openai.Embedding.create(
        model="text-embedding-3-small",
        input=query
    )
    # Search database with cosine similarity
    results = db.query(embedding, limit=limit)
    return results
```

```
[7 more results...]

Total results: 10
Search time: 87ms
```
### Example 3: Multi-Video Research
**User**: "Find all mentions of 'Claude API integration' across my entire knowledge base"
**Skill Actions**:
1. Search across all stored videos
2. Group results by video
3. Show relevant sections with timestamps
**Output**:
📺 Found mentions in 4 videos:
Video 1: "Claude API Tutorial" by Anthropic Docs (3 mentions)
Video 2: "Building AI Agents" by AI Engineer (2 mentions)
Video 3: "FastAPI + Claude" by Python Tutorial (2 mentions)
Video 4: "Production AI Apps" by Tech With Tim (1 mention)
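Step 2's grouping can be sketched as a small helper (hypothetical; it assumes each result dict carries a `title` key, as in the programmatic search example below):

```python
from collections import defaultdict

def group_results_by_video(results):
    """Group search hits by video title, preserving hit order within each video."""
    grouped = defaultdict(list)
    for hit in results:
        grouped[hit["title"]].append(hit)
    return dict(grouped)

hits = [
    {"title": "Claude API Tutorial", "similarity": 0.91},
    {"title": "Building AI Agents", "similarity": 0.88},
    {"title": "Claude API Tutorial", "similarity": 0.85},
]
by_video = group_results_by_video(hits)
print({title: len(v) for title, v in by_video.items()})
# → {'Claude API Tutorial': 2, 'Building AI Agents': 1}
```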
## Integration with Existing Systems
### YouTube Video Analysis Skill
```python
from youtube_video_analysis import download_video, extract_audio, transcribe_audio

# Reuse existing functionality
video_path, metadata = download_video(url, output_dir)
audio_path = extract_audio(video_path, output_dir)
transcript = transcribe_audio(audio_path, model_size='base')
```

### Dockling Chunker

```python
from dockling_chunker import chunk_transcript_with_dockling

# Intelligent structure-aware chunking
chunks = chunk_transcript_with_dockling(
    transcript=transcript,
    video_metadata=metadata,
    min_chunk_size=400,
    max_chunk_size=1000,
    overlap_tokens=50
)
```

### Supabase Database Client

```python
from db_client import SupabaseClient

client = SupabaseClient()
video_uuid = client.insert_video(video_data)
client.insert_chunks(video_uuid, chunks)
results = client.semantic_search(query_embedding, limit=10)
client.close()
```

### OpenAI Embeddings

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts
)
embeddings = [e.embedding for e in response.data]
```
```sql
-- Video metadata
CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    video_id TEXT UNIQUE NOT NULL,
    url TEXT NOT NULL,
    title TEXT NOT NULL,
    author TEXT,
    duration_seconds INTEGER,
    views BIGINT,
    description TEXT,
    transcript_full TEXT,
    visual_analysis JSONB,
    metadata JSONB,
    processed_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Transcript chunks with embeddings
CREATE TABLE transcript_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    video_id UUID REFERENCES youtube_videos(id) ON DELETE CASCADE,
    chunk_text TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    chunk_type TEXT NOT NULL,
    timestamp_start NUMERIC,
    timestamp_end NUMERIC,
    embedding vector(1536) NOT NULL,
    word_count INTEGER,
    char_count INTEGER,
    has_code BOOLEAN DEFAULT FALSE,
    has_diagram BOOLEAN DEFAULT FALSE,
    metadata JSONB,
    created_at TIMESTAMP DEFAULT NOW(),
    UNIQUE(video_id, chunk_index)
);

-- Vector similarity index (IVFFlat)
CREATE INDEX idx_chunks_embedding ON transcript_chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
```
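Against this schema, a similarity query can be issued from Python. The sketch below assumes a psycopg2 connection to the Supabase Postgres instance; `<=>` is pgvector's cosine-distance operator, and the literal-formatting helper is illustrative rather than part of the skill's actual code:

```python
def to_pgvector(values):
    """Render a Python list as a pgvector literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

# 1 - (cosine distance) converts pgvector's distance into the similarity
# score reported in search results; ORDER BY the distance uses the index.
SEARCH_SQL = """
SELECT chunk_text,
       1 - (embedding <=> %(q)s::vector) AS similarity
FROM transcript_chunks
ORDER BY embedding <=> %(q)s::vector
LIMIT %(limit)s
"""

# Usage (sketch, not run here):
# cur.execute(SEARCH_SQL, {"q": to_pgvector(query_embedding), "limit": 10})
```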
Ingestion Performance (1-hour video):
Search Performance:
Storage Requirements (per 1-hour video):
OpenAI Embeddings (per video):
1-hour video:
- Transcript: ~30K words ≈ 40K tokens
- Chunked into: ~500 chunks
- Embedding tokens: 500 chunks × 130 tokens avg = 65K tokens
- Cost: 65K × $0.020 / 1M = $0.0013
3-hour video:
- Embedding tokens: ~195K tokens
- Cost: $0.0039
Cost per video: ~$0.001 to $0.004 (essentially FREE!)
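The arithmetic above can be wrapped in a tiny estimator (assumes text-embedding-3-small's $0.02 per 1M tokens, as used in the figures above):

```python
PRICE_PER_MILLION_TOKENS = 0.02  # text-embedding-3-small

def embedding_cost(tokens: int) -> float:
    """Estimated USD cost of embedding `tokens` tokens."""
    return tokens * PRICE_PER_MILLION_TOKENS / 1_000_000

print(f"${embedding_cost(65_000):.4f}")   # 1-hour video → $0.0013
print(f"${embedding_cost(195_000):.4f}")  # 3-hour video → $0.0039
```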
Supabase Storage:
Free Tier: 500MB database storage
- ~147 videos at 3.4MB each
- Sufficient for prototyping and small-scale use
Pro Tier: $25/month for 8GB
- ~2,400 videos
- Production-ready scale
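The per-tier video counts follow from simple division (assuming ~3.4 MB of database storage per 1-hour video, per the figures above):

```python
def videos_per_tier(storage_mb: float, mb_per_video: float = 3.4) -> int:
    """How many videos fit in a given amount of database storage."""
    return int(storage_mb / mb_per_video)

print(videos_per_tier(500))    # Free tier (500 MB) → 147
print(videos_per_tier(8_000))  # Pro tier (8 GB) → 2352
```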
Total Cost (100 videos):
```bash
pip install openai psycopg2-binary pgvector python-dotenv
```

```bash
# .env file
OPENAI_API_KEY=sk-your-openai-api-key
SUPABASE_HOST=db.xxxxxxxxxxxx.supabase.co
SUPABASE_PASSWORD=your-supabase-password
```

```bash
python scripts/db/test_connection.py
```
Single video:

```bash
python scripts/rag/ingest_video.py https://youtu.be/VIDEO_ID
```

Multiple videos:

```bash
# Create a list of URLs
echo "https://youtu.be/VIDEO1" >> videos.txt
echo "https://youtu.be/VIDEO2" >> videos.txt

# Batch ingest
while read -r url; do
    python scripts/rag/ingest_video.py "$url"
done < videos.txt
```
With custom settings:

```bash
python scripts/rag/ingest_video.py \
    https://youtu.be/VIDEO_ID \
    --output-dir ./data/youtube \
    --model-size small \
    --min-chunk-size 400 \
    --max-chunk-size 1000
```
Basic search:

```bash
python scripts/rag/semantic_search.py "RAG implementation with Claude"
```

Advanced search:

```bash
# Search with filters
python scripts/rag/semantic_search.py \
    "Python code examples" \
    --limit 20 \
    --min-similarity 0.7 \
    --chunk-type code

# Search within specific video
python scripts/rag/semantic_search.py \
    "vector database setup" \
    --video-id dQw4w9WgXcQ \
    --limit 5
```
Programmatic usage:

```python
from scripts.rag.semantic_search import search_youtube_content

results = search_youtube_content(
    query="How to implement RAG",
    limit=10,
    min_similarity=0.7,
    video_id=None  # Search all videos
)

for result in results:
    print(f"Video: {result['title']}")
    print(f"Similarity: {result['similarity']:.3f}")
    print(f"Text: {result['chunk_text'][:200]}...")
```
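Result rows carry timestamps in seconds; a small helper (hypothetical, not part of the shipped scripts) renders them as the `mm:ss` timestamps shown in the sample search output:

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as mm:ss (or h:mm:ss past one hour)."""
    total = int(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}" if h else f"{m:02d}:{s:02d}"

print(format_timestamp(765))   # → 12:45
print(format_timestamp(2730))  # → 45:30
```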
Tuning tips:

- `max_chunk_size`: for longer, detailed explanations
- `min_chunk_size`: for short, dense content
- `overlap_tokens`: to maintain context between chunks
- `min_similarity`: threshold for matches (0.7 is a good default)
- `chunk_type`: filter for code vs. text searches

**Issue**: "OpenAI API key not found"
```bash
# Solution: Set environment variable
export OPENAI_API_KEY=sk-your-api-key-here
# Or add to .env file
```

**Issue**: "Database connection failed"

```bash
# Solution: Check Supabase credentials
python scripts/db/test_connection.py
```

**Issue**: "Embedding dimension mismatch"

```bash
# Solution: Ensure using text-embedding-3-small (1536 dimensions)
# Check model name in generate_embeddings.py
```

**Issue**: "Out of memory during embedding generation"

```bash
# Solution: Reduce batch size
# In generate_embeddings.py, change batch_size from 2048 to 100
```

**Issue**: "Search returns no results"

```bash
# Solution: Lower similarity threshold
python scripts/rag/semantic_search.py "query" --min-similarity 0.5
```
**Version**: 1.0.0
**Created**: 2025-10-28
**Status**: Production Ready
**Task**: 044-6
**Dependencies**: youtube-video-analysis, dockling_chunker, db_client, openai