Ingest YouTube content into a vector database with embeddings for semantic search and RAG. Combines video download, transcription, chunking, embedding generation, and Supabase storage for intelligent content retrieval.
Store YouTube video content in a vector database with semantic embeddings for intelligent search and retrieval-augmented generation (RAG).
This skill provides end-to-end RAG storage for YouTube videos:
```bash
# Ingest a single video
python scripts/rag/ingest_video.py https://youtu.be/VIDEO_ID

# Search across all ingested content
python scripts/rag/semantic_search.py "RAG implementation best practices"

# Search within a specific video
python scripts/rag/semantic_search.py "vector database" --video-id VIDEO_ID
```
Complete workflow from URL to searchable content:
```
YouTube URL
    ↓
[Download + Transcribe]  (youtube-video-analysis)
    ↓
[Intelligent Chunking]   (dockling_chunker)
    ↓
[Generate Embeddings]    (OpenAI API)
    ↓
[Store in Supabase]      (db_client)
    ↓
Searchable Vector Database
```
Features:
OpenAI text-embedding-3-small specifications:
Batch processing:
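Chunks are embedded in batches rather than one request per chunk, since the embeddings endpoint accepts a list of inputs. A minimal batching helper (stdlib-only sketch; the batch size of 100 is illustrative, not the pipeline's actual setting) might look like:

```python
def batch(items, size=100):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Example: 250 chunk texts split into request-sized batches
texts = [f"chunk {i}" for i in range(250)]
sizes = [len(b) for b in batch(texts, 100)]
print(sizes)  # → [100, 100, 50]
```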
Vector similarity using cosine distance:
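For reference, cosine similarity between two embedding vectors (the quantity reported as `similarity` in search results) can be sketched in pure Python:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 3))  # → 0.707
```

In production the database computes this via pgvector's cosine-distance operator; this snippet only illustrates the metric.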
```python
# Search by natural language query
results = semantic_search(
    query="How to implement RAG with Claude",
    limit=10,
    min_similarity=0.7
)

# Filter by video or chunk type
results = semantic_search(
    query="Python code examples",
    video_id="dQw4w9WgXcQ",
    chunk_type="code",
    limit=5
)
```
Search capabilities:
Handles diverse content types:
Metadata preserved:
### Example 1: Ingest a Video

**User**: "Ingest this FastAPI tutorial into the knowledge base: https://youtu.be/example"

**Skill Actions**:

**Output**:
```
================================================================================
YouTube RAG Ingestion Pipeline
================================================================================

STEP 1: Download & Transcribe
[OK] Downloaded: FastAPI Complete Tutorial by TechWithTim
[OK] Duration: 45:30 (2730 seconds)
[OK] Transcribed: 18,500 characters

STEP 2: Intelligent Chunking
[OK] Created 48 chunks using Dockling
     - 32 transcript chunks
     - 12 code chunks
     - 4 heading chunks

STEP 3: Generate Embeddings
[OK] Generated 48 embeddings (1536 dimensions each)
     - Processed: 6,240 tokens
     - Cost: $0.00012

STEP 4: Store in Database
[OK] Inserted video metadata
[OK] Stored 48 chunks with embeddings
     - Video UUID: a1b2c3d4-e5f6-7890-abcd-ef1234567890

================================================================================
INGESTION COMPLETE
================================================================================
Video ID: dQw4w9WgXcQ
Chunks stored: 48
Total cost: $0.00012
Processing time: 2m 15s

Ready for semantic search!
```
### Example 2: Semantic Search

**User**: "Search my knowledge base for: implementing RAG with vector databases"

**Skill Actions**:

**Output**:
```
📺 Search Results for: "implementing RAG with vector databases"
================================================================================

1. Building Production RAG Systems by AI Jason (0.892 similarity)
   Timestamp: 12:45
   Type: transcript

   "When implementing RAG, you need three core components: a vector database
   like Supabase with pgvector, an embedding model like OpenAI's
   text-embedding-3-small, and a chunking strategy that preserves context..."

2. Vector Databases Explained by Coding with Cole (0.874 similarity)
   Timestamp: 08:20
   Type: code
```

```python
def semantic_search(query: str, limit: int = 10):
    # Generate query embedding
    embedding = openai.Embedding.create(
        model="text-embedding-3-small",
        input=query
    )
    # Search database with cosine similarity
    results = db.query(embedding, limit=limit)
    return results
```

```
[7 more results...]

Total results: 10
Search time: 87ms
```
### Example 3: Multi-Video Research
**User**: "Find all mentions of 'Claude API integration' across my entire knowledge base"
**Skill Actions**:
1. Search across all stored videos
2. Group results by video
3. Show relevant sections with timestamps
**Output**:
📺 Found mentions in 4 videos:
Video 1: "Claude API Tutorial" by Anthropic Docs (3 mentions)
Video 2: "Building AI Agents" by AI Engineer (2 mentions)
Video 3: "FastAPI + Claude" by Python Tutorial (2 mentions)
Video 4: "Production AI Apps" by Tech With Tim (1 mention)
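Step 2's grouping can be sketched as a small helper (hypothetical; it assumes each result dict carries a `title` key, as in the programmatic search example below):

```python
from collections import defaultdict

def group_results_by_video(results):
    """Group search hits by video title, preserving hit order within each video."""
    grouped = defaultdict(list)
    for hit in results:
        grouped[hit["title"]].append(hit)
    return dict(grouped)

hits = [
    {"title": "Claude API Tutorial", "similarity": 0.91},
    {"title": "Building AI Agents", "similarity": 0.88},
    {"title": "Claude API Tutorial", "similarity": 0.85},
]
by_video = group_results_by_video(hits)
print({title: len(v) for title, v in by_video.items()})
# → {'Claude API Tutorial': 2, 'Building AI Agents': 1}
```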
## Integration with Existing Systems
### YouTube Video Analysis Skill
```python
from youtube_video_analysis import download_video, extract_audio, transcribe_audio

# Reuse existing functionality
video_path, metadata = download_video(url, output_dir)
audio_path = extract_audio(video_path, output_dir)
transcript = transcribe_audio(audio_path, model_size='base')
```

### Dockling Chunker

```python
from dockling_chunker import chunk_transcript_with_dockling

# Intelligent structure-aware chunking
chunks = chunk_transcript_with_dockling(
    transcript=transcript,
    video_metadata=metadata,
    min_chunk_size=400,
    max_chunk_size=1000,
    overlap_tokens=50
)
```

### Supabase Database Client

```python
from db_client import SupabaseClient

client = SupabaseClient()
video_uuid = client.insert_video(video_data)
client.insert_chunks(video_uuid, chunks)
results = client.semantic_search(query_embedding, limit=10)
client.close()
```

### OpenAI Embeddings

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts
)
embeddings = [e.embedding for e in response.data]
```
```sql
-- Video metadata
CREATE TABLE youtube_videos (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    video_id TEXT UNIQUE NOT NULL,
    url TEXT NOT NULL,
    title TEXT NOT NULL,
    author TEXT,
    duration_seconds INTEGER,
    views BIGINT,
    description TEXT,
    transcript_full TEXT,
    visual_analysis JSONB,
    metadata JSONB,
    processed_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Transcript chunks with embeddings
CREATE TABLE transcript_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    video_id UUID REFERENCES youtube_videos(id) ON DELETE CASCADE,
    chunk_text TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    chunk_type TEXT NOT NULL,
    timestamp_start NUMERIC,
    timestamp_end NUMERIC,
    embedding vector(1536) NOT NULL,
    word_count INTEGER,
    char_count INTEGER,
    has_code BOOLEAN DEFAULT FALSE,
    has_diagram BOOLEAN DEFAULT FALSE,
    metadata JSONB,
    created_at TIMESTAMP DEFAULT NOW(),
    UNIQUE(video_id, chunk_index)
);

-- Vector similarity index (IVFFlat)
CREATE INDEX idx_chunks_embedding ON transcript_chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
```
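Against this schema, a similarity query can be issued from Python. The sketch below assumes a psycopg2 connection to the Supabase Postgres instance; `<=>` is pgvector's cosine-distance operator, and the literal-formatting helper is illustrative rather than part of the skill's actual code:

```python
def to_pgvector(values):
    """Render a Python list as a pgvector literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

# 1 - (cosine distance) converts pgvector's distance into the similarity
# score reported in search results; ORDER BY the distance uses the index.
SEARCH_SQL = """
SELECT chunk_text,
       1 - (embedding <=> %(q)s::vector) AS similarity
FROM transcript_chunks
ORDER BY embedding <=> %(q)s::vector
LIMIT %(limit)s
"""

# Usage (sketch, not run here):
# cur.execute(SEARCH_SQL, {"q": to_pgvector(query_embedding), "limit": 10})
```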
Ingestion Performance (1-hour video):
Search Performance:
Storage Requirements (per 1-hour video):
OpenAI Embeddings (per video):
1-hour video:
- Transcript: ~30K words ≈ 40K tokens
- Chunked into: ~500 chunks
- Embedding tokens: 500 chunks × 130 tokens avg = 65K tokens
- Cost: 65K × $0.020 / 1M = $0.0013
3-hour video:
- Embedding tokens: ~195K tokens
- Cost: $0.0039
Cost per video: ~$0.001 to $0.004 (essentially FREE!)
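The arithmetic above can be wrapped in a tiny estimator (assumes text-embedding-3-small's $0.02 per 1M tokens, as used in the figures above):

```python
PRICE_PER_MILLION_TOKENS = 0.02  # text-embedding-3-small

def embedding_cost(tokens: int) -> float:
    """Estimated USD cost of embedding `tokens` tokens."""
    return tokens * PRICE_PER_MILLION_TOKENS / 1_000_000

print(f"${embedding_cost(65_000):.4f}")   # 1-hour video → $0.0013
print(f"${embedding_cost(195_000):.4f}")  # 3-hour video → $0.0039
```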
Supabase Storage:
Free Tier: 500MB database storage
- ~147 videos at 3.4MB each
- Sufficient for prototyping and small-scale use
Pro Tier: $25/month for 8GB
- ~2,400 videos
- Production-ready scale
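The per-tier video counts follow from simple division (assuming ~3.4 MB of database storage per 1-hour video, per the figures above):

```python
def videos_per_tier(storage_mb: float, mb_per_video: float = 3.4) -> int:
    """How many videos fit in a given amount of database storage."""
    return int(storage_mb / mb_per_video)

print(videos_per_tier(500))    # Free tier (500 MB) → 147
print(videos_per_tier(8_000))  # Pro tier (8 GB) → 2352
```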
Total Cost (100 videos):
```bash
pip install openai psycopg2-binary pgvector python-dotenv
```

```bash
# .env file
OPENAI_API_KEY=sk-your-openai-api-key
SUPABASE_HOST=db.xxxxxxxxxxxx.supabase.co
SUPABASE_PASSWORD=your-supabase-password
```

```bash
python scripts/db/test_connection.py
```
Single video:

```bash
python scripts/rag/ingest_video.py https://youtu.be/VIDEO_ID
```

Multiple videos:

```bash
# Create a list of URLs
echo "https://youtu.be/VIDEO1" >> videos.txt
echo "https://youtu.be/VIDEO2" >> videos.txt

# Batch ingest
while read -r url; do
    python scripts/rag/ingest_video.py "$url"
done < videos.txt
```
With custom settings:

```bash
python scripts/rag/ingest_video.py \
    https://youtu.be/VIDEO_ID \
    --output-dir ./data/youtube \
    --model-size small \
    --min-chunk-size 400 \
    --max-chunk-size 1000
```
Basic search:

```bash
python scripts/rag/semantic_search.py "RAG implementation with Claude"
```

Advanced search:

```bash
# Search with filters
python scripts/rag/semantic_search.py \
    "Python code examples" \
    --limit 20 \
    --min-similarity 0.7 \
    --chunk-type code

# Search within specific video
python scripts/rag/semantic_search.py \
    "vector database setup" \
    --video-id dQw4w9WgXcQ \
    --limit 5
```
Programmatic usage:

```python
from scripts.rag.semantic_search import search_youtube_content

results = search_youtube_content(
    query="How to implement RAG",
    limit=10,
    min_similarity=0.7,
    video_id=None  # Search all videos
)

for result in results:
    print(f"Video: {result['title']}")
    print(f"Similarity: {result['similarity']:.3f}")
    print(f"Text: {result['chunk_text'][:200]}...")
```
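Result rows carry timestamps in seconds; a small helper (hypothetical, not part of the shipped scripts) renders them as the `mm:ss` timestamps shown in the sample search output:

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as mm:ss (or h:mm:ss past one hour)."""
    total = int(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}" if h else f"{m:02d}:{s:02d}"

print(format_timestamp(765))   # → 12:45
print(format_timestamp(2730))  # → 45:30
```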
Tuning tips:

- `max_chunk_size`: for longer, detailed explanations
- `min_chunk_size`: for short, dense content
- `overlap_tokens`: to maintain context between chunks
- `min_similarity`: threshold for matches (0.7 is a good default)
- `chunk_type`: filter for code vs. text searches

**Issue**: "OpenAI API key not found"
```bash
# Solution: Set environment variable
export OPENAI_API_KEY=sk-your-api-key-here
# Or add to .env file
```

**Issue**: "Database connection failed"

```bash
# Solution: Check Supabase credentials
python scripts/db/test_connection.py
```

**Issue**: "Embedding dimension mismatch"

```bash
# Solution: Ensure using text-embedding-3-small (1536 dimensions)
# Check model name in generate_embeddings.py
```

**Issue**: "Out of memory during embedding generation"

```bash
# Solution: Reduce batch size
# In generate_embeddings.py, change batch_size from 2048 to 100
```

**Issue**: "Search returns no results"

```bash
# Solution: Lower similarity threshold
python scripts/rag/semantic_search.py "query" --min-similarity 0.5
```
**Version**: 1.0.0
**Created**: 2025-10-28
**Status**: Production Ready
**Task**: 044-6
**Dependencies**: youtube-video-analysis, dockling_chunker, db_client, openai