Singularity Data Dominion

Transcendent data engineering and database mastery for LitigationOS. Use when: SQL queries, DuckDB analytics, LanceDB vectors, Polars DataFrames, FTS5 search, RAG pipelines, SQLite optimization, schema design, data migration, cross-database federation, vector embeddings, semantic search, indexing strategy, query optimization, connection pooling, WAL mode, PRAGMA tuning, batch operations.

Profession
Categories: Data Engineering

SINGULARITY-data-dominion — Transcendent Data Engineering

Version: 2.0.0 | Tier: CORE | Domain: Data Engineering & Database Mastery
Absorbs: data-engineering + database-mastery + rag-memory
Activation: "SQL", "query", "database", "DuckDB", "LanceDB", "vector", "FTS5", "Polars", "analytics", "RAG", "embedding", "schema"

Layer 1: SQLite Mastery (litigation_context.db — 1.3 GB, 790+ tables)

Connection Setup — Three-Tier Strategy

Every connection MUST set the PRAGMAs for its tier. Without an adequate busy_timeout in particular, concurrent writers are likely to hit SQLITE_BUSY under load.

import sqlite3
from shared import get_db, sanitize_fts5, config

# Tier 1 — Multiplexer (hot path, high-throughput)
# The three tiers are alternatives: apply exactly one block per connection.
conn = get_db("litigation_context")
conn.execute("PRAGMA busy_timeout = 180000")    # 3 min
conn.execute("PRAGMA journal_mode = WAL")
conn.execute("PRAGMA cache_size = -131072")      # 128 MB
conn.execute("PRAGMA mmap_size = 12884901888")   # 12 GB on NVMe
conn.execute("PRAGMA temp_store = MEMORY")
conn.execute("PRAGMA synchronous = NORMAL")

# Tier 2 — Standard (MCP, daemon, engines)
conn.execute("PRAGMA busy_timeout = 60000")      # 60 s
conn.execute("PRAGMA cache_size = -32000")        # 32 MB
conn.execute("PRAGMA temp_store = MEMORY")
conn.execute("PRAGMA synchronous = NORMAL")

# Tier 3 — Simple (one-off scripts, temp queries)
conn.execute("PRAGMA busy_timeout = 30000")      # 30 s
conn.execute("PRAGMA cache_size = -8000")         # 8 MB
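
One way to keep the tiers from being mixed on a single connection is to hold them in a lookup table and apply exactly one. The sketch below is a minimal illustration that uses only the standard-library sqlite3 module; `TIER_PRAGMAS` and `apply_tier` are hypothetical names, not part of `shared`, and the values are copied from the listing above.

```python
import sqlite3

# Hypothetical lookup table; the values mirror the three tiers listed above.
TIER_PRAGMAS = {
    "multiplexer": {
        "busy_timeout": 180000,     # 3 min
        "journal_mode": "WAL",
        "cache_size": -131072,      # negative means KiB, so 128 MiB
        "mmap_size": 12884901888,   # 12 GiB requested
        "temp_store": "MEMORY",
        "synchronous": "NORMAL",
    },
    "standard": {
        "busy_timeout": 60000,      # 60 s
        "cache_size": -32000,
        "temp_store": "MEMORY",
        "synchronous": "NORMAL",
    },
    "simple": {
        "busy_timeout": 30000,      # 30 s
        "cache_size": -8000,
    },
}


def apply_tier(conn: sqlite3.Connection, tier: str) -> sqlite3.Connection:
    """Apply the PRAGMAs of exactly one tier to an open connection."""
    # PRAGMA values cannot be bound as parameters, so they are formatted in.
    # That is acceptable here only because every name and value is a constant
    # taken from TIER_PRAGMAS, never user input.
    for name, value in TIER_PRAGMAS[tier].items():
        conn.execute(f"PRAGMA {name} = {value}")
    return conn


# Usage: an in-memory database stands in for litigation_context.db.
conn = apply_tier(sqlite3.connect(":memory:"), "standard")
effective_timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
```

Reading a PRAGMA back after setting it shows the value SQLite actually accepted. For mmap_size in particular, a build's compile-time limit may be lower than the 12 GB requested, so the effective value is worth checking on the target machine.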

When to Use DuckDB vs SQLite

Query Type	Use DuckDB	Use SQLite
GROUP BY with 100K+ rows	✅ 10-100× faster	❌ Slow
Window functions (RANK, LAG)	✅ Optimized	⚠️ Works but slower
Complex CTEs with aggregation	✅ Columnar advantage	❌ Row-store penalty
Single-row INSERT/UPDATE	❌ Not designed for OLTP	✅ Fast
FTS5 text search	❌ No FTS5	✅ Native support
Point lookups by ID	❌ Overhead not worth it	✅ Sub-ms with index
Analytical dashboards	✅ Purpose-built	❌ Too slow

When to Use Each Search Mode

Need	Method	Tool
Exact phrase match	FTS5 with quotes	`search_evidence`
Conceptual similarity	Vector search	`vector_search`
Best overall relevance	Hybrid (BM25 + vector)	Custom fusion
Cross-exam ammunition	Impeachment search	`search_impeachment`
Specific citation lookup	Authority search	`search_authority_chains`
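
The table names "Custom fusion" for hybrid search without saying how the BM25 and vector result lists are combined. As a stand-in, the sketch below uses reciprocal rank fusion (RRF), a common rank-based merge; it is not confirmed to be the method LitigationOS uses. The function name `rrf_fuse`, the constant `k=60`, and the evidence IDs are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=10):
    """Merge several ranked ID lists with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1 for the best hit.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by ID so the order is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Hypothetical result lists: one from an FTS5/BM25 query, one from a vector search.
bm25_hits = ["ev_12", "ev_07", "ev_33"]
vector_hits = ["ev_07", "ev_41", "ev_12"]
fused = rrf_fuse([bm25_hits, vector_hits])
```

RRF only needs ranks, so it avoids having to normalize BM25 scores against vector distances. Documents that appear in both lists rise above documents that appear in only one.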

Anti-Patterns (VIOLATIONS = IMMEDIATE FAILURE)

#	Anti-Pattern	Correct Pattern
1	`LIKE '%term%'` when FTS5 exists	FTS5 MATCH with sanitization + LIKE fallback
2	Hardcoded DB paths `r"C:\Users\andre\..."`	`shared.get_db()` or `shared.get_db_path()`
3	pandas for DataFrames	Polars (2-10× faster, lazy evaluation)
4	Query without `PRAGMA table_info()` on unfamiliar tables	Always introspect schema first
5	`OFFSET` pagination on 100K+ tables	Cursor-based `WHERE id > :last_seen`
6	Row-by-row INSERT in loops	`executemany()` batch insert
7	Multiple separate `COUNT(*)` calls	Single query with subqueries
8	`json.load()` for large JSON	`orjson` (small) or `ijson` streaming (large)
9	WAL mode on exFAT (J:\ drive)	DELETE mode or `immutable=1` URI
10	Missing `PRAGMA busy_timeout`	Always set ≥30000 ms
11	`SELECT *` in hot paths	Explicit column lists
12	Cosine similarity alone for contradictions	Two-stage: bi-encoder → cross-encoder
13	No commit after batch insert	`conn.commit()` after `executemany`
14	Opening DB inside shell commands	Dedicated Python scripts with proper PRAGMAs
15	Trusting `CREATE TABLE IF NOT EXISTS` for schema	It silently keeps an existing table even if its schema differs; introspect first
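
Several of the correct patterns in the table (rows 5, 6, 7, 11, and 13) can be sketched together with the standard-library sqlite3 module. This is a minimal illustration, not project code: the `evidence` table, its columns, and the helper `iter_pages` are hypothetical, and an in-memory database stands in for litigation_context.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE evidence (id INTEGER PRIMARY KEY, lane TEXT, body TEXT)")

# Rows 6 and 13: one executemany() call for the batch, then an explicit commit.
rows = [(i, "A" if i % 2 else "B", f"exhibit {i}") for i in range(1, 1001)]
conn.executemany("INSERT INTO evidence (id, lane, body) VALUES (?, ?, ?)", rows)
conn.commit()


def iter_pages(conn, page_size=250):
    """Row 5: keyset pagination; each page resumes after the last id seen."""
    last_seen = 0
    while True:
        # Row 11: explicit column list instead of SELECT *.
        page = conn.execute(
            "SELECT id, lane, body FROM evidence "
            "WHERE id > :last_seen ORDER BY id LIMIT :n",
            {"last_seen": last_seen, "n": page_size},
        ).fetchall()
        if not page:
            return
        yield page
        last_seen = page[-1][0]


# Row 7: several counts in a single round trip via scalar subqueries.
total, lane_a, lane_b = conn.execute(
    """
    SELECT (SELECT COUNT(*) FROM evidence),
           (SELECT COUNT(*) FROM evidence WHERE lane = 'A'),
           (SELECT COUNT(*) FROM evidence WHERE lane = 'B')
    """
).fetchone()

pages = list(iter_pages(conn))
```

Keyset pagination stays cheap on deep pages because the primary-key index seeks straight to `id > :last_seen`, whereas `OFFSET` has to walk and discard every skipped row.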

Performance Budgets

Operation	Target	Degraded	Unacceptable
Single-row lookup (indexed)	< 1 ms	< 5 ms	> 50 ms
FTS5 search (25 results)	< 10 ms	< 50 ms	> 200 ms
DuckDB GROUP BY (100K rows)	< 50 ms	< 200 ms	> 1 s
Vector search (top-10)	< 20 ms	< 100 ms	> 500 ms
Batch insert (1000 rows)	< 100 ms	< 500 ms	> 2 s
Cross-DB ATTACH + query	< 200 ms	< 1 s	> 5 s
Cross-encoder rerank (50 pairs)	< 500 ms	< 2 s	> 5 s
Full hybrid search pipeline	< 300 ms	< 1 s	> 3 s
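
The budgets above can be checked mechanically. The sketch below is one possible encoding, with hypothetical names (`BUDGETS_MS`, `classify`, `timed`) and thresholds copied from the table in milliseconds. The table does not name the band between the degraded bound and the unacceptable bound, so the sketch reports it as "unclassified" rather than guessing.

```python
import time

# (target_ms, degraded_ms, unacceptable_ms), copied from the table above.
BUDGETS_MS = {
    "single_row_lookup": (1, 5, 50),
    "fts5_search": (10, 50, 200),
    "duckdb_group_by": (50, 200, 1000),
    "vector_search": (20, 100, 500),
    "batch_insert": (100, 500, 2000),
    "cross_db_attach": (200, 1000, 5000),
    "cross_encoder_rerank": (500, 2000, 5000),
    "hybrid_search": (300, 1000, 3000),
}


def classify(operation, elapsed_ms):
    """Map a measured latency onto the budget bands for one operation."""
    target, degraded, unacceptable = BUDGETS_MS[operation]
    if elapsed_ms < target:
        return "target"
    if elapsed_ms < degraded:
        return "degraded"
    if elapsed_ms > unacceptable:
        return "unacceptable"
    # The table leaves this band unnamed.
    return "unclassified"


def timed(operation, fn, *args, **kwargs):
    """Run fn, returning (result, elapsed_ms, budget_band)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms, classify(operation, elapsed_ms)
```

A call such as `timed("batch_insert", conn.executemany, sql, rows)` would return the band alongside the result, which makes regressions easy to log.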

Decision Matrix: Which Tool for Which Data Task

Task	Primary Tool	Fallback	Why
Point lookup by ID	SQLite	—	Sub-ms with index
Full-text keyword search	FTS5 + BM25	LIKE fallback	Ranked relevance
Semantic similarity	LanceDB vector	FTS5 keyword	Conceptual matching
Analytical aggregation	DuckDB	SQLite GROUP BY	10-100× columnar
DataFrame manipulation	Polars	DuckDB SQL	Lazy eval, zero-copy
JSON parsing (< 100 MB)	orjson	json stdlib	10× speed
JSON parsing (> 100 MB)	ijson streaming	—	O(1) memory
PDF text extraction	pypdfium2	PyMuPDF	5× faster
Schema validation	msgspec.Struct	—	10-80× vs pydantic
Cross-table fusion	nexus_fuse tool	Manual JOINs	5 sources at once
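
The "FTS5 + BM25 with LIKE fallback" row can be sketched with the standard-library sqlite3 module. The real `shared.sanitize_fts5` is not shown in this document, so `sanitize_fts5_sketch` below is a hypothetical stand-in that quotes each token. The `docs` and `docs_fts` tables, the sample rows, and `build_demo` are likewise made up for illustration.

```python
import sqlite3


def sanitize_fts5_sketch(text):
    """Quote every whitespace-separated token so FTS5 reads it as a literal string."""
    # Inside an FTS5 string, an embedded double quote is escaped by doubling it.
    return " ".join('"' + tok.replace('"', '""') + '"' for tok in text.split())


def search(conn, text, limit=25):
    """Try a BM25-ranked FTS5 query first, then fall back to an escaped LIKE scan."""
    query = sanitize_fts5_sketch(text)
    if not query:
        return []
    try:
        return conn.execute(
            "SELECT rowid, body FROM docs_fts WHERE docs_fts MATCH ? "
            "ORDER BY bm25(docs_fts) LIMIT ?",
            (query, limit),
        ).fetchall()
    except sqlite3.OperationalError:
        # Reached when FTS5 is unavailable, the index is missing, or the query is rejected.
        escaped = text.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
        return conn.execute(
            "SELECT id, body FROM docs WHERE body LIKE ? ESCAPE '\\' LIMIT ?",
            (f"%{escaped}%", limit),
        ).fetchall()


def build_demo():
    """Create a tiny in-memory corpus, with an FTS5 index when the build supports it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
    rows = [
        (1, "Custody order entered"),
        (2, "Motion to compel discovery"),
        (3, "custody evaluation report"),
    ]
    conn.executemany("INSERT INTO docs (id, body) VALUES (?, ?)", rows)
    try:
        conn.execute("CREATE VIRTUAL TABLE docs_fts USING fts5(body)")
        conn.executemany("INSERT INTO docs_fts (rowid, body) VALUES (?, ?)", rows)
    except sqlite3.OperationalError:
        pass  # no FTS5 in this SQLite build; search() will take the LIKE path
    conn.commit()
    return conn
```

The two paths are not equivalent: FTS5 matches whole tokens and ranks by BM25, while the LIKE path does an unranked substring scan, so it is best treated as a degraded mode rather than a drop-in replacement.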

Key NEXUS Extension Tools

Tool	Action	Use For
`query_litigation_db`	`query`	Parameterized SQL (read/write)
`search_evidence`	`search_evidence`	FTS5 evidence search
`vector_search`	`vector_search`	Semantic similarity
`nexus_fuse`	`nexus_fuse`	Cross-table fusion
`search_impeachment`	`search_impeachment`	Cross-exam ammunition
`search_contradictions`	`search_contradictions`	Adversary inconsistencies
`search_authority_chains`	`search_authority`	Citation chain lookup

Layer 1: SQLite Mastery (litigation_context.db — 1.3 GB, 790+ tables)

Connection Setup — Three-Tier Strategy

FTS5 Safety Protocol (Rule 15 — MANDATORY)

Schema Introspection (Rule 16 — Before ANY Unfamiliar Table)

Batch Operations (10-100× Faster)

Consolidate COUNT(*) Calls

Composite Indexes for Hot Queries

Cursor-Based Pagination (NOT OFFSET)

Cross-Database Federation with ATTACH

Layer 2: DuckDB Analytics (10-100× Faster Than SQLite OLAP)

When to Use DuckDB vs SQLite

DuckDB + SQLite Scanner Integration

Litigation Analytics Patterns

Layer 3: LanceDB Vector Search (75K Vectors, 384-dim)

Embedding Generation

Vector Search via NEXUS Tool

Hybrid Search: BM25 + Vector Fusion

When to Use Each Search Mode

Cross-Encoder Reranking (25-35% MRR Boost)

Layer 4: Polars DataFrames (2-10× Faster Than pandas)

Lazy Evaluation for Large Datasets

DuckDB → Polars Integration

Layer 5: RAG Pipeline Architecture

End-to-End Pipeline

PDF Extraction (pypdfium2 — 5× Faster Than PyMuPDF)

Anti-Patterns (VIOLATIONS = IMMEDIATE FAILURE)

Performance Budgets

Decision Matrix: Which Tool for Which Data Task

Key NEXUS Extension Tools
