Build, maintain, and extend the Ariadne Core codebase. Triggers: modify code, fix bugs, add features, write tests, repo structure questions.
Use this skill for modifying code, fixing bugs, adding features, updating configuration, writing tests, or any changes to the Ariadne Core repo. Also covers repo structure, design decisions, architecture, and file sync questions.
For using Ariadne Core as an end user (ingesting documents, searching, etc.), use the ariadne-document-intelligence skill instead.
This skill requires Claude Code or any agent with terminal access. It involves running tests, editing source files, and executing build commands. For the visual overview, use the ariadne-core-walkthrough skill in Claude Desktop (Cowork).
This skill teaches you how to work inside the Ariadne Core codebase. It covers repo structure, design decisions, guard rails, architecture, and which files must stay in sync. Use this skill when modifying code — not when using the system as an end user.
Read these files in this order before making changes (use Glob to find them):
- `**/ariadne-core/SPEC.md` — source of truth for all tool signatures, API endpoints, and behavior
- `**/ariadne-document-intelligence/SKILL.md` — what agents are taught about using the system
- `**/ariadne-core/docs/docint-architecture.md` — full architecture spec

If the code doesn't match the spec, the code is wrong.
Ariadne Core is an open source document extraction and retrieval pipeline — the personal/SMB alternative to enterprise document intelligence stacks. It converts documents (PDF, DOCX, PPTX, XLSX, HTML, 20+ formats) into clean Markdown + vector embeddings, and exposes them via MCP server and REST API.
License: Apache 2.0. All dependencies must be Apache 2.0 or MIT compatible.
Phase 1 (current): MarkItDown only. No local GPU required, but API keys needed for full performance (embedding, vision). Managed/Team editions will add enhanced extraction for formats MarkItDown handles poorly.
ariadne-core/
├── CLAUDE.md # Thin pointer to this skill
├── SKILL.md # Routing entry point — directs to specialized skills
├── SPEC.md # Source of truth — tools, API, behavior
├── README.md
├── LICENSE
├── docker-compose.yml # App + Postgres (for Railway / self-hosting)
├── Dockerfile # Production container
├── .env.example
├── config/
│ └── ariadne.yaml # Main config file
├── src/
│ └── pipeline/
│ ├── __init__.py
│ ├── __main__.py # CLI entrypoint: `serve` starts MCP + REST
│ ├── mcp_server.py # MCP tool definitions (Streamable HTTP)
│ ├── config.py # Config file + env var loader
│ ├── dedup.py # SHA-256 fingerprinting + dedup gate
│ ├── schema.py # Pydantic models
│ ├── stores.py # Store orchestration
│ ├── api/
│ │ ├── app.py # FastAPI application
│ │ ├── routes.py # REST endpoints (upload, documents, search, etc.)
│ │ └── auth.py # API key middleware
│ ├── extraction/
│ │ └── markitdown.py # MarkItDown wrapper
│ ├── enrichment/
│ │ ├── images.py # Image enrichment post-processing
│ │ └── vision.py # Vision API client (native Gemini generateContent)
│ ├── chunking/
│ │ └── chunker.py # Chunking strategies (by_title, by_page, fixed_size)
│ ├── embedding/
│ │ └── embedder.py # Embedding API client
│ └── storage/
│ ├── base.py # VectorStore protocol
│ └── pgvector.py # Default implementation
├── src/pyproject.toml
├── migrations/
│ └── 001_initial.sql # Consolidated schema — all tables, indexes,
│ # GIN indexes on tags/warnings, partial index
│ # on documents.source_reference. Pass 4
│ # (commit 4d7ddb7) folded the prior 002-005
│ # into this single file. Destructive deploys
│ # required when the file changes — see
│ # dave_and_bob_communication/PLAYBOOK_DESTRUCTIVE_DEPLOY.md
├── tests/
│ ├── test_*.py # Unit + integration tests
│ └── fixtures/ # Sample documents for testing
├── docs/
│ ├── docint-architecture.md # Full architecture spec
│ ├── installation.md
│ ├── configuration.md
│ ├── mcp-setup.md # How to connect MCP clients
│ ├── ob1-integration.md # How to use with Open Brain
│ ├── patches/ # Applied spec patches (historical)
│ └── skills/
│ ├── ariadne-core-build/
│ │ └── SKILL.md # This file — development skill
│ ├── ariadne-core-walkthrough/
│ │ └── SKILL.md # Visual presentation skill (Cowork)
│ ├── ariadne-core-install/
│ │ └── SKILL.md # Deployment & connection skill (Claude Code)
│ ├── ariadne-core-deploy/
│ │ └── SKILL.md # Platform-specific deploy details
│ └── ariadne-document-intelligence/
│ ├── SKILL.md # Agent skill definition (source of truth)
│ └── README.md # Skill installation guide
└── benchmarks/
└── run_benchmarks.py
These files describe the same system from different angles. When any one changes, check the others for drift:
| File | What it defines | Authority |
|---|---|---|
| SPEC.md | Tool signatures, API endpoints, behavior contracts | Primary source of truth |
| skills/.../SKILL.md | How agents should use the tools, caller metadata, processes | Must match SPEC tool signatures and response fields |
| src/pipeline/api/routes.py | REST API endpoints (including /api/upload) | Must match SPEC API table |
| docs/mcp-setup.md | Client connection instructions | Must reflect current architecture |
| config/ariadne.yaml | Configuration schema | Must match docs/configuration.md |
| migrations/*.sql | Database schema | Must match SPEC table definitions |
| docker-compose.yml | Infrastructure (app + Postgres) | Must match SPEC deployment model |
| Dockerfile | Production container | Must match deployment instructions |
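When one of these files changes, a quick mechanical check catches the most common drift: a tool renamed in SPEC.md but not in the skill files. A minimal sketch, assuming the tool list from this document (the real sync surface is wider — response fields, parameters, schema — and still needs human review):

```python
# Tool names that must appear consistently across sync-critical files.
# This list mirrors the tools named in this skill; update it if SPEC.md changes.
TOOLS = ["convert_document", "search", "get_document",
         "list_documents", "list_collections", "ingest"]

def missing_tools(text: str) -> list[str]:
    """Return the tool names a sync-critical file never mentions."""
    return [t for t in TOOLS if t not in text]

def drift_report(files: dict[str, str]) -> dict[str, list[str]]:
    """Map filename -> missing tool names, keeping only files with gaps."""
    report = {name: missing_tools(body) for name, body in files.items()}
    return {name: gaps for name, gaps in report.items() if gaps}
```

Feed it the file contents (e.g. via `pathlib.Path.read_text()`) and an empty report means no obvious naming drift.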
Ariadne Core runs as a hosted service. One deployment serves all clients over HTTPS.
Railway / Fly.io / VPS
┌─────────────────────────┐
│ ariadne-core │
│ ├── MCP Server │
│ ├── REST API │
│ ├── Postgres + pgvec │
│ ├── MarkItDown │
│ └── Chunking/Embed │
└─────────────────────────┘
MCP Server
▲ ▲ ▲ ▲
│ │ │ └── Claude Cowork (Managed edition or roll your own OAuth)
│ │ └───── OpenClaw
│ └──────── Open Brain
└─────────── Claude Code
Authentication is by API key for the Personal edition and OAuth for Managed and higher
editions. You can also wire up your own OAuth for the Personal edition.
No local installation required for end users. No Docker on the user's machine. No STDIO. One HTTPS URL for everything.
| Client | How it connects |
|---|---|
| Claude Code | MCP with API key |
| Claude Cowork | MCP + OAuth (Managed edition or roll your own) |
| Open Brain | MCP with API key |
| OpenClaw | MCP with API key |
| Cursor | MCP with API key |
| Any MCP client | MCP over HTTPS with API key |
| Any HTTP client | REST API over HTTPS with X-API-Key header |
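For REST clients, the contract above boils down to one HTTPS base URL plus an X-API-Key header. A minimal stdlib sketch — the base URL and key are placeholders, and /api/health is the one endpoint open without a key:

```python
import json
import urllib.request

BASE_URL = "https://your-ariadne-host.example"  # hypothetical deployment URL
API_KEY = "your-api-key"                        # placeholder

def get(path: str) -> dict:
    """GET a JSON endpoint, authenticating with the X-API-Key header."""
    req = urllib.request.Request(BASE_URL + path, headers={"X-API-Key": API_KEY})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage against a live deployment:
# print(get("/api/health"))
```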
Since the server runs remotely, clients cannot pass local file paths. Documents must be provided as server-side paths:

- POST /api/upload accepts file uploads and returns a server-side path for use with convert_document.
- The ingest tool (batch directory ingestion) only works with server-side paths.
ariadne-core serve starts both the Streamable HTTP MCP server and REST API
in a single process using asyncio.gather with two uvicorn servers.
The MCP server and REST API share the same pipeline code, database connection pool, and configuration.
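The two-servers-one-process pattern can be sketched with plain asyncio. In the real code the two coroutines are uvicorn Server.serve() calls for the MCP server and the FastAPI app; here they are stand-ins so the shape is runnable without dependencies:

```python
import asyncio

async def run_server(name: str, started: list[str]) -> str:
    """Stand-in for uvicorn.Server(...).serve() — a long-running coroutine."""
    started.append(name)
    await asyncio.sleep(0)  # yield to the loop, as a real server would
    return name

async def serve() -> list[str]:
    started: list[str] = []
    # One event loop drives both servers, which is why they can share the
    # pipeline code, the database connection pool, and the loaded config.
    await asyncio.gather(
        run_server("mcp", started),   # Streamable HTTP MCP server
        run_server("rest", started),  # FastAPI REST API
    )
    return started
```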
Tools defined in SPEC.md: convert_document, search, get_document,
list_documents, list_collections, ingest. All accept caller metadata.
convert_document and ingest accept a force flag to override dedup. For
local files, callers upload via REST POST /api/upload first and pass the
returned server-side path to convert_document.
See SPEC.md for full parameter tables and response fields.
Guard rails:

- Never hardcode secrets. Use ${VAR} interpolation in ariadne.yaml and .env for actual values. Ship .env.example with placeholders.
- Never use DROP TABLE, DROP DATABASE, TRUNCATE, or unqualified DELETE FROM in migration files.
- Embedding and vision use Gemini-native endpoints (batchEmbedContents, generateContent). Other providers require forking per SPEC.md → "Provider constraints".
- Local model support exists only as a config option — never the default.
- The embedding_model column on chunks must always be populated.
- The require_auth config flag gates all endpoints except /api/health.

Every incoming document is fingerprinted (SHA-256 on normalized text) BEFORE any expensive processing. If the fingerprint exists in the target collection, skip extraction/chunking/embedding. But ALWAYS record the interaction.
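The dedup gate described above can be sketched as follows. The exact normalization lives in src/pipeline/dedup.py and may differ; this version assumes NFC normalization plus collapsed whitespace:

```python
import hashlib
import unicodedata

def fingerprint(text: str) -> str:
    """SHA-256 over normalized text (assumed: NFC + collapsed whitespace)."""
    normalized = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def should_skip(fp: str, existing: set[str], force: bool = False) -> bool:
    """Skip extraction/chunking/embedding on a hit — unless force is set.
    Either way, the caller must still record a document_interactions row."""
    return fp in existing and not force
```

Because whitespace is collapsed before hashing, `fingerprint("Hello  world\n")` and `fingerprint("Hello world")` collide, which is the point: trivially re-serialized documents dedup to the same row.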
Two separate concerns, two tables:

- documents — one row per unique document per collection. Owns the content, fingerprint, processing_chain.
- document_interactions — one row per agent call. Records agent_id, agent_type, model, initiated_by, action, was_dedup_skip, agent_notes, agent_metadata. Grows with every touch, even dedup skips.

When search returns results, include all document_interactions for each matched document.
Different agents are the tenants, not organizations. Every MCP tool and REST
endpoint accepts caller metadata: agent_id, agent_type, model,
initiated_by, agent_notes, agent_metadata. This metadata goes into
document_interactions, not onto the document itself.
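A hypothetical shape for that caller metadata — field names follow this document, values are illustrative:

```python
caller = {
    "agent_id": "claude-code-session-42",        # hypothetical identifier
    "agent_type": "coding_agent",
    "model": "claude-sonnet",
    "initiated_by": "user",
    "agent_notes": "ingesting Q3 contracts",
    "agent_metadata": {"repo": "ariadne-core"},  # free-form extras
}

def to_interaction_row(caller: dict, action: str, was_dedup_skip: bool) -> dict:
    """Fold caller metadata into a document_interactions row — it is stored
    per interaction, never on the document itself."""
    return {**caller, "action": action, "was_dedup_skip": was_dedup_skip}
```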
org_id column exists on all tables for future row-level security, but is not
enforced in Phase 1. Default value: 00000000-0000-0000-0000-000000000000.
Logical namespaces for documents. Dedup is scoped per collection (unique index on
collection_id, content_fingerprint). Same document can exist in multiple
collections. Search defaults to all collections but can be scoped.
- documents.processing_chain (JSONB, append-only) — tracks HOW content was processed: extraction tool, enrichment steps, embedding model, timestamps, durations.
- document_interactions — tracks WHO touched the document: which agent, when, what action, whether it was a dedup skip, plus agent_notes and agent_metadata.
- Every search call is recorded in the search_log table. One row per search — not per result. Captures query, filters, results, and full caller metadata.
Single ariadne.yaml in config/. Supports ${VAR} interpolation for
secrets. Resolution: defaults → config file → env vars.
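The resolution order and ${VAR} interpolation can be sketched like this; the real loader is src/pipeline/config.py and may differ in detail:

```python
import re

def interpolate(value: str, env: dict) -> str:
    """Replace ${VAR} placeholders with environment values; leave unknown
    placeholders untouched."""
    return re.sub(r"\$\{(\w+)\}", lambda m: env.get(m.group(1), m.group(0)), value)

def resolve(defaults: dict, file_cfg: dict, env: dict) -> dict:
    """Later layers win: defaults first, then the config file with ${VAR}
    expanded from the environment."""
    merged = dict(defaults)
    for key, val in file_cfg.items():
        merged[key] = interpolate(val, env) if isinstance(val, str) else val
    return merged
```

In production the env dict is os.environ, so secrets never appear in ariadne.yaml itself.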
Key tables: collections, documents, document_interactions, chunks,
api_keys, search_log, schema_migrations. Everything lives in a single
consolidated migration file:
- migrations/001_initial.sql — full schema, including the warnings TEXT[] column, soft_deleted_at, the denormalized documents.source_reference column (Pass 4), GIN indexes on tags and warnings, and the partial index on source_reference that powers the has_source_reference filter.
- The runner at src/pipeline/stores.py:_apply_migrations applies pending files in sorted order and tracks applied versions in schema_migrations. If you need to change an already-applied migration (rather than add a new one), the DB has to be wiped — see dave_and_bob_communication/PLAYBOOK_DESTRUCTIVE_DEPLOY.md.
The unique constraint (collection_id, content_fingerprint) on documents
enforces dedup. On every call, the pipeline:

- checks the fingerprint against the target collection (overridable with force)
- writes a document_interactions row (ALWAYS, even on dedup skip)

The ingest tool processes files concurrently using asyncio.Semaphore(4) and
asyncio.gather. Each file is processed in a _process_file_safe() wrapper that
catches exceptions per-file so one failure doesn't abort the batch.
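The ingest batching pattern can be sketched as below: a semaphore caps concurrency at 4, and the safe wrapper converts per-file exceptions into error results so the batch always completes. The process_file body here is a stand-in for the real extract/chunk/embed work:

```python
import asyncio

async def process_file(path: str) -> str:
    """Stand-in for per-file extraction/chunking/embedding."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot extract {path}")
    await asyncio.sleep(0)
    return f"ok:{path}"

async def ingest(paths: list[str], limit: int = 4) -> list[str]:
    sem = asyncio.Semaphore(limit)  # at most `limit` files in flight

    async def process_safe(path: str) -> str:
        async with sem:
            try:
                return await process_file(path)
            except Exception as exc:  # isolate per-file failures
                return f"error:{path}:{exc}"

    # gather preserves input order, so results line up with paths
    return await asyncio.gather(*(process_safe(p) for p in paths))
```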
Thread safety: psycopg_pool is thread-safe, embedding client uses urllib (thread-safe), MarkItDown creates local state per call, singletons are read-only during processing.
Calling again with force=true should re-process the document.

Deploy to Railway:

railway up

Railway reads Dockerfile and docker-compose.yml. Environment variables
are set in the Railway dashboard.
docker compose up -d
The docker-compose.yml runs the application and Postgres together.
For local development, you can run Postgres in Docker and the app on the host:
# Start just Postgres
docker compose up -d postgres
# Install and run the app
pip install -e src/
ariadne-core serve
This gives you hot reload and debugger access while still using the production database.
Future ideas (not Phase 1):

- A markitdown-ocr plugin could send pages to the vision API. Expensive. Consider explicit opt-in with cost warnings.
- MCP sampling (sampling/createMessage). Not Phase 1.