Use when analyzing large, legacy, or undocumented codebases to build a navigable knowledge graph for context retrieval.
Analyze large, legacy, or undocumented codebases and produce a persistent knowledge graph. The graph enables fast, focused context retrieval in future sessions — any AI agent (Amp, Copilot, etc.) can query it to understand unfamiliar code without re-reading everything.
By default, the graph is written to <repo>/archaeology/kg/.

Run scripts/index.py to build or update the knowledge graph.
python scripts/index.py <repo-root> [options]
| Option | Default | Description |
|---|---|---|
| --output-dir | <repo>/archaeology/kg/ | Where to write graph files |
| --full | off | Force full re-index (ignore hashes) |
| --since-git <ref> | (none) | Only index files changed since <ref> (commit, tag, branch) |
What it produces:
- nodes.jsonl: all discovered symbols and structural elements
- edges.jsonl: relationships between nodes
- files.jsonl: file metadata and content hashes
- indexes/: lookup indexes (by symbol, path, tag)
- summaries/: per-module and per-package prose summaries

Incremental behavior: On subsequent runs, only files whose content hash has changed are re-processed. Use --full to force a complete rebuild.
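The incremental behavior described above can be sketched roughly as follows. This is an illustration only, not the code in scripts/index.py: the SHA-256 hash, the `path` and `hash` field names in files.jsonl, and all helper names are assumptions.

```python
import hashlib
import json
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes (hash choice is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_known_hashes(files_jsonl: Path) -> dict:
    """Map relative path -> stored hash. Field names 'path' and 'hash' are hypothetical."""
    known = {}
    if files_jsonl.exists():
        for line in files_jsonl.read_text(encoding="utf-8").splitlines():
            if line.strip():
                record = json.loads(line)
                known[record["path"]] = record["hash"]
    return known


def changed_files(repo_root: Path, candidates: list, known: dict, full: bool = False) -> list:
    """Return the candidate files that need re-processing.

    With full=True every candidate is returned, mirroring the --full option.
    Otherwise a file is returned when it is new or its hash differs from the stored one.
    """
    if full:
        return list(candidates)
    return [
        p for p in candidates
        if known.get(p.relative_to(repo_root).as_posix()) != content_hash(p)
    ]
```

A file that is missing from files.jsonl has no stored hash, so it is treated as changed and indexed on the next run.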
Run scripts/query_graph.py to retrieve context bundles from the graph.
python scripts/query_graph.py <kg-dir> [options]
| Option | Default | Description |
|---|---|---|
| --symbol <name> | (none) | Find nodes matching a symbol name |
| --path <glob> | (none) | Filter by file path |
| --tags <tag,...> | (none) | Filter by tags (e.g., god_object,hidden_io) |
| --hops <n> | 2 | Max edge traversal depth from matched nodes |
| --max-nodes <n> | 50 | Cap on returned nodes |
| --format | markdown | Output format (markdown or json) |
Output: A context bundle (markdown by default, or JSON with --format json) containing the matched subgraph: nodes, edges, evidence pointers, and summaries. The bundle is ready for pasting into an agent session.
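One way to picture how --hops and --max-nodes bound the returned subgraph is a breadth-first expansion from the matched nodes. The sketch below is an assumption about the behavior, not the logic of scripts/query_graph.py: the `source` and `target` edge fields, the function name, and the choice to follow edges in both directions are all hypothetical.

```python
from collections import deque


def expand_subgraph(seeds, edges, hops=2, max_nodes=50):
    """Breadth-first expansion from seed node ids, bounded by hop depth and node count.

    `edges` is a list of dicts with hypothetical 'source' and 'target' keys.
    Edges are followed in both directions (an assumption).
    """
    neighbors = {}
    for edge in edges:
        neighbors.setdefault(edge["source"], []).append(edge["target"])
        neighbors.setdefault(edge["target"], []).append(edge["source"])

    selected = []          # node ids in the order they were reached
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append((seed, 0))

    while queue and len(selected) < max_nodes:
        node, depth = queue.popleft()
        selected.append(node)
        if depth == hops:
            continue       # do not expand beyond the hop limit
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return selected
```

Because the queue is processed in breadth-first order, hitting the node cap drops the most distant nodes first and keeps those closest to the match.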
The graph uses JSONL files with three entity types: nodes, edges, and files. See reference/graph-schema.md for the full schema.
| Type | Description |
|---|---|
| file | Source file |
| module | Language module / namespace |
| package | Package / crate / gem |
| class | Class or struct |
| type | Type alias, interface, protocol |
| function | Free function |
| method | Method on a class/type |
| endpoint | HTTP / RPC / GraphQL endpoint |
| config | Configuration key or block |
| datastore | Database, cache, queue |
| event | Event or message type |
| job | Background job / cron task |
| test | Test case or suite |
| build_target | Build rule or target |
| external_service | Third-party service dependency |
| doc | Documentation artifact |
| Edge | Meaning |
|---|---|
| contains | Parent structurally contains child |
| defines | File/module defines a symbol |
| imports | Source imports target |
| calls | Source invokes target |
| implements | Source implements target interface |
| inherits | Source extends target |
| reads | Source reads from datastore/config |
| writes | Source writes to datastore/config |
| emits | Source emits event |
| consumes | Source consumes event |
| exposes | Module exposes an endpoint |
| uses_config | Source references config key |
| depends_on | Build/deploy dependency |
| tests | Test covers target |
| documents | Doc documents target |
Every node and edge carries an evidence array:
{"file": "src/server.py", "start_line": 42, "end_line": 58}
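An evidence pointer like the one above can be resolved back to source text. The following is a minimal sketch, assuming that line numbers are 1-indexed and inclusive, that `file` is relative to the repository root, and that each record stores its pointers under an `evidence` key. The helper names are hypothetical; reference/graph-schema.md remains the authority on the actual fields.

```python
import json
from pathlib import Path


def evidence_snippet(repo_root: Path, evidence: dict) -> str:
    """Return the source lines an evidence pointer refers to.

    Assumes 1-indexed, inclusive start_line/end_line and a repo-relative file path.
    """
    lines = (repo_root / evidence["file"]).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[evidence["start_line"] - 1:evidence["end_line"]])


def snippets_for(repo_root: Path, jsonl_line: str) -> list:
    """Resolve every pointer in one node or edge record (one line of a JSONL file)."""
    record = json.loads(jsonl_line)
    return [evidence_snippet(repo_root, ev) for ev in record.get("evidence", [])]
```

Resolving pointers on demand keeps the graph files small, since they store locations rather than copies of the source.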
archaeology/kg/
├── nodes.jsonl
├── edges.jsonl
├── files.jsonl
├── indexes/
│   ├── by_symbol.json
│   ├── by_path.json
│   └── by_tag.json
└── summaries/
    ├── <module>.md
    └── overview.md
When Python is unavailable, the agent builds the graph manually using read/search tools:
- Start from entry points: main, index, app, server, config files, build files
- Write nodes.jsonl, edges.jsonl, files.jsonl directly in archaeology/kg/
- Write summaries to summaries/

The agent should prioritize breadth-first: get the coarse structure right before deep-diving into any single module.
Two-pass approach. See reference/traversal-strategy.md for details.
Pass 1 (coarse inventory):
- Entry points (main, index, app, server, CLI definitions)

Pass 2 (targeted deepening):
Detected patterns are stored as tags on nodes:
| Tag | Meaning |
|---|---|
| god_object | Class/module with excessive responsibilities |
| feature_envy | Entity that over-references another module's internals |
| duplicate_logic | Near-duplicate implementations across files |
| hidden_io | I/O buried inside business logic |
| stringly_typed_config | Config accessed via raw strings without validation |
| shared_mutable_state | Globals or shared state without synchronization |
| temporal_coupling | Operations that must happen in a specific undocumented order |
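Because tags live on nodes, a tag lookup such as indexes/by_tag.json can in principle be derived from nodes.jsonl alone. The sketch below shows one possible derivation. The `id` and `tags` field names and the shape of the resulting index are assumptions, not the documented schema.

```python
import json
from collections import defaultdict


def build_tag_index(jsonl_lines):
    """Map each tag to the ids of the nodes that carry it.

    `jsonl_lines` is an iterable of JSON strings, one node per line.
    Field names 'id' and 'tags' are hypothetical.
    """
    index = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        node = json.loads(line)
        for tag in node.get("tags", []):
            index[tag].append(node["id"])
    return dict(index)


def nodes_with_all_tags(index, tags):
    """Return the sorted ids of nodes that carry every tag in `tags`."""
    id_sets = [set(index.get(tag, [])) for tag in tags]
    return sorted(set.intersection(*id_sets)) if id_sets else []
```

Whether the real --tags filter requires all tags or any tag is not stated in this document, so the intersection used here is only one plausible reading.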
- Content hashes are stored in files.jsonl
- --since-git <ref> uses git diff --name-only to scope the update to changed files
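The --since-git scoping can be approximated with a thin wrapper around `git diff --name-only`. This is a sketch under assumptions: the helper names are hypothetical, and the exact revision range that scripts/index.py passes to git is not stated here, so the plain `<ref>` form (which compares the working tree against that ref) is used for illustration.

```python
import subprocess
from pathlib import Path


def parse_name_only(output: str) -> list:
    """Split `git diff --name-only` output into a list of repo-relative paths."""
    return [line for line in output.splitlines() if line.strip()]


def files_changed_since(repo_root: Path, ref: str) -> list:
    """List files that differ between `ref` and the working tree.

    Requires git on PATH and raises CalledProcessError if git fails,
    for example when `ref` does not exist.
    """
    result = subprocess.run(
        ["git", "diff", "--name-only", ref],
        cwd=repo_root,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_name_only(result.stdout)
```

A call such as `files_changed_since(Path("."), "main")` would return the paths to re-index. Deleted files also appear in this output, so a real indexer would need to check that each path still exists.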