Investigate LLM analytics evaluations of both types — `hog` (deterministic code-based) and `llm_judge` (LLM-prompt-based). Find existing evaluations, inspect their configuration, run them against specific generations, query individual pass/fail results, and generate AI-powered summaries of patterns across many runs. Use when the user asks to debug why an evaluation is failing, surface common failure modes, compare results across filters, dry-run a Hog evaluator, prototype a new LLM-judge prompt, or manage the evaluation lifecycle (create, update, enable/disable, delete).
PostHog evaluations score `$ai_generation` events. Each evaluation is one of two types, both first-class:

- **`hog`** — deterministic Hog code that returns true/false (and optionally N/A). Best for objective rule-based checks: format validation (JSON parses, schema matches), length limits, keyword presence/absence, regex patterns, structural assertions, latency thresholds, cost guards. Cheap, fast, reproducible — no LLM call per run. Prefer this when the criterion can be expressed as code.
- **`llm_judge`** — an LLM scores generations against a prompt you write. Best for subjective or fuzzy checks: tone, helpfulness, hallucination detection, off-topic drift, instruction-following. Costs an LLM call per run and requires AI data processing approval at the org level.

Results from both types land in ClickHouse as `$ai_evaluation` events with the same schema, so the read/query/summary workflows are identical regardless of evaluator type — the only thing that changes is whether `$ai_evaluation_reasoning` was written by Hog code or by an LLM.
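As a concrete illustration of the `hog` type, here is a minimal evaluator sketch that passes when a generation's output parses as JSON. This is a sketch under stated assumptions, not a confirmed API: the `event.properties['$ai_output_choices']` binding, the availability of `jsonParse` in the evaluator context, and its null-on-invalid behavior are all assumptions — verify with a `posthog:evaluation-test-hog` dry run before saving.

```hog
// Sketch only: property name and execution context are assumptions.
let output := event.properties['$ai_output_choices'];
if (output == null) {
    return null; // no output captured: treat the run as N/A
}
// Assumes jsonParse returns null for invalid input.
return jsonParse(output) != null;
```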
This skill covers the full lifecycle: list/inspect/manage evaluation configs (Hog or LLM judge), run them on specific generations, query individual results, and get an AI-generated summary of pass/fail/N/A patterns across many runs.
| Tool | Purpose |
|---|---|
| `posthog:evaluations-get` | List/search evaluation configs (filter by name, enabled flag) |
| `posthog:evaluation-get` | Get a single evaluation config by UUID |
| `posthog:evaluation-create` | Create a new `llm_judge` or `hog` evaluation |
| `posthog:evaluation-update` | Update an existing evaluation (name, prompt, enabled, …) |
| `posthog:evaluation-delete` | Soft-delete an evaluation |
| `posthog:evaluation-run` | Run an evaluation against a specific `$ai_generation` event |
| `posthog:evaluation-test-hog` | Dry-run Hog source against recent generations (no save) |
| `posthog:llm-analytics-evaluation-summary-create` | AI-powered summary of pass/fail/N/A patterns across runs |
| `posthog:execute-sql` | Ad-hoc HogQL over `$ai_evaluation` events |
| `posthog:query-llm-trace` | Drill into the underlying generation that an evaluation scored |
The first seven `evaluation-*` tools are hand-coded; `llm-analytics-evaluation-summary-create` is generated from `products/llm_analytics/mcp/tools.yaml`.
Every run of an evaluation emits an `$ai_evaluation` event. Key properties:

| Property | Meaning |
|---|---|
| `$ai_evaluation_id` | UUID of the evaluation config |
| `$ai_evaluation_name` | Human-readable name |
| `$ai_target_event_id` | UUID of the `$ai_generation` event being scored |
| `$ai_trace_id` | Parent trace ID (for jumping to the trace UI) |
| `$ai_evaluation_result` | `true` = pass, `false` = fail |
| `$ai_evaluation_reasoning` | Free-text explanation (set by the LLM judge or Hog code) |
| `$ai_evaluation_applicable` | `false` when the evaluator decided the generation is N/A |
When `$ai_evaluation_applicable` is `false`, the run counts as N/A regardless of `$ai_evaluation_result`.
For evaluations that don't support N/A, this property may be `null` — treat `null` as "applicable".
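To eyeball pass/fail/N/A rates yourself via `posthog:execute-sql`, a HogQL query along these lines can work. It is a sketch over the schema above: the `properties.` dot-access syntax is standard HogQL, but whether the result properties compare as booleans or as strings may vary by ingestion path, so adjust the comparisons if needed.

```sql
SELECT
    properties.$ai_evaluation_name AS evaluation,
    countIf(properties.$ai_evaluation_applicable = false) AS n_a,
    countIf(properties.$ai_evaluation_result = true
            AND coalesce(properties.$ai_evaluation_applicable, true) = true) AS pass,
    countIf(properties.$ai_evaluation_result = false
            AND coalesce(properties.$ai_evaluation_applicable, true) = true) AS fail
FROM events
WHERE event = '$ai_evaluation'
  AND timestamp > now() - INTERVAL 7 DAY
GROUP BY evaluation
ORDER BY fail DESC
```

The `coalesce(..., true)` mirrors the rule above: a `null` applicability flag is treated as "applicable".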
Works the same way for `llm_judge` and `hog` evaluations — the differences only matter when you eventually go to fix the evaluator (edit the prompt vs. edit the Hog source).
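When debugging why a specific evaluation is failing, a drill-down query in the same vein (again a sketch; swap in the real evaluation UUID, which is deliberately left as a placeholder) pulls recent failures with their reasoning and trace IDs, which you can then follow up with `posthog:query-llm-trace`:

```sql
SELECT
    timestamp,
    properties.$ai_target_event_id AS generation_id,
    properties.$ai_trace_id AS trace_id,
    properties.$ai_evaluation_reasoning AS reasoning
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_id = '<evaluation-uuid>'  -- placeholder
  AND properties.$ai_evaluation_result = false
  AND coalesce(properties.$ai_evaluation_applicable, true) = true
ORDER BY timestamp DESC
LIMIT 20
```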