Investigate LLM analytics evaluations of both types — `hog` (deterministic code-based) and `llm_judge` (LLM-prompt-based). Find existing evaluations, inspect their configuration, run them against specific generations, query individual pass/fail results, and generate AI-powered summaries of patterns across many runs. Use when the user asks to debug why an evaluation is failing, surface common failure modes, compare results across filters, dry-run a Hog evaluator, prototype a new LLM-judge prompt, or manage the evaluation lifecycle (create, update, enable/disable, delete).
PostHog evaluations score `$ai_generation` events. Each evaluation is one of two types, both first-class:

- **`hog`** — deterministic Hog code that returns true/false (and optionally N/A). Best for objective rule-based checks: format validation (JSON parses, schema matches), length limits, keyword presence/absence, regex patterns, structural assertions, latency thresholds, cost guards. Cheap, fast, reproducible — no LLM call per run. Prefer this when the criterion can be expressed as code.
- **`llm_judge`** — an LLM scores generations against a prompt you write. Best for subjective or fuzzy checks: tone, helpfulness, hallucination detection, off-topic drift, instruction-following. Costs an LLM call per run and requires AI data processing approval at the org level.

Results from both types land in ClickHouse as `$ai_evaluation` events with the same schema, so the read/query/summary workflows are identical regardless of evaluator type — the only thing that changes is whether `$ai_evaluation_reasoning` was written by Hog code or by an LLM.
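As a concrete illustration of the `hog` type, here is a minimal evaluator sketch that passes when a generation's output parses as JSON. This is a sketch under stated assumptions, not a confirmed API: the `event.properties['$ai_output_choices']` binding, the availability of `jsonParse` in the evaluator context, and its null-on-invalid behavior are all assumptions — verify with a `posthog:evaluation-test-hog` dry run before saving.

```hog
// Sketch only: property name and execution context are assumptions.
let output := event.properties['$ai_output_choices'];
if (output == null) {
    return null; // no output captured: treat the run as N/A
}
// Assumes jsonParse returns null for invalid input.
return jsonParse(output) != null;
```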
This skill covers the full lifecycle: list/inspect/manage evaluation configs (Hog or LLM judge), run them on specific generations, query individual results, and get an AI-generated summary of pass/fail/N/A patterns across many runs.
| Tool | Purpose |
|---|---|
| `posthog:evaluations-get` | List/search evaluation configs (filter by name, enabled flag) |
| `posthog:evaluation-get` | Get a single evaluation config by UUID |
| `posthog:evaluation-create` | Create a new `llm_judge` or `hog` evaluation |
| `posthog:evaluation-update` | Update an existing evaluation (name, prompt, enabled, …) |
| `posthog:evaluation-delete` | Soft-delete an evaluation |
| `posthog:evaluation-run` | Run an evaluation against a specific `$ai_generation` event |
| `posthog:evaluation-test-hog` | Dry-run Hog source against recent generations (no save) |
| `posthog:llm-analytics-evaluation-summary-create` | AI-powered summary of pass/fail/N/A patterns across runs |
| `posthog:execute-sql` | Ad-hoc HogQL over `$ai_evaluation` events |
| `posthog:query-llm-trace` | Drill into the underlying generation that an evaluation scored |
The first seven `evaluation-*` tools are hand-coded; `llm-analytics-evaluation-summary-create` is generated from `products/llm_analytics/mcp/tools.yaml`.
Every run of an evaluation emits an `$ai_evaluation` event. Key properties:

| Property | Meaning |
|---|---|
| `$ai_evaluation_id` | UUID of the evaluation config |
| `$ai_evaluation_name` | Human-readable name |
| `$ai_target_event_id` | UUID of the `$ai_generation` event being scored |
| `$ai_trace_id` | Parent trace ID (for jumping to the trace UI) |
| `$ai_evaluation_result` | `true` = pass, `false` = fail |
| `$ai_evaluation_reasoning` | Free-text explanation (set by the LLM judge or Hog code) |
| `$ai_evaluation_applicable` | `false` when the evaluator decided the generation is N/A |
When `$ai_evaluation_applicable` is `false`, the run counts as N/A regardless of `$ai_evaluation_result`.
For evaluations that don't support N/A, this property may be `null` — treat `null` as "applicable".
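To eyeball pass/fail/N/A rates yourself via `posthog:execute-sql`, a HogQL query along these lines can work. It is a sketch over the schema above: the `properties.` dot-access syntax is standard HogQL, but whether the result properties compare as booleans or as strings may vary by ingestion path, so adjust the comparisons if needed.

```sql
SELECT
    properties.$ai_evaluation_name AS evaluation,
    countIf(properties.$ai_evaluation_applicable = false) AS n_a,
    countIf(properties.$ai_evaluation_result = true
            AND coalesce(properties.$ai_evaluation_applicable, true) = true) AS pass,
    countIf(properties.$ai_evaluation_result = false
            AND coalesce(properties.$ai_evaluation_applicable, true) = true) AS fail
FROM events
WHERE event = '$ai_evaluation'
  AND timestamp > now() - INTERVAL 7 DAY
GROUP BY evaluation
ORDER BY fail DESC
```

The `coalesce(..., true)` mirrors the rule above: a `null` applicability flag is treated as "applicable".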
Works the same way for `llm_judge` and `hog` evaluations — the differences only matter when you eventually go to fix the evaluator (edit the prompt vs. edit the Hog source).
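When debugging why a specific evaluation is failing, a drill-down query in the same vein (again a sketch; swap in the real evaluation UUID, which is deliberately left as a placeholder) pulls recent failures with their reasoning and trace IDs, which you can then follow up with `posthog:query-llm-trace`:

```sql
SELECT
    timestamp,
    properties.$ai_target_event_id AS generation_id,
    properties.$ai_trace_id AS trace_id,
    properties.$ai_evaluation_reasoning AS reasoning
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_id = '<evaluation-uuid>'  -- placeholder
  AND properties.$ai_evaluation_result = false
  AND coalesce(properties.$ai_evaluation_applicable, true) = true
ORDER BY timestamp DESC
LIMIT 20
```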