Name: Monitoring Ingestion Pipeline
Author: PostHog

Monitoring the ingestion pipeline with Grafana MCP

The ingestion pipeline (nodejs/) is PostHog's Node.js event processing layer. It consumes events from Kafka (produced by the capture service), runs them through processing steps (person resolution, group assignment, property overrides, etc.), and produces enriched events to ClickHouse-bound Kafka topics.

A single codebase is deployed as many K8s Deployments via the posthog-node Helm chart. Each deployment sets PLUGIN_SERVER_MODE and is distinguished in metrics by two default Prometheus labels:

ingestion_pipeline — values: general, heatmaps, client_warnings, errortracking
ingestion_lane — values: main, overflow, historical, async

The app label (set by K8s) matches the deployment name and is the most widely applicable scope filter across telemetry domains.

This skill teaches how to discover live metrics using the Grafana MCP tools rather than memorizing metric names that change as the code evolves.
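As a rough illustration of how the scope labels above combine, the sketch below builds a PromQL label selector from them. The helper name `scope_selector` is hypothetical and not part of the codebase; the label names and values are the ones listed above.

```python
# Label values listed above; extend these sets if discovery shows new ones.
PIPELINES = {"general", "heatmaps", "client_warnings", "errortracking"}
LANES = {"main", "overflow", "historical", "async"}


def scope_selector(app=None, pipeline=None, lane=None):
    """Build a PromQL label selector from the ingestion scope labels."""
    if pipeline is not None and pipeline not in PIPELINES:
        raise ValueError(f"unknown ingestion_pipeline: {pipeline}")
    if lane is not None and lane not in LANES:
        raise ValueError(f"unknown ingestion_lane: {lane}")
    labels = {"app": app, "ingestion_pipeline": pipeline, "ingestion_lane": lane}
    # Only labels that were actually supplied end up in the selector.
    parts = [f'{name}="{value}"' for name, value in labels.items() if value is not None]
    return "{" + ",".join(parts) + "}"


print(scope_selector(pipeline="general", lane="overflow"))
# prints {ingestion_pipeline="general",ingestion_lane="overflow"}
```

The selector can be appended to whatever metric name discovery returns, which keeps the query independent of any memorized metric name.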

Observability landscape

Domain	Datasource UID	Discovery tool	Scope filter
App metrics (VictoriaMetrics)	`victoriametrics`	`list_prometheus_metric_names`	See metric prefixes below
App metrics (realtime)	`victoriametrics-realtime`	same	same (lower retention, higher resolution)
Logs	`P44D702D3E93867EC` (Loki-logs)	`list_loki_label_names`	`app=~"ingestion-.*"`
Profiling	`pyroscope`	`list_pyroscope_profile_types`	See Pyroscope services below
CloudWatch (ElastiCache, MSK, RDS)	`P034F075C744B399F`	`query_prometheus`	env-specific cluster IDs
Dashboards	n/a	`search_dashboards`	query `"ingestion"` or deployment name
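The table above can be read as a lookup from telemetry domain to the first discovery call. A minimal sketch of that lookup follows; the short domain keys and the `discovery_step` helper are made up for illustration, and the returned dictionary is not the argument schema of any MCP tool.

```python
# Rows copied from the table above: domain -> (datasource UID, discovery tool).
# The short domain keys are hypothetical names chosen for this sketch.
DATASOURCES = {
    "metrics": ("victoriametrics", "list_prometheus_metric_names"),
    "metrics-realtime": ("victoriametrics-realtime", "list_prometheus_metric_names"),
    "logs": ("P44D702D3E93867EC", "list_loki_label_names"),
    "profiling": ("pyroscope", "list_pyroscope_profile_types"),
}

# Scope filter for logs, taken from the table.
LOKI_SCOPE = '{app=~"ingestion-.*"}'


def discovery_step(domain):
    """Return the discovery tool and datasource UID to start with for a domain."""
    uid, tool = DATASOURCES[domain]
    return {"tool": tool, "datasource_uid": uid}


print(discovery_step("logs"))
```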

Deployment roles

Deployment name	Mode	Lane	Consumer group	Consume topic pattern
`ingestion-events`	`ingestion-v2`	`main`	`ingestion-events`	`ingestion-events-{partitions}`
`ingestion-events-overflow`	`ingestion-v2`	`overflow`	`ingestion-events-overflow`	`ingestion-events-overflow-{partitions}`
`ingestion-events-historical`	`ingestion-v2`	`historical`	`ingestion-events-historical`	`ingestion-events-historical-{partitions}`
`ingestion-events-async`	`ingestion-v2`	`async`	`ingestion-events-async`	`events_plugin_ingestion_async`
`ingestion-client-warnings`	`ingestion-v2`	—	`ingestion-client-warnings`	`client_iwarnings_ingestion`
`ingestion-heatmaps`	`ingestion-v2`	—	`ingestion-heatmaps`	`heatmaps_ingestion`
`ingestion-general-turbo`	`ingestion-v2`	—	`ingestion-general-turbo`	`ingestion-general-turbo-{partitions}`
`ingestion-batch-imports`	`ingestion-v2`	—	`ingestion-batch-imports`	`ingestion-batch-imports`
`ingestion-logs`	`ingestion-logs`	—	`ingestion-logs`	`logs_ingestion`
`ingestion-errortracking-main`	`ingestion-errortracking`	—	`ingestion-errortracking`	`ingestion-errortracking-main-{partitions}`
`recordings-blob-ingestion-v2`	`recordings-blob-ingestion-v2`	—	`session-recordings-blob-v2`	`session_recording_snapshot_item_events`
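In the table above the consumer group usually equals the deployment name, with two exceptions. The sketch below encodes that rule and turns it into a consumer-lag query. The metric name `kminion_kafka_consumer_group_topic_lag` is an assumption and should be confirmed with `list_prometheus_metric_names` before use; the helper names are hypothetical.

```python
# Deployments whose consumer group differs from the deployment name (per the table above).
GROUP_OVERRIDES = {
    "ingestion-errortracking-main": "ingestion-errortracking",
    "recordings-blob-ingestion-v2": "session-recordings-blob-v2",
}

# Assumed KMinion metric name; verify it through metric discovery first.
LAG_METRIC = "kminion_kafka_consumer_group_topic_lag"


def consumer_group(deployment):
    """Map a deployment name to its Kafka consumer group."""
    return GROUP_OVERRIDES.get(deployment, deployment)


def lag_query(deployment):
    """Build a PromQL query summing lag per topic for a deployment's consumer group."""
    group = consumer_group(deployment)
    return f'sum by (topic_name) ({LAG_METRIC}{{group_id="{group}"}})'


print(lag_query("recordings-blob-ingestion-v2"))
```

Filtering on `group_id` rather than on topic avoids having to expand the `{partitions}` placeholder in the topic patterns.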

Metric prefixes

Prefix	Domain	Key scope labels
`ingestion_*`	Core ingestion app metrics (~80 metrics)	`app`, `ingestion_pipeline`, `ingestion_lane`
`consumed_batch_*`	Kafka consumer batch processing	`topic`, `groupId`
`consumer_batch_*` / `consumer_background_*`	Consumer loop health	`topic`, `groupId`
`kafka_broker_*`	librdkafka broker stats	`broker_id`, `broker_name`, `consumer_group`
`kafka_consumer_*`	Consumer rebalance, assignment	`groupId`, `type`
`events_pipeline_*`	Legacy pipeline step metrics	`step_name`
`person_*`	Person processing (~30 metrics)	`db_write_mode`, `operation`, `method`
`group_*` (non-AWS)	Group processing	`operation`
`personhog_*`	PersonHog gRPC client + service	`method`, `source`, `client`
`overflow_redirect_*`	Stateful overflow routing	`type`, `result`, `decision`, `operation`
`cookieless_*`	Cookieless mode	—
`http_request_duration_seconds`	HTTP health/readiness server	`method`, `route`, `status_code`
`recording_blob_ingestion_v2_*`	Session replay ingestion	`app`
`logs_ingestion_*`	Logs ingestion pipeline	`app`
`error_tracking_*` / `cymbal_*`	Error tracking pipeline	`app`
`kminion_kafka_*`	KMinion consumer group lag & topic offsets	`group_id`, `topic_name`, `partition_id`
`aws_msk_kafka_*`	MSK broker-side JMX metrics	`environment`
`warpstream_agent_*`	WarpStream agent metrics	varies
`kube_*` / `container_*`	K8s resources	`namespace="posthog"`, `container=~"ingestion-.*"`
`pg_*` / `pgbouncer_*`	Postgres exporter	varies
`ClickHouseMetrics_*` / `ClickHouseProfileEvents_*` / `ClickHouseAsyncMetrics_*`	ClickHouse cluster health	`type` (=cluster role)
`kafka_connect_*`	Kafka Connect bridge to ClickHouse	`namespace`, `connector`
`posthog_celery_clickhouse_*`	CH health monitors from Django celery	`scenario`
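One way to use the prefix table is to sort the names returned by metric discovery into domains. The sketch below does this for a subset of the prefixes. The helper names are hypothetical, and the sample metric names ending in `_example_total` are placeholders, not real metrics.

```python
# Subset of the prefix table above: prefix -> domain.
PREFIX_DOMAINS = {
    "ingestion_": "core ingestion",
    "consumed_batch_": "kafka consumer batches",
    "kafka_broker_": "librdkafka broker stats",
    "person_": "person processing",
    "personhog_": "PersonHog gRPC",
    "logs_ingestion_": "logs ingestion",
    "kminion_kafka_": "KMinion lag and offsets",
}


def domain_for(metric_name):
    """Return the domain of the longest matching prefix, or 'unknown'."""
    for prefix in sorted(PREFIX_DOMAINS, key=len, reverse=True):
        if metric_name.startswith(prefix):
            return PREFIX_DOMAINS[prefix]
    return "unknown"


def group_by_domain(names):
    """Group discovered metric names by domain, preserving input order."""
    grouped = {}
    for name in names:
        grouped.setdefault(domain_for(name), []).append(name)
    return grouped


# Placeholder names standing in for discovery output.
sample = ["person_example_total", "personhog_example_total", "http_request_duration_seconds"]
print(group_by_domain(sample))
```

Metrics that land under `unknown` are the ones worth a closer look, since they fall outside the prefixes documented here.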

Redis topology

Redis instance	ElastiCache cluster (prod-us)	Env var	Use
Ingestion Redis	`ingestion-prod-redis`	`INGESTION_REDIS_HOST`	Overflow state, pub/sub coordination
PostHog/Primary Redis	`posthog-solo`	`POSTHOG_REDIS_HOST`	Billing/quota, restrictions, general
Cookieless Redis	`cookieless-prod-redis`	`COOKIELESS_REDIS_HOST`	Cookieless server hash mode
CDP Redis	`cdp-delivery-prod-redis`	`CDP_REDIS_HOST`	CDP Hog function delivery
Dedup Redis	`ingestion-duplicates-prod-redis`	`DEDUPLICATION_REDIS_HOST`	Event deduplication

Postgres topology

DB	Aurora cluster (prod-us)	Ingestion PgBouncer
Main app DB	`posthog-cloud-prod-us-east-1` (2x `db.r8g.16xlarge`)	`ingestion-default-pgbouncer.posthog.svc.cluster.local`
Persons DB	`posthog-cloud-persons-prod-us-east-1` (3x `db.r8g.24xlarge`)	`ingestion-events-pgbouncer.posthog.svc.cluster.local`

Environment context

Observability landscape

Stable waypoints

Deployment roles

Metric prefixes

Redis topology

Kafka topology

Postgres topology

ClickHouse topology

Pyroscope services

Grafana dashboards

Discovery workflows

Prometheus / VictoriaMetrics

Loki (logs)

Pyroscope (profiling)

Dashboards

Redis / ElastiCache

Postgres / Aurora

ClickHouse

Key metric domains

Investigation playbooks

ClickHouse topology

`type` label	Role	Notes
`events`	Main analytics events cluster	Consumes `clickhouse_events_json`
`online`	Online/fast queries cluster	Replicated from events
`offline`	Offline/batch queries cluster	Replicated from events
`medium`	Medium-sized tables	Persons, groups
`small`	Small/config tables	Infrequent writes
`sessions`	Session replay data	Consumes session recording topics
`logs`	Logs cluster	Consumes logs topics
`logs-new-schema`	Logs new schema migration	Migration target
`ai-events`	AI/LLM events	Consumes AI events topics
`endpoints`	API endpoints cluster	Lightweight
`migrations`	Migration-specific	Schema changes
`aux` / `ops`	Auxiliary/operations	Maintenance
`batch-exports`	Batch exports	prod-us has this; may not exist in prod-eu
`test`	Testing cluster	May not exist in all envs

Pyroscope services

Service name	Deployment
`ingestion/ingestion-events`	Main analytics
`ingestion/ingestion-events-overflow`	Overflow lane
`ingestion/ingestion-events-historical`	Historical lane
`ingestion/ingestion-events-async`	Async lane
`ingestion/ingestion-heatmaps`	Heatmaps
`ingestion/ingestion-client-warnings`	Client warnings
`ingestion/ingestion-general-turbo`	General turbo (prod-us only)
`ingestion/ingestion-logs`	Logs ingestion
`ingestion/ingestion-batch-imports`	Batch imports
`ingestion-errortracking-main/ingestion-errortracking-main`	Error tracking
`recordings/recordings-blob-ingestion-v2`	Session replay
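The service names above follow a `prefix/deployment` pattern, with `ingestion` as the prefix except for error tracking and session replay. A minimal sketch of that mapping follows. The helper names are hypothetical, and the `service_name` label used in the selector is an assumption to check against the labels Pyroscope actually reports.

```python
# Deployments whose service prefix is not "ingestion" (per the table above).
PREFIX_OVERRIDES = {
    "ingestion-errortracking-main": "ingestion-errortracking-main",
    "recordings-blob-ingestion-v2": "recordings",
}


def pyroscope_service(deployment):
    """Map a deployment name to its Pyroscope service name."""
    prefix = PREFIX_OVERRIDES.get(deployment, "ingestion")
    return f"{prefix}/{deployment}"


def profile_selector(deployment):
    """Build a label selector for a deployment's profiles (label name is assumed)."""
    return '{service_name="%s"}' % pyroscope_service(deployment)


print(pyroscope_service("ingestion-events-overflow"))
# prints ingestion/ingestion-events-overflow
```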

Grafana dashboards

UID	Title	Focus
`ingestion-general`	Ingestion - General	Cross-service overview, E2E lag, topic flow
`ingestion-pipelines`	Ingestion - Pipelines	Per-lane pipeline step breakdown
`ingestion-pipeline-performance`	Ingestion - Pipeline Performance	Step latency, batch utilization
`ingestion-reliability`	Ingestion - Reliability	Error rates, DLQ, drop causes
`ingestion-autoscaling`	Ingestion - Autoscaling	HPA/KEDA scaling
`ingestion-person-processing`	Ingestion -- Person Processing	Person store, merge, cache
`ingestion-group-processing`	Ingestion -- Group Processing	Group store
`ingestion-session-recordings`	Session Replay -- Ingestion	Replay blob pipeline
`ingestion-capture`	Ingestion - Capture	Capture-specific ingestion metrics
`ceef2kuqw66tca`	Ingestion copy for warpstream	WarpStream-specific
`personhog-service`	Personhog service	PersonHog latency decomposition
`personhog-cdp-migration`	PersonHog CDP/NodeJS migration	PersonHog rollout
`dbfgkwxs3gw8owd`	KMinion Consumer Group Lag	Consumer lag by group (including CH groups)
`logs`	Logs (product)	Logs ingestion
`vm-clickhouse-cluster-overview`	ClickHouse (cluster overview)	QPS, memory, disk, replication, parts, merges
`8aa35a4a-091a-4645-ac8f-ae46901f0060`	ClickHouse Ingestion Layer - Resource Usage	K8s resources for `chi-ingestion-*` pods
`dafd3tvakk4t1cd`	ClickHouse - Data Inserted Per Table	Insert rates per table
`edvegyvt4u8sge`	ClickHouse - Query Metrics	Query performance
`clickhouse-keeper`	ClickHouse Keeper	ZooKeeper replacement health
`ef2loyheonm68a`	ClickHouse - table sizes and growth	Storage growth
`ef7h2todfg4xsd`	New ClickHouse Cluster Merge Overview	Merge throughput
`cdzv7o1635n9ca`	Kafka Connect	Kafka Connect tasks, lag, DuckLake sink
`ddpxkllwxg268e`	(ingestion vs past)	CH ingestion rate vs historical comparison
`deoz13wy08wsga`	ClickHouse - Disk capacity (EU ONLY)	EU-specific disk dashboard