Route and enrich pipeline failure alerts from any source (PagerDuty, Slack, email, Airflow callback, webhook) to PATCHIT for autonomous remediation. Normalizes alert payloads from different platforms into PATCHIT's standard event format, deduplicates, applies suppression rules, and decides whether to auto-remediate or escalate to on-call. Trigger for: 'forward this alert to patchit', 'set up alerting', 'configure webhook', 'auto-fix when PagerDuty fires', 'suppress flapping alerts'. Keywords: alert routing, webhook, PagerDuty, alerting, alert normalization, suppression, on-call, escalation.
Normalize, deduplicate, and route pipeline failure alerts from any source to PATCHIT — or to on-call when human judgment is required.
Event states: open | ack | in_remediation | resolved | suppressed

| Source | Integration | Fields extracted |
|---|---|---|
| Airflow callback | Built-in PATCHIT poller | dag_id, task_id, exception, log_url |
| Slack #alerts | Incoming webhook parser | pipeline name, error snippet |
| PagerDuty webhook | /api/webhook/pagerduty | incident title, service, body |
| AWS CloudWatch | SNS → PATCHIT webhook | AlarmName, MetricName, StateReason |
| Datadog | Webhook integration | @patchit-fix tag in alert body |
| Generic webhook | /api/ingest | Raw JSON passthrough |
POST /api/ingest # Generic PATCHIT event format
POST /api/webhook/pagerduty # PagerDuty v3 webhook format
POST /api/webhook/datadog # Datadog webhook format
POST /api/webhook/cloudwatch # AWS SNS / CloudWatch format
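A minimal sketch of posting a generic event to `/api/ingest`, using only the Python standard library. The base URL and the example field values are assumptions; the payload fields follow the normalized event format shown later in this doc.

```python
import json
import urllib.request

# Hypothetical event in PATCHIT's generic format; field names follow
# the normalized-event examples in this doc, values are illustrative.
event = {
    "event_id": "airflow-run-2024-06-01T02:00",
    "pipeline_id": "gold_revenue",
    "platform": "airflow",
    "log_text": "KeyError: 'revenue_usd'",
    "severity": "P1",
}

# Base URL is an assumption; substitute your PATCHIT host.
req = urllib.request.Request(
    "http://patchit.internal/api/ingest",
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually send
```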
Configure in PATCHIT UI → Settings → Alert routing:
{
"auto_remediate_severity": ["P1", "P2"],
"escalate_to_human_severity": ["P3", "P4"],
"suppression_rules": [
{
"pipeline_id_pattern": "test_.*",
"suppress": true,
"reason": "Test DAGs — ignore"
},
{
"pipeline_id_pattern": "daily_batch",
"maintenance_window": "01:00-04:00 UTC",
"suppress": true
}
],
"dedup_window_minutes": 5,
"critical_pipelines": ["gold_revenue", "dim_products", "stg_customers"]
}
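A sketch of how the `suppression_rules` above could be evaluated. The helper name and matching semantics (full-pattern regex match, window check in UTC) are assumptions for illustration, not PATCHIT's actual implementation.

```python
import re
from datetime import datetime, timezone

# Illustrative evaluator for suppression_rules: a rule applies when its
# pipeline_id_pattern matches and (if present) the maintenance window
# contains the current UTC time.
def is_suppressed(pipeline_id: str, rules: list, now: datetime) -> bool:
    for rule in rules:
        if not re.fullmatch(rule["pipeline_id_pattern"], pipeline_id):
            continue
        window = rule.get("maintenance_window")
        if window:
            # "01:00-04:00 UTC" -> suppress only inside the window
            start_s, end_s = window.replace(" UTC", "").split("-")
            start = datetime.strptime(start_s, "%H:%M").time()
            end = datetime.strptime(end_s, "%H:%M").time()
            if not (start <= now.astimezone(timezone.utc).time() <= end):
                continue
        if rule.get("suppress"):
            return True
    return False

rules = [
    {"pipeline_id_pattern": "test_.*", "suppress": True},
    {"pipeline_id_pattern": "daily_batch",
     "maintenance_window": "01:00-04:00 UTC", "suppress": True},
]
```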
PATCHIT translates heterogeneous alert formats into a standard event:
# PagerDuty webhook → PATCHIT event
{
"event_id": incident["id"],
"pipeline_id": incident["service"]["name"],
"platform": _detect_platform(incident["body"]["details"]),
"log_text": incident["body"]["details"]["error"] or incident["title"],
"severity": incident["urgency"], # high → P1, low → P2
}
# Datadog webhook → PATCHIT event
{
"event_id": alert["id"],
"pipeline_id": alert["tags"].get("pipeline"),
"platform": alert["tags"].get("platform", "unknown"),
"log_text": alert["body"],
}
# AWS CloudWatch (SNS) → PATCHIT event
{
"event_id": message["AlarmArn"],
"pipeline_id": message["AlarmName"],
"platform": "glue" if "glue" in message["AlarmName"].lower() else "aws",
"log_text": message["StateChangeTime"] + "\n" + message["StateReason"],
}
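The CloudWatch mapping above can be sketched as a function. The function name and the sample SNS message values are assumptions; the field mapping itself mirrors the example.

```python
# Illustrative normalizer for the CloudWatch/SNS shape shown above;
# a sketch, not PATCHIT's internal code.
def normalize_cloudwatch(message: dict) -> dict:
    return {
        "event_id": message["AlarmArn"],
        "pipeline_id": message["AlarmName"],
        "platform": "glue" if "glue" in message["AlarmName"].lower() else "aws",
        "log_text": message["StateChangeTime"] + "\n" + message["StateReason"],
    }

# Hypothetical SNS-delivered alarm payload for illustration.
sns_message = {
    "AlarmArn": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:glue-job-failed",
    "AlarmName": "glue-job-failed",
    "StateChangeTime": "2024-06-01T02:14:00Z",
    "StateReason": "Threshold Crossed: FailedJobs >= 1",
}
event = normalize_cloudwatch(sns_message)
```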
Alerts with the same pipeline_id + similar log_text fingerprint within the dedup window are deduplicated:
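One way to fingerprint "similar" log_text (an assumption; the doc does not specify PATCHIT's exact similarity measure) is to strip volatile tokens such as timestamps and ids before hashing, so retries of the same failure collide:

```python
import hashlib
import re

# Sketch of a dedup fingerprint: numbers and long hex ids are replaced
# with a placeholder so repeated occurrences of the same error hash
# identically across runs.
def fingerprint(pipeline_id: str, log_text: str) -> str:
    normalized = re.sub(r"[0-9a-f]{8,}|\d+", "<n>", log_text.lower())
    return hashlib.sha256(f"{pipeline_id}:{normalized}".encode()).hexdigest()[:16]
```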
Duplicates are dropped and logged with reason: duplicate.

| Condition | Action |
|---|---|
| error_category = security_access | Always escalate; never auto-fix |
| confidence < 0.5 after 3 attempts | Page on-call with full RCA |
| critical_pipeline tag set | Parallel: auto-fix + page on-call |
| maintenance_window active | Suppress all alerts for pipeline |
| flap_count > 5 in 1h | Suppress + page on-call for flap investigation |
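The decision table above can be sketched as ordered rules. Field names on the alert dict are assumptions for illustration; suppression conditions are checked first, then the escalation conditions.

```python
# Illustrative routing decision; not PATCHIT's internal code.
# Suppression wins first, then hard escalation rules, then the
# critical-pipeline parallel path, then default auto-remediation.
def decide(alert: dict) -> str:
    if alert.get("maintenance_window_active"):
        return "suppress"
    if alert.get("flap_count", 0) > 5:
        return "suppress_and_page"
    if alert.get("error_category") == "security_access":
        return "escalate"
    if alert.get("attempts", 0) >= 3 and alert.get("confidence", 1.0) < 0.5:
        return "page_oncall_with_rca"
    if alert.get("critical_pipeline"):
        return "autofix_and_page"
    return "auto_remediate"
```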
PATCHIT_PAGERDUTY_ROUTING_KEY=r_... # PagerDuty Events API v2
PATCHIT_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
PATCHIT_ONCALL_CHANNEL=#data-oncall
When PATCHIT escalates, it sends:
🚨 PATCHIT Escalation: `gold_revenue` requires human review
Platform: AWS Glue
Severity: P1
Confidence: 0.42 (below threshold)
Attempts: 3
RCA Summary: KeyError on column `revenue_usd` — possible upstream schema change.
Suggested next step: Check DMS replication for source table `orders.revenue`.
→ Full report: http://patchit.internal/reports/abc123
→ Audit trail: http://patchit.internal/ui/audit/abc123
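The escalation message above could be rendered like this. The function name, input field names, and the internal URL scheme (taken from the example links) are assumptions; this is a sketch, not PATCHIT's actual formatter.

```python
# Illustrative renderer for the escalation message; field names are
# hypothetical, URLs follow the example report/audit links above.
def render_escalation(e: dict) -> str:
    return (
        f"🚨 PATCHIT Escalation: `{e['pipeline_id']}` requires human review\n"
        f"Platform: {e['platform']}\n"
        f"Severity: {e['severity']}\n"
        f"Confidence: {e['confidence']:.2f} (below threshold)\n"
        f"Attempts: {e['attempts']}\n"
        f"RCA Summary: {e['rca_summary']}\n"
        f"→ Full report: http://patchit.internal/reports/{e['report_id']}\n"
        f"→ Audit trail: http://patchit.internal/ui/audit/{e['report_id']}"
    )

msg = render_escalation({
    "pipeline_id": "gold_revenue", "platform": "AWS Glue",
    "severity": "P1", "confidence": 0.42, "attempts": 3,
    "rca_summary": "KeyError on column revenue_usd",
    "report_id": "abc123",
})
```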
All suppressed alerts are logged to var/audit/suppressed.jsonl with: