Use this skill when performing root cause analysis on incidents in battery energy storage systems (BESS) and industrial control systems. Triggers include: any mention of 'root cause', 'RCA', '8D', 'incident analysis', 'fault analysis', 'error investigation', 'correlated events', 'anomaly detection', or when a user provides an Array ID, timestamp, and error data point for investigation. This skill follows the 8D (Eight Disciplines) methodology adapted for data-driven analysis of control system incidents. It integrates with AWS Athena for querying telemetry, metadata, and configuration data. No source code analysis is performed — all investigation is conducted through operational data, device metadata, and optionally XML configuration files that describe device relationships and system topology.
This skill guides an AI agent through a structured 8D root cause analysis process for incidents occurring in battery energy storage systems (BESS) and related industrial control systems. The agent operates exclusively on operational data — no source code is examined. Investigation is conducted by querying time-series telemetry, alarm/event logs, device metadata, and system configuration data available through AWS Athena.
The agent's objective is to systematically identify what happened, find correlated events across the system, isolate contributing factors, and determine the root cause of an incident. Over time, the agent builds institutional knowledge from resolved incidents to improve future investigations.
┌─────────────────────────────────────────────────────────┐
│                       8D RCA Agent                      │
│  (This Skill — orchestrates the investigation process)  │
├─────────────────────────────────────────────────────────┤
│                        AWS Agent                        │
│  (Data retrieval layer — queries Athena, S3, metadata)  │
├─────────────────────────────────────────────────────────┤
│                  AWS Athena / Data Lake                 │
│ ┌───────────┐ ┌──────────┐ ┌──────────┐ ┌─────────────┐ │
│ │ Telemetry │ │ Alarms & │ │  Device  │ │Configuration│ │
│ │Time-Series│ │  Events  │ │ Metadata │ │ (XML/JSON)  │ │
│ └───────────┘ └──────────┘ └──────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
When initiating an investigation, the user must provide:
| Parameter | Description | Example |
|---|---|---|
| array_id | The Array or site identifier where the incident occurred | ARRAY-US-TX-042 |
| timestamp | The timestamp of the error/incident data point | 2025-02-14T14:32:17Z |
| error_datapoint | The specific data point or alarm that triggered the investigation | BMS_CELL_OVERVOLT_RACK03_MOD12 |
| severity | Optional — incident severity (critical/major/minor) | critical |
| context | Optional — any known context or user observations | Free text |
Objective: Define the investigation scope and gather initial context before querying data.
Actions:
Key Questions to Answer:
Athena Query Pattern — Device Discovery:
SELECT device_id, device_type, device_name, parent_device_id, rack_id, module_id
FROM device_metadata
WHERE array_id = '{array_id}'
ORDER BY device_type, device_id;
Athena Query Pattern — Historical Occurrence:
SELECT timestamp, value, quality, status
FROM telemetry
WHERE array_id = '{array_id}'
AND datapoint_name = '{error_datapoint}'
AND timestamp BETWEEN date_add('day', -90, TIMESTAMP '{timestamp}')
AND TIMESTAMP '{timestamp}'
ORDER BY timestamp DESC
LIMIT 100;
Objective: Identify the domains of expertise needed based on the incident type.
Actions:
Agent Note: The AI agent serves as the initial investigator and data analyst. It should clearly flag when human expertise is required for domain-specific interpretation, especially for:
Objective: Build a comprehensive, data-driven description of the incident using the "IS / IS NOT" framework.
Actions:
-- Get the exact error event and surrounding data
SELECT timestamp, datapoint_name, value, quality, unit
FROM telemetry
WHERE array_id = '{array_id}'
AND datapoint_name = '{error_datapoint}'
AND timestamp BETWEEN date_add('minute', -15, TIMESTAMP '{timestamp}')
AND date_add('minute', 15, TIMESTAMP '{timestamp}')
ORDER BY timestamp ASC;
-- Get everything that happened at this array in the time window
SELECT timestamp, device_id, datapoint_name, value, quality, unit
FROM telemetry
WHERE array_id = '{array_id}'
AND timestamp BETWEEN date_add('minute', -15, TIMESTAMP '{timestamp}')
AND date_add('minute', 15, TIMESTAMP '{timestamp}')
ORDER BY timestamp ASC;
SELECT timestamp, alarm_id, alarm_name, severity, device_id, state, description
FROM alarms_events
WHERE array_id = '{array_id}'
AND timestamp BETWEEN date_add('minute', -15, TIMESTAMP '{timestamp}')
AND date_add('minute', 15, TIMESTAMP '{timestamp}')
ORDER BY timestamp ASC;
| Dimension | IS | IS NOT |
|---|---|---|
| WHAT | Which data point(s) are in error state | Which similar data points are normal |
| WHERE | Which device, rack, module, cell | Which adjacent devices are unaffected |
| WHEN | Exact timestamp and duration | When it last operated normally |
| EXTENT | How many data points affected | What is the boundary of the impact |
-- What does "normal" look like for this data point?
SELECT
AVG(CAST(value AS DOUBLE)) as avg_val,
MIN(CAST(value AS DOUBLE)) as min_val,
MAX(CAST(value AS DOUBLE)) as max_val,
STDDEV(CAST(value AS DOUBLE)) as std_dev,
COUNT(*) as sample_count
FROM telemetry
WHERE array_id = '{array_id}'
AND datapoint_name = '{error_datapoint}'
AND timestamp BETWEEN date_add('day', -7, TIMESTAMP '{timestamp}')
AND date_add('minute', -30, TIMESTAMP '{timestamp}')
AND quality = 'GOOD';
Objective: Identify what protective actions were or should be taken.
Actions:
-- Parentheses are required: without them, OR binds looser than AND
-- and the time filter would apply to only one of the LIKE branches.
SELECT timestamp, datapoint_name, value
FROM telemetry
WHERE array_id = '{array_id}'
AND (datapoint_name LIKE '%PROTECTION%' OR datapoint_name LIKE '%TRIP%'
     OR datapoint_name LIKE '%LIMIT%' OR datapoint_name LIKE '%SHUTDOWN%')
AND timestamp BETWEEN date_add('minute', -5, TIMESTAMP '{timestamp}')
AND date_add('minute', 30, TIMESTAMP '{timestamp}')
ORDER BY timestamp ASC;
Agent Note: The agent should ALWAYS flag safety-critical containment needs prominently. Never downplay a potential safety issue. When in doubt, recommend the more conservative action.
Objective: This is the core analytical phase. Systematically identify all changes and anomalies in the time window, correlate them, and isolate the root cause.
Phase 4A: Anomaly Detection — Find Everything That Changed
For each data point active at the array during the time window, compare against baseline:
-- Identify data points that deviated from normal during the incident window
WITH baseline AS (
SELECT
datapoint_name,
AVG(CAST(value AS DOUBLE)) as baseline_avg,
STDDEV(CAST(value AS DOUBLE)) as baseline_std
FROM telemetry
WHERE array_id = '{array_id}'
AND timestamp BETWEEN date_add('day', -7, TIMESTAMP '{timestamp}')
AND date_add('hour', -1, TIMESTAMP '{timestamp}')
AND quality = 'GOOD'
GROUP BY datapoint_name
),
incident_data AS (
SELECT
datapoint_name,
timestamp,
CAST(value AS DOUBLE) as val
FROM telemetry
WHERE array_id = '{array_id}'
AND timestamp BETWEEN date_add('minute', -{window_minutes}, TIMESTAMP '{timestamp}')
AND date_add('minute', {window_minutes}, TIMESTAMP '{timestamp}')
)
SELECT
i.datapoint_name,
i.timestamp,
i.val as incident_value,
b.baseline_avg,
b.baseline_std,
ABS(i.val - b.baseline_avg) / NULLIF(b.baseline_std, 0) as z_score
FROM incident_data i
JOIN baseline b ON i.datapoint_name = b.datapoint_name
WHERE ABS(i.val - b.baseline_avg) / NULLIF(b.baseline_std, 0) > 2.0
ORDER BY i.timestamp ASC;
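The same z-score screening can also be applied client-side to rows the AWS agent has already returned. A minimal sketch in Python (the function name and tuple shapes are illustrative, not part of the skill):

```python
from statistics import mean, stdev

def flag_anomalies(baseline_values, incident_rows, z_threshold=2.0):
    """Flag incident-window samples deviating > z_threshold sigma from baseline.

    baseline_values: floats from the pre-incident baseline window
    incident_rows: (timestamp, value) tuples from the incident window
    """
    avg = mean(baseline_values)
    std = stdev(baseline_values)
    if std == 0:
        return []  # constant signal: z-score undefined (mirrors NULLIF in the SQL)
    return [
        (ts, val, (val - avg) / std)
        for ts, val in incident_rows
        if abs(val - avg) / std > z_threshold
    ]
```

This mirrors the query's `ABS(i.val - b.baseline_avg) / NULLIF(b.baseline_std, 0) > 2.0` condition, which is useful when iterating on the threshold without re-querying Athena.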
Phase 4B: Temporal Correlation — Build the Event Timeline
Assemble all anomalous events in strict chronological order:
TIMESTAMP            | DEVICE         | DATA POINT              | VALUE    | DEVIATION
─────────────────────────────────────────────────────────────────────────────────────────
2025-02-14T14:27:03Z | HVAC_01        | COOLANT_FLOW_RATE       | 12.3 LPM | -2.8σ (low)
2025-02-14T14:28:45Z | RACK_03        | RACK_INLET_TEMP         | 34.2°C   | +2.1σ (high)
2025-02-14T14:30:11Z | RACK_03_MOD_12 | CELL_TEMP_MAX           | 38.7°C   | +3.4σ (high)
2025-02-14T14:31:58Z | RACK_03_MOD_12 | CELL_VOLTAGE_MAX        | 3.72V    | +2.9σ (high)
2025-02-14T14:32:17Z | BMS_RACK_03    | BMS_CELL_OVERVOLT_ALARM | ACTIVE   | *** INCIDENT ***
Phase 4C: Causal Chain Analysis
Using the timeline, work backwards from the incident to identify the causal chain:
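The backward walk can be sketched as a simple ordering of anomalies that precede the incident (illustrative Python; temporal precedence is necessary but not sufficient for causation, so each link still needs a plausible physical mechanism):

```python
def causal_chain(anomalies, incident_ts):
    """Order anomalies preceding the incident, oldest first.

    anomalies: (timestamp, device, datapoint) tuples with ISO-8601 'Z'
    timestamps, which sort lexicographically in chronological order.
    """
    prior = [a for a in anomalies if a[0] < incident_ts]
    return sorted(prior, key=lambda a: a[0])
```

Applied to the example timeline above, this puts the coolant flow deviation first, suggesting the chain flow → inlet temp → cell temp → cell voltage → alarm.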
Phase 4D: Device Relationship Analysis
If an XML configuration file is provided, use it to understand:
Parse XML to extract:
- <Device id="..." type="..." parent="...">
- <Connection source="..." target="..." type="...">
- <ProtectionZone devices="..." trip_action="...">
Phase 4E: Root Cause Determination
Apply the "5 Whys" framework using data:
| Why # | Question | Data-Driven Answer |
|---|---|---|
| 1 | Why did the overvoltage alarm trigger? | Cell voltage exceeded 3.70V threshold |
| 2 | Why did cell voltage rise above threshold? | Cell temperature was elevated (+3.4σ) |
| 3 | Why was cell temperature elevated? | Rack inlet temperature was high (+2.1σ) |
| 4 | Why was rack inlet temperature high? | Coolant flow rate dropped (-2.8σ) |
| 5 | Why did coolant flow rate drop? | [Requires further investigation — pump data, valve position, coolant level] |
Classification of Root Cause:
Objective: Based on the confirmed or probable root cause, define corrective actions.
Actions:
-- Check if the root cause condition exists at other arrays
SELECT array_id, COUNT(*) as occurrence_count,
MAX(timestamp) as most_recent
FROM telemetry
WHERE datapoint_name = '{root_cause_datapoint}'
AND {root_cause_condition}
AND timestamp > date_add('day', -30, NOW())
GROUP BY array_id
ORDER BY occurrence_count DESC;
Objective: After corrective actions are implemented, verify they resolved the issue.
Actions:
SELECT timestamp, datapoint_name, value
FROM telemetry
WHERE array_id = '{array_id}'
AND datapoint_name IN ('{affected_datapoints}')
AND timestamp > TIMESTAMP '{correction_timestamp}'
ORDER BY timestamp ASC;
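One way to judge the returned post-correction samples is to check that they fall back inside the pre-incident baseline envelope. A sketch, assuming a ±2σ acceptance band (real acceptance limits should come from the site's operating envelope, not this heuristic):

```python
from statistics import mean, stdev

def correction_verified(baseline_values, post_correction_values, n_sigma=2.0):
    """True if every post-correction sample lies within
    baseline mean +/- n_sigma * baseline stddev."""
    avg = mean(baseline_values)
    std = stdev(baseline_values)
    lo, hi = avg - n_sigma * std, avg + n_sigma * std
    return all(lo <= v <= hi for v in post_correction_values)
```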
Objective: Prevent recurrence across the fleet.
Actions:
-- Create a predictive query: Did the root cause condition appear before the incident?
-- If so, how far in advance? This becomes an early warning indicator.
SELECT timestamp, value,
date_diff('minute', timestamp, TIMESTAMP '{incident_timestamp}') as minutes_before_incident
FROM telemetry
WHERE array_id = '{array_id}'
AND datapoint_name = '{root_cause_datapoint}'
AND timestamp BETWEEN date_add('hour', -24, TIMESTAMP '{incident_timestamp}')
AND TIMESTAMP '{incident_timestamp}'
AND {abnormal_condition}
ORDER BY timestamp ASC;
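From the query results, the lead time of the early warning indicator can be computed as the gap between the earliest abnormal sample and the incident. A minimal sketch (helper name is illustrative):

```python
from datetime import datetime, timezone

def lead_time_minutes(abnormal_timestamps, incident_ts):
    """Minutes of warning between the earliest abnormal sample and the incident.

    Timestamps are ISO-8601 'Z' strings as returned by the queries above.
    Returns None when no abnormal sample precedes the incident.
    """
    def parse(ts):
        return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

    incident = parse(incident_ts)
    prior = [parse(ts) for ts in abnormal_timestamps if parse(ts) < incident]
    if not prior:
        return None
    return int((incident - min(prior)).total_seconds() // 60)
```

In the worked example, the coolant flow deviation at 14:27:03Z gives roughly 5 minutes of warning before the 14:32:17Z alarm.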
Objective: Close the investigation with complete documentation.
Deliverables:
The agent improves over time by maintaining a knowledge base of resolved incidents.
Each resolved incident adds an entry:
{
"incident_id": "INC-2025-0214-001",
"array_id": "ARRAY-US-TX-042",
"timestamp": "2025-02-14T14:32:17Z",
"error_datapoint": "BMS_CELL_OVERVOLT_RACK03_MOD12",
"root_cause_category": "equipment_failure",
"root_cause_summary": "HVAC coolant pump degradation caused reduced flow, leading to thermal runaway in Rack 03",
"causal_chain": [
"coolant_pump_degradation",
"reduced_coolant_flow",
"elevated_rack_inlet_temperature",
"elevated_cell_temperature",
"cell_voltage_rise",
"overvoltage_alarm"
],
"early_warning_indicators": [
{
"datapoint": "COOLANT_FLOW_RATE",
"condition": "value < baseline - 2*stddev",
"lead_time_minutes": 5
}
],
"corrective_actions": ["pump_replacement", "flow_threshold_update"],
"confidence": "confirmed",
"resolution_date": "2025-02-15",
"similar_incidents": ["INC-2024-0918-003", "INC-2024-1201-007"]
}
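A new incident can be matched against these records by overlap with past causal chains. A sketch with an illustrative scoring scheme (+2 for an identical error datapoint, +1 per shared causal-chain step; the weights are assumptions, not part of the skill):

```python
def rank_similar_incidents(new_error_datapoint, candidate_chain, knowledge_base):
    """Rank past incidents by overlap with the current hypothesis.

    knowledge_base: list of entries shaped like the JSON record above.
    Returns (score, incident_id) pairs, best match first.
    """
    scored = []
    for entry in knowledge_base:
        score = 2 if entry["error_datapoint"] == new_error_datapoint else 0
        score += len(set(candidate_chain) & set(entry["causal_chain"]))
        if score > 0:
            scored.append((score, entry["incident_id"]))
    return sorted(scored, reverse=True)
```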
When investigating a new incident, the agent should:
The agent should track and report its confidence using this framework:
| Confidence Level | Criteria | Action |
|---|---|---|
| Confirmed | Complete data-supported causal chain, verified by correction | Close and document |
| High (>80%) | Strong data support, consistent with known patterns | Recommend corrective action |
| Medium (50-80%) | Partial data support, plausible but gaps exist | Request additional data/review |
| Low (<50%) | Limited data, multiple competing hypotheses | Escalate to human SME |
| Inconclusive | Insufficient data to determine root cause | Document findings, request instrumentation improvement |
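The ladder above maps naturally onto a small classifier. A sketch, assuming the agent produces a 0–1 evidence-support score (how that score is computed is outside this table; only the thresholds come from it):

```python
def confidence_level(support_score, verified=False):
    """Map an evidence-support score in [0, 1] to the confidence ladder."""
    if verified:
        return "Confirmed"   # complete causal chain, verified by correction
    if support_score > 0.8:
        return "High"
    if support_score >= 0.5:
        return "Medium"
    if support_score > 0.0:
        return "Low"
    return "Inconclusive"
```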
When the user provides an XML configuration file, extract and use the following:
<!-- Example structure — actual schema will vary -->
<System id="ARRAY-US-TX-042">
<Rack id="RACK_03">
<Module id="MOD_12">
<Cell id="CELL_01" ... />
</Module>
<BMS id="BMS_RACK_03" monitors="RACK_03" />
</Rack>
<PCS id="PCS_01" connected_racks="RACK_01,RACK_02,RACK_03" />
<HVAC id="HVAC_01" cooling_zones="RACK_01,RACK_02,RACK_03,RACK_04" />
</System>
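As a sketch, the example topology above can be parsed with the standard library to map each rack to the cooling resources it shares (element and attribute names follow the example schema and will vary per site):

```python
import xml.etree.ElementTree as ET

def shared_cooling_zones(xml_text):
    """Map each rack to the HVAC units that cool it, so correlated symptoms
    across racks can be traced back to a shared resource."""
    root = ET.fromstring(xml_text)
    rack_to_hvac = {}
    for hvac in root.iter("HVAC"):
        for rack in hvac.get("cooling_zones", "").split(","):
            rack_to_hvac.setdefault(rack, []).append(hvac.get("id"))
    return rack_to_hvac
```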
A failure in a shared resource (e.g., a cooling loop serving 4 racks) can cause correlated symptoms across all devices sharing that resource. The configuration file helps the agent distinguish between:
The 8D RCA agent does not query Athena directly. It formulates the analytical questions and query patterns, then delegates data retrieval to the AWS agent.
Workflow:
Data Request Format:
{
"request_type": "telemetry_query",
"purpose": "D4 - Anomaly detection in ±15min window",
"array_id": "ARRAY-US-TX-042",
"time_range": {
"start": "2025-02-14T14:17:17Z",
"end": "2025-02-14T14:47:17Z"
},
"filters": {
"device_ids": ["RACK_03", "HVAC_01"],
"datapoint_pattern": "*"
},
"expected_output": "All telemetry data points for specified devices in time range"
}
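A request in the format above can be assembled programmatically, with the time window derived from the incident timestamp. A sketch (the helper and its defaults are illustrative; the field names mirror the example):

```python
from datetime import datetime, timedelta, timezone

def build_telemetry_request(array_id, incident_ts, window_minutes=15,
                            device_ids=None, purpose=""):
    """Build a telemetry data request for the AWS agent,
    spanning +/- window_minutes around the incident timestamp."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    t = datetime.strptime(incident_ts, fmt).replace(tzinfo=timezone.utc)
    return {
        "request_type": "telemetry_query",
        "purpose": purpose,
        "array_id": array_id,
        "time_range": {
            "start": (t - timedelta(minutes=window_minutes)).strftime(fmt),
            "end": (t + timedelta(minutes=window_minutes)).strftime(fmt),
        },
        "filters": {"device_ids": device_ids or [], "datapoint_pattern": "*"},
        "expected_output": "All telemetry data points for specified devices in time range",
    }
```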
The agent should: