Use this skill when performing root cause analysis on incidents in battery energy storage systems (BESS) and industrial control systems. Triggers include: any mention of 'root cause', 'RCA', '8D', 'incident analysis', 'fault analysis', 'error investigation', 'correlated events', 'anomaly detection', or when a user provides an Array ID, timestamp, and error data point for investigation. This skill follows the 8D (Eight Disciplines) methodology adapted for data-driven analysis of control system incidents. It integrates with AWS Athena for querying telemetry, metadata, and configuration data. No source code analysis is performed — all investigation is conducted through operational data, device metadata, and optionally XML configuration files that describe device relationships and system topology.
This skill guides an AI agent through a structured 8D root cause analysis process for incidents occurring in battery energy storage systems (BESS) and related industrial control systems. The agent operates exclusively on operational data — no source code is examined. Investigation is conducted by querying time-series telemetry, alarm/event logs, device metadata, and system configuration data available through AWS Athena.
The agent's objective is to systematically identify what happened, find correlated events across the system, isolate contributing factors, and determine the root cause of an incident. Over time, the agent builds institutional knowledge from resolved incidents to improve future investigations.
┌─────────────────────────────────────────────────────────┐
│                       8D RCA Agent                      │
│  (This Skill — orchestrates the investigation process)  │
├─────────────────────────────────────────────────────────┤
│                        AWS Agent                        │
│  (Data retrieval layer — queries Athena, S3, metadata)  │
├─────────────────────────────────────────────────────────┤
│                  AWS Athena / Data Lake                 │
│ ┌───────────┐ ┌──────────┐ ┌──────────┐ ┌─────────────┐ │
│ │ Telemetry │ │ Alarms & │ │  Device  │ │Configuration│ │
│ │Time-Series│ │  Events  │ │ Metadata │ │ (XML/JSON)  │ │
│ └───────────┘ └──────────┘ └──────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
When initiating an investigation, the user must provide:
| Parameter | Description | Example |
|---|---|---|
| array_id | The Array or site identifier where the incident occurred | ARRAY-US-TX-042 |
| timestamp | The timestamp of the error/incident data point | 2025-02-14T14:32:17Z |
| error_datapoint | The specific data point or alarm that triggered the investigation | BMS_CELL_OVERVOLT_RACK03_MOD12 |
| severity | Optional — incident severity (critical/major/minor) | critical |
| context | Optional — any known context or user observations | Free text |
Objective: Define the investigation scope and gather initial context before querying data.
Actions:
Key Questions to Answer:
Athena Query Pattern — Device Discovery:
SELECT device_id, device_type, device_name, parent_device_id, rack_id, module_id
FROM device_metadata
WHERE array_id = '{array_id}'
ORDER BY device_type, device_id;
Athena Query Pattern — Historical Occurrence:
SELECT timestamp, value, quality, status
FROM telemetry
WHERE array_id = '{array_id}'
AND datapoint_name = '{error_datapoint}'
AND timestamp BETWEEN date_add('day', -90, TIMESTAMP '{timestamp}')
AND TIMESTAMP '{timestamp}'
ORDER BY timestamp DESC
LIMIT 100;
Objective: Identify the domains of expertise needed based on the incident type.
Actions:
Agent Note: The AI agent serves as the initial investigator and data analyst. It should clearly flag when human expertise is required for domain-specific interpretation, especially for:
Objective: Build a comprehensive, data-driven description of the incident using the "IS / IS NOT" framework.
Actions:
-- Get the exact error event and surrounding data
SELECT timestamp, datapoint_name, value, quality, unit
FROM telemetry
WHERE array_id = '{array_id}'
AND datapoint_name = '{error_datapoint}'
AND timestamp BETWEEN date_add('minute', -15, TIMESTAMP '{timestamp}')
AND date_add('minute', 15, TIMESTAMP '{timestamp}')
ORDER BY timestamp ASC;
-- Get everything that happened at this array in the time window
SELECT timestamp, device_id, datapoint_name, value, quality, unit
FROM telemetry
WHERE array_id = '{array_id}'
AND timestamp BETWEEN date_add('minute', -15, TIMESTAMP '{timestamp}')
AND date_add('minute', 15, TIMESTAMP '{timestamp}')
ORDER BY timestamp ASC;
SELECT timestamp, alarm_id, alarm_name, severity, device_id, state, description
FROM alarms_events
WHERE array_id = '{array_id}'
AND timestamp BETWEEN date_add('minute', -15, TIMESTAMP '{timestamp}')
AND date_add('minute', 15, TIMESTAMP '{timestamp}')
ORDER BY timestamp ASC;
| Dimension | IS | IS NOT |
|---|---|---|
| WHAT | Which data point(s) are in error state | Which similar data points are normal |
| WHERE | Which device, rack, module, cell | Which adjacent devices are unaffected |
| WHEN | Exact timestamp and duration | When it last operated normally |
| EXTENT | How many data points affected | What is the boundary of the impact |
-- What does "normal" look like for this data point?
SELECT
AVG(CAST(value AS DOUBLE)) as avg_val,
MIN(CAST(value AS DOUBLE)) as min_val,
MAX(CAST(value AS DOUBLE)) as max_val,
STDDEV(CAST(value AS DOUBLE)) as std_dev,
COUNT(*) as sample_count
FROM telemetry
WHERE array_id = '{array_id}'
AND datapoint_name = '{error_datapoint}'
AND timestamp BETWEEN date_add('day', -7, TIMESTAMP '{timestamp}')
AND date_add('minute', -30, TIMESTAMP '{timestamp}')
AND quality = 'GOOD';
Objective: Identify what protective actions were or should be taken.
Actions:
-- Parentheses are required: without them, OR binds looser than AND
-- and the time filter would apply to only one of the LIKE branches.
SELECT timestamp, datapoint_name, value
FROM telemetry
WHERE array_id = '{array_id}'
AND (datapoint_name LIKE '%PROTECTION%' OR datapoint_name LIKE '%TRIP%'
     OR datapoint_name LIKE '%LIMIT%' OR datapoint_name LIKE '%SHUTDOWN%')
AND timestamp BETWEEN date_add('minute', -5, TIMESTAMP '{timestamp}')
AND date_add('minute', 30, TIMESTAMP '{timestamp}')
ORDER BY timestamp ASC;
Agent Note: The agent should ALWAYS flag safety-critical containment needs prominently. Never downplay a potential safety issue. When in doubt, recommend the more conservative action.
Objective: This is the core analytical phase. Systematically identify all changes and anomalies in the time window, correlate them, and isolate the root cause.
Phase 4A: Anomaly Detection — Find Everything That Changed
For each data point active at the array during the time window, compare against baseline:
-- Identify data points that deviated from normal during the incident window
WITH baseline AS (
SELECT
datapoint_name,
AVG(CAST(value AS DOUBLE)) as baseline_avg,
STDDEV(CAST(value AS DOUBLE)) as baseline_std
FROM telemetry
WHERE array_id = '{array_id}'
AND timestamp BETWEEN date_add('day', -7, TIMESTAMP '{timestamp}')
AND date_add('hour', -1, TIMESTAMP '{timestamp}')
AND quality = 'GOOD'
GROUP BY datapoint_name
),
incident_data AS (
SELECT
datapoint_name,
timestamp,
CAST(value AS DOUBLE) as val
FROM telemetry
WHERE array_id = '{array_id}'
AND timestamp BETWEEN date_add('minute', -{window_minutes}, TIMESTAMP '{timestamp}')
AND date_add('minute', {window_minutes}, TIMESTAMP '{timestamp}')
)
SELECT
i.datapoint_name,
i.timestamp,
i.val as incident_value,
b.baseline_avg,
b.baseline_std,
ABS(i.val - b.baseline_avg) / NULLIF(b.baseline_std, 0) as z_score
FROM incident_data i
JOIN baseline b ON i.datapoint_name = b.datapoint_name
WHERE ABS(i.val - b.baseline_avg) / NULLIF(b.baseline_std, 0) > 2.0
ORDER BY i.timestamp ASC;
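The same z-score screening can also be applied client-side to rows the AWS agent has already returned. A minimal sketch in Python (the function name and tuple shapes are illustrative, not part of the skill):

```python
from statistics import mean, stdev

def flag_anomalies(baseline_values, incident_rows, z_threshold=2.0):
    """Flag incident-window samples deviating > z_threshold sigma from baseline.

    baseline_values: floats from the pre-incident baseline window
    incident_rows: (timestamp, value) tuples from the incident window
    """
    avg = mean(baseline_values)
    std = stdev(baseline_values)
    if std == 0:
        return []  # constant signal: z-score undefined (mirrors NULLIF in the SQL)
    return [
        (ts, val, (val - avg) / std)
        for ts, val in incident_rows
        if abs(val - avg) / std > z_threshold
    ]
```

This mirrors the query's `ABS(i.val - b.baseline_avg) / NULLIF(b.baseline_std, 0) > 2.0` condition, which is useful when iterating on the threshold without re-querying Athena.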
Phase 4B: Temporal Correlation — Build the Event Timeline
Assemble all anomalous events in strict chronological order:
TIMESTAMP            | DEVICE         | DATA POINT              | VALUE    | DEVIATION
─────────────────────────────────────────────────────────────────────────────────────────
2025-02-14T14:27:03Z | HVAC_01        | COOLANT_FLOW_RATE       | 12.3 LPM | -2.8σ (low)
2025-02-14T14:28:45Z | RACK_03        | RACK_INLET_TEMP         | 34.2°C   | +2.1σ (high)
2025-02-14T14:30:11Z | RACK_03_MOD_12 | CELL_TEMP_MAX           | 38.7°C   | +3.4σ (high)
2025-02-14T14:31:58Z | RACK_03_MOD_12 | CELL_VOLTAGE_MAX        | 3.72V    | +2.9σ (high)
2025-02-14T14:32:17Z | BMS_RACK_03    | BMS_CELL_OVERVOLT_ALARM | ACTIVE   | *** INCIDENT ***
Phase 4C: Causal Chain Analysis
Using the timeline, work backwards from the incident to identify the causal chain:
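The backward walk can be sketched as a simple ordering of anomalies that precede the incident (illustrative Python; temporal precedence is necessary but not sufficient for causation, so each link still needs a plausible physical mechanism):

```python
def causal_chain(anomalies, incident_ts):
    """Order anomalies preceding the incident, oldest first.

    anomalies: (timestamp, device, datapoint) tuples with ISO-8601 'Z'
    timestamps, which sort lexicographically in chronological order.
    """
    prior = [a for a in anomalies if a[0] < incident_ts]
    return sorted(prior, key=lambda a: a[0])
```

Applied to the example timeline above, this puts the coolant flow deviation first, suggesting the chain flow → inlet temp → cell temp → cell voltage → alarm.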
Phase 4D: Device Relationship Analysis
If an XML configuration file is provided, use it to understand:
Parse XML to extract:
- <Device id="..." type="..." parent="...">
- <Connection source="..." target="..." type="...">
- <ProtectionZone devices="..." trip_action="...">
Phase 4E: Root Cause Determination
Apply the "5 Whys" framework using data:
| Why # | Question | Data-Driven Answer |
|---|---|---|
| 1 | Why did the overvoltage alarm trigger? | Cell voltage exceeded 3.70V threshold |
| 2 | Why did cell voltage rise above threshold? | Cell temperature was elevated (+3.4σ) |
| 3 | Why was cell temperature elevated? | Rack inlet temperature was high (+2.1σ) |
| 4 | Why was rack inlet temperature high? | Coolant flow rate dropped (-2.8σ) |
| 5 | Why did coolant flow rate drop? | [Requires further investigation — pump data, valve position, coolant level] |
Classification of Root Cause:
Objective: Based on the confirmed or probable root cause, define corrective actions.
Actions:
-- Check if the root cause condition exists at other arrays
SELECT array_id, COUNT(*) as occurrence_count,
MAX(timestamp) as most_recent
FROM telemetry
WHERE datapoint_name = '{root_cause_datapoint}'
AND {root_cause_condition}
AND timestamp > date_add('day', -30, NOW())
GROUP BY array_id
ORDER BY occurrence_count DESC;
Objective: After corrective actions are implemented, verify they resolved the issue.
Actions:
SELECT timestamp, datapoint_name, value
FROM telemetry
WHERE array_id = '{array_id}'
AND datapoint_name IN ('{affected_datapoints}')
AND timestamp > TIMESTAMP '{correction_timestamp}'
ORDER BY timestamp ASC;
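One way to judge the returned post-correction samples is to check that they fall back inside the pre-incident baseline envelope. A sketch, assuming a ±2σ acceptance band (real acceptance limits should come from the site's operating envelope, not this heuristic):

```python
from statistics import mean, stdev

def correction_verified(baseline_values, post_correction_values, n_sigma=2.0):
    """True if every post-correction sample lies within
    baseline mean +/- n_sigma * baseline stddev."""
    avg = mean(baseline_values)
    std = stdev(baseline_values)
    lo, hi = avg - n_sigma * std, avg + n_sigma * std
    return all(lo <= v <= hi for v in post_correction_values)
```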
Objective: Prevent recurrence across the fleet.
Actions:
-- Create a predictive query: Did the root cause condition appear before the incident?
-- If so, how far in advance? This becomes an early warning indicator.
SELECT timestamp, value,
date_diff('minute', timestamp, TIMESTAMP '{incident_timestamp}') as minutes_before_incident
FROM telemetry
WHERE array_id = '{array_id}'
AND datapoint_name = '{root_cause_datapoint}'
AND timestamp BETWEEN date_add('hour', -24, TIMESTAMP '{incident_timestamp}')
AND TIMESTAMP '{incident_timestamp}'
AND {abnormal_condition}
ORDER BY timestamp ASC;
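From the query results, the lead time of the early warning indicator can be computed as the gap between the earliest abnormal sample and the incident. A minimal sketch (helper name is illustrative):

```python
from datetime import datetime, timezone

def lead_time_minutes(abnormal_timestamps, incident_ts):
    """Minutes of warning between the earliest abnormal sample and the incident.

    Timestamps are ISO-8601 'Z' strings as returned by the queries above.
    Returns None when no abnormal sample precedes the incident.
    """
    def parse(ts):
        return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

    incident = parse(incident_ts)
    prior = [parse(ts) for ts in abnormal_timestamps if parse(ts) < incident]
    if not prior:
        return None
    return int((incident - min(prior)).total_seconds() // 60)
```

In the worked example, the coolant flow deviation at 14:27:03Z gives roughly 5 minutes of warning before the 14:32:17Z alarm.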
Objective: Close the investigation with complete documentation.
Deliverables:
The agent improves over time by maintaining a knowledge base of resolved incidents.
Each resolved incident adds an entry:
{
"incident_id": "INC-2025-0214-001",
"array_id": "ARRAY-US-TX-042",
"timestamp": "2025-02-14T14:32:17Z",
"error_datapoint": "BMS_CELL_OVERVOLT_RACK03_MOD12",
"root_cause_category": "equipment_failure",
"root_cause_summary": "HVAC coolant pump degradation caused reduced flow, leading to thermal runaway in Rack 03",
"causal_chain": [
"coolant_pump_degradation",
"reduced_coolant_flow",
"elevated_rack_inlet_temperature",
"elevated_cell_temperature",
"cell_voltage_rise",
"overvoltage_alarm"
],
"early_warning_indicators": [
{
"datapoint": "COOLANT_FLOW_RATE",
"condition": "value < baseline - 2*stddev",
"lead_time_minutes": 5
}
],
"corrective_actions": ["pump_replacement", "flow_threshold_update"],
"confidence": "confirmed",
"resolution_date": "2025-02-15",
"similar_incidents": ["INC-2024-0918-003", "INC-2024-1201-007"]
}
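A new incident can be matched against these records by overlap with past causal chains. A sketch with an illustrative scoring scheme (+2 for an identical error datapoint, +1 per shared causal-chain step; the weights are assumptions, not part of the skill):

```python
def rank_similar_incidents(new_error_datapoint, candidate_chain, knowledge_base):
    """Rank past incidents by overlap with the current hypothesis.

    knowledge_base: list of entries shaped like the JSON record above.
    Returns (score, incident_id) pairs, best match first.
    """
    scored = []
    for entry in knowledge_base:
        score = 2 if entry["error_datapoint"] == new_error_datapoint else 0
        score += len(set(candidate_chain) & set(entry["causal_chain"]))
        if score > 0:
            scored.append((score, entry["incident_id"]))
    return sorted(scored, reverse=True)
```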
When investigating a new incident, the agent should:
The agent should track and report its confidence using this framework:
| Confidence Level | Criteria | Action |
|---|---|---|
| Confirmed | Complete data-supported causal chain, verified by correction | Close and document |
| High (>80%) | Strong data support, consistent with known patterns | Recommend corrective action |
| Medium (50-80%) | Partial data support, plausible but gaps exist | Request additional data/review |
| Low (<50%) | Limited data, multiple competing hypotheses | Escalate to human SME |
| Inconclusive | Insufficient data to determine root cause | Document findings, request instrumentation improvement |
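The ladder above maps naturally onto a small classifier. A sketch, assuming the agent produces a 0–1 evidence-support score (how that score is computed is outside this table; only the thresholds come from it):

```python
def confidence_level(support_score, verified=False):
    """Map an evidence-support score in [0, 1] to the confidence ladder."""
    if verified:
        return "Confirmed"   # complete causal chain, verified by correction
    if support_score > 0.8:
        return "High"
    if support_score >= 0.5:
        return "Medium"
    if support_score > 0.0:
        return "Low"
    return "Inconclusive"
```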
When the user provides an XML configuration file, extract and use the following:
<!-- Example structure — actual schema will vary -->
<System id="ARRAY-US-TX-042">
<Rack id="RACK_03">
<Module id="MOD_12">
<Cell id="CELL_01" ... />
</Module>
<BMS id="BMS_RACK_03" monitors="RACK_03" />
</Rack>
<PCS id="PCS_01" connected_racks="RACK_01,RACK_02,RACK_03" />
<HVAC id="HVAC_01" cooling_zones="RACK_01,RACK_02,RACK_03,RACK_04" />
</System>
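As a sketch, the example topology above can be parsed with the standard library to map each rack to the cooling resources it shares (element and attribute names follow the example schema and will vary per site):

```python
import xml.etree.ElementTree as ET

def shared_cooling_zones(xml_text):
    """Map each rack to the HVAC units that cool it, so correlated symptoms
    across racks can be traced back to a shared resource."""
    root = ET.fromstring(xml_text)
    rack_to_hvac = {}
    for hvac in root.iter("HVAC"):
        for rack in hvac.get("cooling_zones", "").split(","):
            rack_to_hvac.setdefault(rack, []).append(hvac.get("id"))
    return rack_to_hvac
```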
A failure in a shared resource (e.g., a cooling loop serving 4 racks) can cause correlated symptoms across all devices sharing that resource. The configuration file helps the agent distinguish between:
The 8D RCA agent does not query Athena directly. It formulates the analytical questions and query patterns, then delegates data retrieval to the AWS agent.
Workflow:
Data Request Format:
{
"request_type": "telemetry_query",
"purpose": "D4 - Anomaly detection in ±15min window",
"array_id": "ARRAY-US-TX-042",
"time_range": {
"start": "2025-02-14T14:17:17Z",
"end": "2025-02-14T14:47:17Z"
},
"filters": {
"device_ids": ["RACK_03", "HVAC_01"],
"datapoint_pattern": "*"
},
"expected_output": "All telemetry data points for specified devices in time range"
}
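A request in the format above can be assembled programmatically, with the time window derived from the incident timestamp. A sketch (the helper and its defaults are illustrative; the field names mirror the example):

```python
from datetime import datetime, timedelta, timezone

def build_telemetry_request(array_id, incident_ts, window_minutes=15,
                            device_ids=None, purpose=""):
    """Build a telemetry data request for the AWS agent,
    spanning +/- window_minutes around the incident timestamp."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    t = datetime.strptime(incident_ts, fmt).replace(tzinfo=timezone.utc)
    return {
        "request_type": "telemetry_query",
        "purpose": purpose,
        "array_id": array_id,
        "time_range": {
            "start": (t - timedelta(minutes=window_minutes)).strftime(fmt),
            "end": (t + timedelta(minutes=window_minutes)).strftime(fmt),
        },
        "filters": {"device_ids": device_ids or [], "datapoint_pattern": "*"},
        "expected_output": "All telemetry data points for specified devices in time range",
    }
```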
The agent should: