Investigate RDS PostgreSQL slow query performance using pg_stat_statements. Use when the CloudWatch RDS CPUUtilization alarm fires, the application reports slow queries, or pg_stat_statements shows queries with mean_exec_time > 1000ms. Covers slow query identification, index gap analysis, parameter group review, and tuning recommendations.
Trigger: rds-cpu-high fires on RDS instance $RDS_INSTANCE_ID.

| Input | Source | Required | Description |
|---|---|---|---|
| RDS_INSTANCE_ID | $RDS_INSTANCE_ID env | YES | RDS instance identifier (e.g., prod-db-01) |
| AWS_DEFAULT_REGION | $AWS_DEFAULT_REGION env | YES | AWS region (e.g., us-east-1) |
| DB_HOST | $DB_HOST env | YES | RDS endpoint hostname (from describe-db-instances) |
| DB_PORT | $DB_PORT env | NO | PostgreSQL port (default: 5432) |
| DB_NAME | $DB_NAME env | YES | Database name for pg_stat_statements query |
| DB_USER | $DB_USER env | YES | PostgreSQL user with pg_read_all_stats role |
| HERMES_LAB_MODE | $HERMES_LAB_MODE env | NO | mock or live (default: live) |
Tools: aws cli v2, psql (PostgreSQL client), or mock-aws / mock-psql for lab mode.
Permissions: rds:DescribeDBInstances, rds:DescribeDBParameters, cloudwatch:GetMetricStatistics (read-only AWS); pg_read_all_stats role on the target PostgreSQL database.
export RDS_INSTANCE_ID=prod-db-01
export AWS_DEFAULT_REGION=us-east-1
export DB_HOST=prod-db-01.c1234example.us-east-1.rds.amazonaws.com
export DB_NAME=appdb
export DB_USER=readonly_user
Lab mode: set HERMES_LAB_MODE=mock and add course/infrastructure/wrappers/ to PATH for offline labs.

Step 1.1 — Instance status and current configuration:
aws rds describe-db-instances \
--db-instance-identifier $RDS_INSTANCE_ID \
--region $AWS_DEFAULT_REGION \
--query 'DBInstances[0].{Status:DBInstanceStatus,Class:DBInstanceClass,Engine:EngineVersion,MultiAZ:MultiAZ,StorageType:StorageType}'
Expected output (instance available):
{
"Status": "available",
"Class": "db.t3.medium",
"Engine": "15.4",
"MultiAZ": false,
"StorageType": "gp2"
}
Expected output (instance in maintenance):
{
"Status": "maintenance",
"Class": "db.t3.medium",
"Engine": "15.4",
"MultiAZ": false,
"StorageType": "gp2"
}
Note: If Status is not "available", investigation may yield incomplete data. Document status and continue.
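The status gate above can be sketched in shell. This is a minimal illustration with a hard-coded sample value; in practice, extract `STATUS` from the describe-db-instances JSON (e.g. with `--query 'DBInstances[0].DBInstanceStatus' --output text`).

```shell
# Hedged sketch: gate the investigation on DBInstanceStatus.
# STATUS is a hard-coded sample here, not a live value.
STATUS="maintenance"
if [ "$STATUS" = "available" ]; then
  echo "Instance available: proceed with full investigation"
else
  echo "WARNING: status=${STATUS}: data may be incomplete; document and continue"
fi
```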
Step 1.2 — CloudWatch CPU utilization (last 30 minutes, 5-minute intervals):
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name CPUUtilization \
--dimensions Name=DBInstanceIdentifier,Value=$RDS_INSTANCE_ID \
--start-time $(date -u -v-30M +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 300 \
--statistics Average Maximum \
--region $AWS_DEFAULT_REGION
Expected output: Datapoints array; note the Maximum value for comparison with pg_stat_statements findings. Maximum > 80% confirms CPU pressure was active during the investigation window. Note: `date -u -v-30M` is BSD/macOS syntax; on GNU/Linux use `date -u -d '-30 minutes' +%Y-%m-%dT%H:%M:%SZ` for the start time.
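The 80% threshold check can be scripted. This sketch uses a fabricated `CPU_MAX` value; in practice, take the largest Maximum from the Datapoints array returned above.

```shell
# Hedged sketch: apply the 80% CPU-pressure threshold.
# CPU_MAX is a sample value, not a live CloudWatch datapoint.
CPU_MAX=87.2
# POSIX shell has no float comparison, so use awk for the threshold test
CPU_PRESSURE=$(awk -v m="$CPU_MAX" 'BEGIN { print (m > 80) ? "yes" : "no" }')
echo "CPU pressure active: ${CPU_PRESSURE}"
```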
Step 1.3 — CloudWatch DatabaseConnections (saturation check):
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=$RDS_INSTANCE_ID \
--start-time $(date -u -v-30M +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 300 \
--statistics Average Maximum \
--region $AWS_DEFAULT_REGION
Expected output: Datapoints; db.t3.medium max_connections ≈ 170; connection saturation if Average > 130.
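The saturation threshold from this step can be sketched the same way. `CONN_AVG` is a fabricated sample; in practice, take the Average from the Datapoints above.

```shell
# Hedged sketch: connection-saturation check for db.t3.medium.
# CONN_AVG is a hard-coded sample, not a live value.
CONN_AVG=142
if [ "$CONN_AVG" -gt 130 ]; then
  SATURATED=yes
else
  SATURATED=no
fi
echo "connection saturation: ${SATURATED} (Average=${CONN_AVG})"
```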
Step 1.4 — pg_stat_statements: top 10 slowest queries by mean execution time:
psql -h $DB_HOST -p ${DB_PORT:-5432} -U $DB_USER -d $DB_NAME -c "
SELECT
round(mean_exec_time::numeric, 2) AS mean_time_ms,
round(total_exec_time::numeric, 2) AS total_time_ms,
calls,
round(rows::numeric / calls, 2) AS rows_per_call,
query
FROM pg_stat_statements
WHERE calls > 10
ORDER BY mean_exec_time DESC
LIMIT 10;" -o /dev/stdout --csv
Expected output (clean scenario): 1-2 queries with mean_time_ms > 1000; the rest < 100ms.
Expected output (messy scenario): 5+ queries with mean_time_ms > 1000 on overlapping tables — indicates systemic index gaps.
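Counting queries over the 1000 ms threshold in the CSV output can be done with awk. The two data rows below are fabricated samples; pipe the real psql CSV output in instead.

```shell
# Hedged sketch: count rows over the 1000 ms mean-time threshold.
# The CSV content is a fabricated sample, not live pg_stat_statements data.
cat <<'EOF' > /tmp/pgss_sample.csv
mean_time_ms,total_time_ms,calls,rows_per_call,query
2450.31,245031.00,100,1.00,SELECT * FROM orders WHERE customer_id = $1
12.04,60200.00,5000,20.00,SELECT id FROM users WHERE email = $1
EOF
SLOW_COUNT=$(awk -F, 'NR > 1 && $1 > 1000 { n++ } END { print n+0 }' /tmp/pgss_sample.csv)
echo "queries with mean_time_ms > 1000: ${SLOW_COUNT}"
```

Note this naive `-F,` split is only safe when the query text itself contains no commas; the numeric columns of interest are all before the query column, so the threshold test still reads the right field.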
Step 1.5 — Check for missing indexes using pg_stat_user_tables (sequential scan ratio):
psql -h $DB_HOST -p ${DB_PORT:-5432} -U $DB_USER -d $DB_NAME -c "
SELECT
schemaname,
relname AS table_name,
seq_scan,
idx_scan,
CASE WHEN (seq_scan + idx_scan) > 0
THEN round(100.0 * seq_scan / (seq_scan + idx_scan), 2)
ELSE 0
END AS seq_scan_pct
FROM pg_stat_user_tables
WHERE seq_scan > 100
ORDER BY seq_scan_pct DESC
LIMIT 10;" -o /dev/stdout --csv
Expected output: Tables with seq_scan_pct > 80% AND seq_scan > 100 are index candidates. Cross-reference table names with slow queries from Step 1.4.
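The cross-reference between Steps 1.4 and 1.5 is a set intersection. Both lists below are hard-coded samples; in practice, extract table names from the two CSV outputs above.

```shell
# Hedged sketch: intersect slow-query tables with high-seq-scan tables.
# Both lists are fabricated samples, not live query output.
printf 'orders\nusers\n' | sort > /tmp/slow_query_tables.txt
printf 'orders\nevents\n' | sort > /tmp/high_seqscan_tables.txt
# comm -12 prints only lines common to both (inputs must be sorted)
INDEX_CANDIDATES=$(comm -12 /tmp/slow_query_tables.txt /tmp/high_seqscan_tables.txt)
echo "index candidates: ${INDEX_CANDIDATES}"
```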
Step 1.6 — Current parameter group values (work_mem, shared_buffers, max_connections):
aws rds describe-db-parameters \
--db-parameter-group-name $(aws rds describe-db-instances \
--db-instance-identifier $RDS_INSTANCE_ID \
--query 'DBInstances[0].DBParameterGroups[0].DBParameterGroupName' \
--output text \
--region $AWS_DEFAULT_REGION) \
--region $AWS_DEFAULT_REGION \
--query "Parameters[?ParameterName=='work_mem' || ParameterName=='shared_buffers' || ParameterName=='max_connections']"
Expected output: List of parameter name/value pairs. work_mem default is 4096 (4MB — often too low for complex queries with ORDER BY or GROUP BY). Record ParameterValue and ApplyMethod for each.
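A bare numeric work_mem ParameterValue is in kilobytes; a quick conversion makes the recorded value readable. 4096 is the sample default noted above.

```shell
# Hedged sketch: convert a numeric work_mem ParameterValue (kilobytes) to MB.
# WORK_MEM_KB is the sample default, not a live parameter value.
WORK_MEM_KB=4096
echo "work_mem = $((WORK_MEM_KB / 1024))MB"
```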
Use the data collected in Phase 1 to form a diagnosis. Apply the following decision tree. All conditions are numeric thresholds — do not skip to escalation without checking each branch.
Slow query analysis (from Step 1.4 output):
IF any query in pg_stat_statements has mean_exec_time > 1000ms:
THEN: Slow query confirmed. Record query text and mean_time_ms for all queries above threshold.
IF the slow query performs seq_scan on a table with seq_scan_pct > 80% in pg_stat_user_tables (Step 1.5):
THEN: Missing index suspected. Formulate CREATE INDEX CONCURRENTLY recommendation for the WHERE clause columns. NOTE: do not execute — propose for DBA approval with specific column names and table name.
ELSE IF rows_per_call > 1000 AND mean_exec_time > 1000ms:
THEN: Returning large result sets. Recommend adding LIMIT clause and WHERE filter to application query. Escalate to developer team with query text (redacted of PII first).
ELSE:
THEN: Slow query without seq_scan and normal row counts — likely a complex join or aggregate. Escalate to DBA with pg_stat_statements output.
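When the tree above lands on the missing-index branch, the recommendation can be rendered as text for DBA review. Table and column names here are hypothetical placeholders; fill in the names from the slow query's WHERE clause. The sketch only prints the proposal; nothing is executed against the database.

```shell
# Hedged sketch: render a CREATE INDEX CONCURRENTLY proposal for DBA review.
# TABLE and COL are hypothetical examples, not values from a live system.
TABLE=orders
COL=customer_id
PROPOSAL="CREATE INDEX CONCURRENTLY idx_${TABLE}_${COL} ON ${TABLE} (${COL});"
echo "-- Proposal for DBA approval (approved change window only):"
echo "$PROPOSAL"
```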
High CPU without individual slow queries (from Steps 1.2 and 1.4):
IF mean_exec_time < 1000ms for ALL queries BUT CloudWatch CPUUtilization Maximum > 80%:
THEN: High CPU without individual slow queries — high query volume suspected.
IF DatabaseConnections Maximum > 130 (for db.t3.medium, from Step 1.3):
THEN: Connection saturation causing queuing delay. Recommend connection pooling (PgBouncer). Escalate to infrastructure team with connection count data.
ELSE:
THEN: Many small queries without saturation. Recommend application-level query batching or caching review. Escalate to developer team with call counts from pg_stat_statements.
Parameter group analysis (from Step 1.6):
IF work_mem < 4096 (4MB) AND slow queries in Step 1.4 involve ORDER BY, GROUP BY, or DISTINCT keywords:
THEN: work_mem may be causing disk sorts. Recommend testing with SET work_mem = '64MB' in a test transaction. NOTE: parameter group change requires DBA approval before applying to the instance.
No issue found:
IF all metrics normal (CloudWatch CPUUtilization Average < 60%, no slow queries with mean_exec_time > 1000ms, DatabaseConnections Average < 100):
THEN: Performance within normal range at investigation time. Alert may have been transient. Document findings with timestamps and close.
Note: CPU spike may have resolved before investigation — check CloudWatch history for the 30 minutes before your analysis window.
Escalate to DBA when:
Include in escalation:
Do NOT:
- Run ALTER TABLE, DROP TABLE, or any DDL statements — schema changes require an approved change window.
- Kill sessions (pg_terminate_backend) without DBA approval — this drops in-flight transactions and causes application errors.
- Run VACUUM FULL or REINDEX during investigation — these lock tables and worsen performance under load.

This skill is read-only. No DDL or parameter changes are executed. If any change was mistakenly applied:
aws rds describe-events --source-identifier $RDS_INSTANCE_ID --source-type db-instance --region $AWS_DEFAULT_REGION

Investigation is complete when:
- describe-db-instances output captured and instance status is "available"
- [MOCK MODE] banner confirmed visible in all command outputs if HERMES_LAB_MODE=mock