Use this skill when tasked with ensuring quality, scalability, or reliability of an AI system. Covers testing strategies (unit, integration, model quality), scaling patterns, monitoring and observability setup, model drift detection, incident response, and production QA checklists.
This skill provides a comprehensive framework for making AI systems production-ready — covering testing, scaling, monitoring, drift detection, incident response, and quality assurance.
Load only the reference file you need:

- references/instructions.md — audit order, implementation steps, and escalation rules
- references/debug.md — latency, drift, and outage recovery guidance
- references/tests.md — production-readiness validation scenarios

Objective: Understand the current reliability posture.
Assessment Checklist:
## Reliability Audit
### Testing
- [ ] Unit tests exist for core logic
- [ ] Integration tests exist for API
- [ ] Model quality tests exist (accuracy regression)
- [ ] Load tests have been run
- Total test coverage: ____%
### Monitoring
- [ ] Health check endpoint exists
- [ ] Latency tracked (p50, p95, p99)
- [ ] Error rate tracked
- [ ] GPU/CPU utilization tracked
- [ ] Model confidence distribution tracked
- [ ] Alerting configured
### Scaling
- [ ] Horizontal scaling configured
- [ ] Auto-scaling policies defined
- [ ] Resource limits set
- [ ] Load tested at expected peak
### Incident Response
- [ ] Runbook exists
- [ ] Rollback procedure documented
- [ ] On-call schedule defined
- [ ] Post-mortem process established
Output: reliability_audit.md
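The monitoring items in the checklist (latency percentiles, error rate) can be sketched in-process as follows. This is a minimal illustration only — in production these metrics would be exported to a system like Prometheus rather than tracked by hand — and the class and method names are assumptions, not part of any library.

```python
from statistics import quantiles

class LatencyTracker:
    """Minimal in-process latency/error tracker (illustrative sketch;
    real deployments would export to Prometheus/StatsD instead)."""

    def __init__(self):
        self.samples = []
        self.errors = 0
        self.total = 0

    def record(self, seconds, error=False):
        self.total += 1
        self.samples.append(seconds)
        if error:
            self.errors += 1

    def percentile(self, p):
        # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile
        return quantiles(self.samples, n=100)[p - 1]

    def error_rate(self):
        return self.errors / self.total if self.total else 0.0
```

With this in place, an alert rule is just a comparison, e.g. fire when `tracker.percentile(99) > 2.0` or `tracker.error_rate() > 0.01`.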
Testing Hierarchy for AI Systems:
| Test Type | What It Validates | Example |
|---|---|---|
| Unit Tests | Individual functions | Data preprocessing logic |
| Integration Tests | API + model together | POST /predict returns valid response |
| Model Quality Tests | Model accuracy hasn't regressed | F1 > 0.85 on test set |
| Contract Tests | API schema compliance | Response matches OpenAPI spec |
| Load Tests | Performance under stress | p99 < 2s at 100 RPS |
| Shadow Tests | New model vs old model comparison | Run both, compare outputs |
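The shadow-test row above can be sketched as a simple disagreement check: run both models on the same traffic sample and fail if they diverge too often. The function name and the 5% threshold below are illustrative assumptions, not a fixed standard.

```python
def shadow_test(old_model, new_model, inputs, max_disagreement=0.05):
    """Compare old and new model outputs on identical inputs.

    Returns (passed, disagreement_rate). The 5% threshold is an
    illustrative default; tune it to your product's tolerance.
    """
    disagreements = sum(1 for x in inputs if old_model(x) != new_model(x))
    rate = disagreements / len(inputs)
    return rate <= max_disagreement, rate
```

In a real deployment the "inputs" would be mirrored production traffic, with the new model's outputs logged but never served.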
Example Test Suite:
```python
# tests/test_unit.py
def test_preprocess_handles_empty_string():
    result = preprocess("")
    assert result == ""

def test_preprocess_removes_html():
    result = preprocess("<b>hello</b>")
    assert result == "hello"
```

```python
# tests/test_integration.py
def test_predict_valid_input(client):
    response = client.post("/predict", json={"text": "great product"})
    assert response.status_code == 200
    assert "prediction" in response.json()

def test_predict_returns_confidence(client):
    response = client.post("/predict", json={"text": "test"})
    data = response.json()
    assert 0 <= data["confidence"] <= 1
```

```python
# tests/test_model_quality.py
def test_model_accuracy_above_threshold():
    predictions = [model.predict(x) for x in test_data]
    accuracy = sum(p == label for p, label in zip(predictions, labels)) / len(labels)
    assert accuracy >= 0.85, f"Model accuracy {accuracy} below threshold 0.85"

def test_no_accuracy_regression():
    current_f1 = evaluate_model(model, test_set)
    baseline_f1 = load_baseline_metrics()["f1"]
    assert current_f1 >= baseline_f1 - 0.02, \
        f"F1 regressed: {current_f1} < {baseline_f1} - 0.02"
```
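The load-test row from the hierarchy table ("p99 < 2s at 100 RPS") can also be expressed as a test. The sketch below drives a callable through a thread pool and checks the nearest-rank p99; it is an assumption-laden stand-in — a real load test would use a dedicated tool (locust, k6) against the live API, and `call` here is whatever invokes your endpoint.

```python
import concurrent.futures
import time

def load_test(call, n_requests=100, concurrency=10, p99_budget=2.0):
    """Fire n_requests concurrently and check p99 latency against a budget.

    Illustrative only: measures in-process call latency, not true
    network-level RPS. Returns (passed, p99_seconds).
    """
    def timed():
        start = time.perf_counter()
        call()
        return time.perf_counter() - start

    latencies = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed) for _ in range(n_requests)]
        for fut in concurrent.futures.as_completed(futures):
            latencies.append(fut.result())

    latencies.sort()
    # Nearest-rank approximation of the 99th percentile
    p99 = latencies[max(int(len(latencies) * 0.99) - 1, 0)]
    return p99 <= p99_budget, p99
```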
Scaling Patterns:
| Pattern | When to Use | Implementation |
|---|---|---|
| Horizontal Scaling | Stateless APIs | Multiple replicas behind load balancer |
| Dynamic Batching | GPU inference | Collect N requests, batch inference |
| Response Caching | Repeated queries | Redis/in-memory LRU cache |
| Async Processing | Long-running inference | Queue (Redis/RabbitMQ) + workers |
| Auto-scaling | Variable traffic | K8s HPA, cloud auto-scale groups |
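The response-caching row from the table can be sketched with the standard-library in-memory variant. The `model_predict` function below is a hypothetical stand-in for the real inference call; a production setup would more likely use Redis with a TTL so the cache survives restarts and is shared across replicas.

```python
from functools import lru_cache

def model_predict(text: str) -> str:
    # Stand-in for the real (expensive) inference call.
    return "positive" if "great" in text else "negative"

@lru_cache(maxsize=10_000)
def cached_predict(text: str) -> str:
    # Identical inputs hit the in-memory LRU cache instead of the model.
    return model_predict(text)
```

Note that exact-match caching only pays off when queries repeat verbatim; `cached_predict.cache_info()` reports the hit rate so you can verify it is actually helping.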
Kubernetes HPA Example:
apiVersion: autoscaling/v2