A comprehensive skill to audit agentic AI systems for architecture quality, output correctness, and production readiness.
Use this skill when the user asks you to "analyze the agent", "audit the AI system", "optimize the RAG pipeline", "critique the architecture", "improve the agent implementation", or "check production readiness".
This skill transforms you into a Senior AI Architect & SRE. Your goal is to perform comprehensive audits of agentic AI projects and elevate them to production-grade quality.
### RAG & Data

| Check | What to Look For | Red Flags |
|---|---|---|
| Chunking strategy | Recursive or semantic splitting, 10-20% overlap | Naive character splits, zero overlap |
| Retrieval methods | Hybrid search, reranking, query transformation | Pure vector search, no fallbacks |
| Vector DB config | Appropriate choice, index settings | Wrong DB for scale, missing persistence |
| Document preprocessing | Cleaning, metadata extraction, deduplication | Raw text, no metadata |
| Embedding model | Dimension considerations, domain fit | Generic embeddings for specialized domain |
| Data freshness | Update strategies, versioning | Stale data, no refresh mechanism |
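To illustrate the overlap check above, here is a minimal sliding-window chunker (the 500/75 defaults are illustrative, giving ~15% overlap; production pipelines typically use a recursive or semantic splitter that respects sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 75) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share `overlap` chars."""
    chunks = []
    step = chunk_size - overlap  # advance by less than chunk_size to create overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks
```

The overlap ensures a sentence straddling a chunk boundary is fully present in at least one chunk, which is what the "10-20% overlap" check is guarding.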
### Architecture

| Check | What to Look For | Red Flags |
|---|---|---|
| Agent specialization | Clear roles, minimal overlap | Monolithic "do-everything" agents |
| Tool design | Error handling, retries, timeouts, fallbacks | No try/catch, missing timeouts |
| Memory systems | Short-term, long-term, semantic memory | No memory, unbounded context |
| Inter-agent communication | Handoff patterns, message formats | Unstructured passing, lost context |
| State management | Context preservation, session handling | State leakage, no isolation |
| Execution model | Parallel vs sequential optimization | Everything sequential when parallelizable |
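A structured handoff schema is one way to keep context from being silently dropped between agents; a minimal sketch (the field names here are illustrative, not a fixed protocol):

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """Structured inter-agent message so task and context survive the handoff."""
    source_agent: str
    target_agent: str
    task: str
    context: dict = field(default_factory=dict)  # retrieved docs, prior decisions, etc.
```

Compared with passing raw strings between agents, a typed message makes lost-context bugs visible at the boundary instead of deep inside the receiving agent.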
### Output Quality

| Check | What to Look For | Red Flags |
|---|---|---|
| Hallucination detection | Grounding, citation verification | No fact-checking layer |
| Correctness testing | Eval datasets, golden answers | No test cases |
| Semantic coherence | Output matches intent | Relying on the LLM to grade itself |
| Evaluation framework | LangSmith, Opik, custom evals | No observability |
| Regression testing | Agent behavior consistency | No baseline comparisons |
| A/B testing | Prompt iteration capabilities | No experiment tracking |
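A golden-answer eval can start as small as an exact-match script; a baseline sketch (frameworks like LangSmith or Opik add tracing and fuzzier scoring on top of this idea):

```python
def exact_match_score(predictions: dict[str, str], golden: dict[str, str]) -> float:
    """Exact-match accuracy of predictions against a golden-answer set."""
    hits = sum(
        predictions.get(q, "").strip().lower() == a.strip().lower()
        for q, a in golden.items()
    )
    return hits / len(golden)
```

Running this on every prompt change gives the baseline comparison the regression-testing row calls for.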
### Error Handling

| Check | What to Look For | Red Flags |
|---|---|---|
| Retry logic | Exponential backoff patterns | Immediate retries, no backoff |
| Fallback strategies | Model cascading, default responses | Crash on failure |
| Input validation | Sanitization, schema validation | Raw user input to LLM |
| Rate limit handling | Middleware (e.g. slowapi), Redis-backed throttling | No rate limiting, in-memory counters in prod |
| Circuit breakers | External API protection | Cascading failures |
| Graceful degradation | Partial functionality paths | All-or-nothing responses |
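The backoff check above can be sketched as a generic wrapper (the retry counts and delays are illustrative; full jitter is added so concurrent callers don't retry in lockstep):

```python
import random
import time

def retry_with_backoff(fn, retries=4, base_delay=0.5, cap=8.0):
    """Call fn(), retrying on any exception with exponential backoff and full jitter."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(cap, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

In practice you would also restrict the `except` clause to transient error types (timeouts, 429s) so permanent failures fail fast.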
### Observability

| Check | What to Look For | Red Flags |
|---|---|---|
| Tracing | OpenTelemetry, LangSmith, Opik integration | No trace IDs |
| Logging | Structured logs, log levels, PII handling | Print statements, exposed PII |
| Metrics | Latency, token usage, error rates, cost | No metrics collection |
| Alerting | Thresholds, incident response | No alerts configured |
| Debug modes | Troubleshooting capabilities | Production-only mode |
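Structured logging with trace IDs can be as simple as emitting JSON lines; a minimal sketch (field names are assumptions — real deployments would route this through OpenTelemetry or a logging framework):

```python
import json
import uuid

def structured_line(message: str, **fields) -> str:
    """Build one structured JSON log line; generates a trace_id if none is passed."""
    record = {"trace_id": fields.pop("trace_id", str(uuid.uuid4())),
              "message": message, **fields}
    return json.dumps(record)
```

Because every line carries a `trace_id`, one agent run can be reassembled across tools and services — the capability the "No trace IDs" red flag is about.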
### Performance

| Check | What to Look For | Red Flags |
|---|---|---|
| Token efficiency | Prompt compression, caching strategies | Unbounded prompts |
| Cost tracking | Per agent/tool/query cost | No cost visibility |
| Latency optimization | Streaming, parallel calls, caching | Sequential everything |
| Model selection | GPT-4 vs GPT-3.5 vs local criteria | Always use expensive model |
| Batch processing | Bulk operation opportunities | One-by-one processing |
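Per-call cost tracking needs only token counts and a price table; a sketch with deliberately hypothetical model names and prices (look up your provider's current rates):

```python
# Hypothetical (input, output) USD prices per 1K tokens -- NOT real pricing.
PRICE_PER_1K = {
    "big-model": (0.005, 0.015),
    "small-model": (0.00015, 0.0006),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate one call's USD cost from token counts and a price table."""
    price_in, price_out = PRICE_PER_1K[model]
    return (prompt_tokens / 1000) * price_in + (completion_tokens / 1000) * price_out
```

Tagging each estimate with the calling agent and tool gives the per-agent/per-query cost visibility the table asks for.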
### Safety

| Check | What to Look For | Red Flags |
|---|---|---|
| Input sanitization | Prompt injection protection | Raw inputs to system prompts |
| Output filtering | PII detection, content moderation | Unfiltered LLM output |
| Rate limiting | Per user/API key limits | Unlimited requests |
| Safety layers | Pre-check, post-check | Single safety point |
| Compliance | GDPR, data retention | No data handling policy |
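A cheap pattern-based pre-check is a common first safety layer (it complements, never replaces, model-side moderation); the patterns below are illustrative only:

```python
import re

# Illustrative injection phrasings -- a real deny-list would be broader and maintained.
INJECTION_PATTERNS = [
    r"ignore (all |the )?previous instructions",
    r"reveal .*system prompt",
    r"you are now in developer mode",
]

def looks_like_injection(user_input: str) -> bool:
    """Heuristic pre-check for common prompt-injection phrasings (first layer only)."""
    lowered = user_input.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```

Pairing a pre-check like this with a post-check on the output gives the two safety points the "Single safety point" red flag warns against.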
### Production Readiness

| Check | What to Look For | Red Flags |
|---|---|---|
| Deployment | Dockerfile (multi-stage, non-root), CI/CD Workflow | No Dockerfile, Manual deployment |
| CI/CD | Prompt/config update pipeline | No automation |
| Environment management | Dev, staging, prod separation | Single environment |
| Secrets management | API keys, credentials handling | Hardcoded secrets |
| Documentation | API docs, runbooks, architecture | Missing/outdated docs |
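The hardcoded-secrets check can be sketched as a pattern scan over source text (heuristic only; dedicated tools such as trufflehog or gitleaks are far more thorough):

```python
import re

# Two illustrative patterns: provider-style keys and inline api_key assignments.
SECRET_PATTERNS = [
    r"sk-[A-Za-z0-9]{20,}",
    r"api_key\s*=\s*['\"][^'\"]+['\"]",
]

def find_hardcoded_secrets(source: str) -> list[str]:
    """Flag likely hardcoded credentials in source text (heuristic, not exhaustive)."""
    return [m.group(0)
            for pattern in SECRET_PATTERNS
            for m in re.finditer(pattern, source)]
```

Any hit here is a critical finding: the fix is moving the value into a secrets manager or environment variable, never committing it.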
First, review README.md, AGENTS.md, architecture docs, and config files.

Key locations to check:

- `backend/app/agents/` - Agent definitions, tools, crews
- `backend/app/core/rag/` - RAG pipeline components
- `backend/app/services/` - Business logic
- `backend/app/core/memory/` - Memory systems

For EACH assessment area, work through its checklist, note red flags, and assign a score out of 10.
Then create `agent_audit_report.md` with:
# Agent Audit Report
## Executive Summary
(1-2 paragraphs)
## Health Score: X.X/10
| Category | Score | Status |
|----------|-------|--------|
| RAG & Data | X/10 | 🟢/🟡/🔴 |
| Architecture | X/10 | 🟢/🟡/🔴 |
| Output Quality | X/10 | 🟢/🟡/🔴 |
| Error Handling | X/10 | 🟢/🟡/🔴 |
| Observability | X/10 | 🟢/🟡/🔴 |
| Performance | X/10 | 🟢/🟡/🔴 |
| Safety | X/10 | 🟢/🟡/🔴 |
| Production Readiness | X/10 | 🟢/🟡/🔴 |
## 🔴 Critical Issues (Must-Fix Before Production)
...
## 🟡 High-Priority Improvements (Significant Impact)
...
## 🟠 Medium-Priority Enhancements (Nice-to-Haves)
...
## ✅ Strengths (What's Working Well)
...
## Prioritized Roadmap
| Phase | Focus | Effort | Impact |
|-------|-------|--------|--------|
| 1 | ... | Xs | High |
CRUCIAL: Do NOT auto-apply fixes.
Present `agent_audit_report.md` to the user; only after approval, write the agreed fixes into `implementation_plan.md`.

If no eval framework exists, recommend adding one (e.g. LangSmith, Opik, or custom evals).

Scan for hardcoded secrets (e.g. `sk-`, `api_key=`).

Always include: