Generate/create/write PromQL queries, metric expressions, alerting rules, recording rules, Prometheus dashboards.
This skill provides a comprehensive, interactive workflow for generating production-ready PromQL queries with best practices built-in. Generate queries for monitoring dashboards, alerting rules, and ad-hoc analysis with an emphasis on user collaboration and planning before code generation.
Invoke this skill when the user asks to generate, create, or write PromQL queries, metric expressions, alerting rules, recording rules, or Prometheus dashboards.
CRITICAL: This skill emphasizes interactive planning before query generation. Always engage the user in a collaborative planning process to ensure the generated query matches their exact intentions.
Follow this workflow when generating PromQL queries:
Start by understanding what the user wants to monitor or measure. Ask clarifying questions to gather requirements:
Primary Goal: What are you trying to monitor or measure?
Use Case: What will this query be used for?
Context: Any additional context?
Use the AskUserQuestion tool to gather this information if not provided.
When to Ask vs. Infer: If the user's initial request already clearly specifies the goal, use case, and context (e.g., "Create an alert for P95 latency > 500ms for payment-service"), you may acknowledge these details in your response instead of re-asking. Only ask clarifying questions for information that is missing or ambiguous.
Determine which metrics are available and relevant:
Metric Discovery: What metrics are available?
- `_total` suffix → Counter
- `_bucket`, `_sum`, `_count` suffixes → Histogram
- `_created` suffix → Counter creation timestamp
Metric Type Identification: Confirm the metric type(s)
- Counter (e.g., `http_requests_total`, `errors_total`, `bytes_sent_total`): use `rate()`, `irate()`, `increase()`
- Gauge (e.g., `memory_usage_bytes`, `cpu_temperature_celsius`, `queue_length`): use `avg_over_time()`, `min_over_time()`, `max_over_time()`, or query the gauge directly
- Histogram (e.g., `http_request_duration_seconds_bucket`, `response_size_bytes_bucket`): use `histogram_quantile()` with `rate()`
- Summary (e.g., `rpc_duration_seconds{quantile="0.95"}`): use `_sum` and `_count` for averages; don't average quantiles
Label Discovery: What labels are available on these metrics?
- Common labels: `job`, `instance`, `environment`, `service`, `endpoint`, `status_code`, `method`
Use the AskUserQuestion tool to confirm metric names, types, and available labels.
Gather specific requirements for the query.
IMPORTANT: When the user has already specified parameters in their initial request (e.g., "5-minute window", "500ms threshold", "> 5% error rate"), you MUST present those values as the default option for confirmation rather than re-asking from scratch.
Example: If user says "alert when P95 latency exceeds 500ms", use:
AskUserQuestion:
- Question: "Confirm the alert threshold?"
- Options:
1. "500ms (as specified)" - Use the threshold from your request
2. "Different threshold" - Let me specify a different value
This respects the user's input and speeds up the workflow while still allowing modifications.
Time Range: What time window should the query cover?
- Common windows: `[5m]`, `[1h]`, `[1d]`
- Rule of thumb: `[1m]` to `[5m]` for real-time views, `[1h]` to `[1d]` for trends
Label Filtering: Which labels should filter the data?
- Equality: `job="api-server"`, `status_code="200"`
- Negation: `status_code!="200"`
- Regex: `instance=~"prod-.*"`
- Combined: `{job="api", environment="production"}`
Aggregation: Should the data be aggregated?
- Keep only listed labels with `by`: `sum by (job, endpoint)`, `avg by (instance)`
- Drop listed labels with `without`: `sum without (instance, pod)`, `avg without (job)`
- Operators: `sum`, `avg`, `max`, `min`, `count`, `topk`, `bottomk`
Thresholds or Conditions: Are there specific conditions?
Use the AskUserQuestion tool to gather or confirm these parameters. When the user has already provided values (e.g., "5-minute window", "> 5%"), present them as the default option for confirmation.
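A threshold condition is expressed in PromQL with a comparison operator applied to the computed expression. As a sketch, a 5% error-ratio condition might look like this (metric and label names are illustrative):

```promql
# Returns series only while the 5m error ratio exceeds 5%
(
  sum(rate(http_requests_total{job="api-server", status_code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{job="api-server"}[5m]))
) > 0.05
```

Because comparisons filter out non-matching series, an expression like this returns data only while the condition holds, which is exactly the behavior alerting rules rely on.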
BEFORE GENERATING ANY CODE, present a plain-English query plan and ask for user confirmation:
## PromQL Query Plan
Based on your requirements, here's what the query will do:
**Goal**: [Describe the monitoring goal in plain English]
**Query Structure**:
1. Start with metric: `[metric_name]`
2. Filter by labels: `{label1="value1", label2="value2"}`
3. Apply function: `[function_name]([metric][time_range])`
4. Aggregate: `[aggregation] by ([label_list])`
5. Additional operations: [any calculations, ratios, or transformations]
**Expected Output**:
- Data type: [instant vector/scalar]
- Labels in result: [list of labels]
- Value represents: [what the number means]
- Typical range: [expected value range]
**Example Interpretation**:
If the query returns `0.05`, it means: [plain English explanation]
**Does this match your intentions?**
- If yes, I'll generate the query and validate it
- If no, let me know what needs to change
Use the AskUserQuestion tool to confirm the plan, offering options such as "Looks good, generate the query" and "Needs changes". When the user requests changes, revise the plan and re-confirm it before generating any code.
Once the user confirms the plan, generate the actual PromQL query following best practices.
Before writing any query code, you MUST:
Identify the query category first (histogram, RED, USE, function-specific, optimization, etc.).
Read only the relevant reference section(s) using the Read tool:
- `references/metric_types.md` (Histogram section)
- `references/promql_patterns.md` (RED method section)
- `references/promql_patterns.md` (USE method section)
- `references/best_practices.md`
- `references/promql_functions.md`
If a needed reference cannot be read, state the issue and continue with best-effort generation using the most applicable documented pattern you already have.
Cite the applicable pattern or best practice in your response:
As documented in references/promql_patterns.md (Pattern 3: Latency Percentile):
# 95th percentile latency
histogram_quantile(0.95, sum by (le) (rate(...)))
Reference example files when generating similar queries:
Based on examples/red_method.promql (lines 64-82):
# P95 latency with proper histogram_quantile usage
This keeps generated queries aligned with documented patterns while avoiding unnecessary full-file rereads on iterative follow-ups.
Always Use Label Filters
# Good: Specific filtering reduces cardinality
rate(http_requests_total{job="api-server", environment="prod"}[5m])
# Bad: Matches all time series, high cardinality
rate(http_requests_total[5m])
Use Appropriate Functions for Metric Types
# Counter: Use rate() or increase()
rate(http_requests_total[5m])
# Gauge: Use directly or with *_over_time()
memory_usage_bytes
avg_over_time(memory_usage_bytes[5m])
# Histogram: Use histogram_quantile()
histogram_quantile(0.95,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
Apply Aggregations with by() or without()
# Aggregate by specific labels (keeps only these labels)
sum by (job, endpoint) (rate(http_requests_total[5m]))
# Aggregate without specific labels (removes these labels)
sum without (instance, pod) (rate(http_requests_total[5m]))
Use Exact Matches Over Regex When Possible
# Good: Faster exact match
http_requests_total{status_code="200"}
# Bad: Slower regex match when not needed
http_requests_total{status_code=~"200"}
Calculate Ratios Properly
# Error rate: errors / total requests
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
Use Recording Rules for Complex Queries
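As a sketch, a recording rule that follows the `level:metric:operation` naming convention might look like this (the rule group, metric, and file names are illustrative):

```yaml
# rules.yml - precompute the per-job request rate once,
# so dashboards and alerts query the cheap precomputed series
groups:
  - name: api_recording_rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Dashboards then query `job:http_requests:rate5m` directly instead of re-evaluating the aggregation on every panel refresh.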
- Naming convention: `level:metric:operation`
Format for Readability
# Good: Multi-line for complex queries
histogram_quantile(0.95,
sum by (le, job) (
rate(http_request_duration_seconds_bucket{job="api-server"}[5m])
)
)
Pattern 1: Request Rate
# Requests per second
rate(http_requests_total{job="api-server"}[5m])
# Total requests per second across all instances
sum(rate(http_requests_total{job="api-server"}[5m]))
Pattern 2: Error Rate
# Error ratio (0 to 1)
sum(rate(http_requests_total{job="api-server", status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="api-server"}[5m]))
# Error percentage (0 to 100)
(
sum(rate(http_requests_total{job="api-server", status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="api-server"}[5m]))
) * 100
Pattern 3: Latency Percentile (Histogram)
# 95th percentile latency
histogram_quantile(0.95,
sum by (le) (
rate(http_request_duration_seconds_bucket{job="api-server"}[5m])
)
)
Pattern 4: Resource Usage
# Current memory usage
process_resident_memory_bytes{job="api-server"}
# Average CPU usage over 5 minutes (counter, so use rate(), not avg_over_time())
rate(process_cpu_seconds_total{job="api-server"}[5m])
Pattern 5: Availability
# Percentage of up instances
(
count(up{job="api-server"} == 1)
/
count(up{job="api-server"})
) * 100
Pattern 6: Saturation/Queue Depth
# Average queue length
avg_over_time(queue_depth{job="worker"}[5m])
# Maximum queue depth in the last hour
max_over_time(queue_depth{job="worker"}[1h])
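Patterns like these plug directly into Prometheus alerting rules. A sketch wiring the error-ratio pattern (Pattern 2) into an alert, with illustrative names, threshold, and annotations:

```yaml
# alerts.yml - page when the error ratio stays above 5% for 10 minutes
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="api-server", status_code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="api-server"}[5m])) > 0.05
        for: 10m  # require the condition to hold before firing
        labels:
          severity: page
        annotations:
          summary: "Error ratio above 5% on api-server"
```

The `for:` clause suppresses flapping by requiring the condition to hold continuously before the alert fires.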
ALWAYS attempt to validate the generated query first using the devops-skills:promql-validator skill:
After generating the query, automatically invoke:
Skill(devops-skills:promql-validator)
The devops-skills:promql-validator skill will:
1. Check syntax correctness
2. Validate semantic logic (correct functions for metric types)
3. Identify anti-patterns and inefficiencies
4. Suggest optimizations
5. Explain what the query does
6. Verify it matches user intent
Validation checklist: syntax, semantics (functions match metric types), anti-patterns, performance, and intent match.
If validation fails, fix issues and re-validate until all checks pass.
If the validator skill is unavailable, fails to run, or cannot complete after two fix/re-validate cycles, fall back to manual review: check the query against the validation checklist yourself and mark any checks you could not complete as UNVERIFIED in the results.
IMPORTANT: Display Validation Results to User
After running validation, you MUST display the structured results to the user in this format:
## PromQL Validation Results
### Syntax Check
- Status: ✅ VALID / ⚠️ WARNING / ❌ ERROR / ⚠️ UNVERIFIED
- Issues: [list any syntax errors]
### Best Practices Check
- Status: ✅ OPTIMIZED / ⚠️ CAN BE IMPROVED / ❌ HAS ISSUES / ⚠️ UNVERIFIED
- Issues: [list any problems found]
- Suggestions: [list optimization opportunities]
### Validation Coverage
- Validator tool run: [successful / failed / unavailable]
- Checks completed: [syntax, semantics, anti-patterns, performance, intent-match]
- Checks skipped: [list any skipped checks, or "None"]
### Query Explanation
- **What it measures**: [plain English description]
- **Output labels**: [list labels in result, or "None (scalar)"]
- **Expected result structure**: [instant vector / scalar / etc.]
This transparency helps users understand the validation process and any recommendations.
After generation and validation (or manual fallback validation), provide the user with:
The Final Query:
[Generated and validated PromQL query]
Query Explanation:
How to Use It:
Customization Notes:
Related Queries:
Native histograms are stable in current Prometheus 3.x releases (Prometheus 3.0 shipped in November 2024). They offer significant advantages over classic histograms:
Important: Starting with Prometheus v3.8.0, native histograms are fully stable. However, scraping native histograms still requires explicit activation via the `scrape_native_histograms` configuration setting. Starting with v3.9, no feature flag is needed, but `scrape_native_histograms` must still be set explicitly.
# Classic histogram (requires _bucket suffix and le label)
histogram_quantile(0.95,
sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)
# Native histogram (simpler - no _bucket suffix, no le label needed)
histogram_quantile(0.95,
sum by (job) (rate(http_request_duration_seconds[5m]))
)
# Get observation count rate from native histogram
histogram_count(rate(http_request_duration_seconds[5m]))
# Get sum of observations from native histogram
histogram_sum(rate(http_request_duration_seconds[5m]))
# Calculate fraction of observations between two values
histogram_fraction(0, 0.1, rate(http_request_duration_seconds[5m]))
# Average request duration from native histogram
histogram_sum(rate(http_request_duration_seconds[5m]))
/
histogram_count(rate(http_request_duration_seconds[5m]))
Native histograms are identified by:
- No `_bucket` suffix on the metric name
- No `le` label in the time series
When querying, check if your Prometheus instance has native histograms enabled:
# prometheus.yml - Enable native histogram scraping
scrape_configs:
- job_name: 'my-app'
scrape_native_histograms: true  # Prometheus 3.x+
Prometheus 3.4+ supports custom bucket native histograms (schema -53), allowing classic histogram to native histogram conversion. This is a key migration path for users with existing classic histograms.
Benefits of NHCB:
Configuration (Prometheus 3.4+):
# prometheus.yml - Convert classic histograms to NHCB on scrape
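A minimal sketch of that configuration, assuming the `convert_classic_histograms_to_nhcb` scrape option introduced in Prometheus 3.4 (the job name is illustrative; verify the option name against your Prometheus version's scrape configuration reference):

```yaml
# prometheus.yml - Convert classic histograms to NHCB on scrape
scrape_configs:
  - job_name: 'my-app'
    convert_classic_histograms_to_nhcb: true  # Prometheus 3.4+
```

With this enabled, scraped classic histograms are stored as native histograms with custom buckets, so existing `histogram_quantile()` queries can be migrated to the simpler native-histogram form shown above.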