Use when working with OpenTelemetry: collector management, pipeline configuration, exporter health, instrumentation analysis, and receiver status. Covers collector metrics, pipeline topology, processor performance, batch/queue monitoring, and SDK configuration review. Use when managing OTel collectors, analyzing pipeline health, reviewing exporter status, or troubleshooting instrumentation.
Monitor and manage OpenTelemetry collectors, pipelines, and instrumentation health.
OTel collectors expose health and metrics endpoints:
- http://<collector>:13133/ (health_check extension)
- http://<collector>:8888/metrics (Prometheus format)
- http://<collector>:55679/debug/tracez (zpages extension)
- http://<collector>:1777/debug/pprof/ (pprof extension, performance profiling)
The helpers below fetch these endpoints with curl; filter the output with grep and awk.
#!/bin/bash
otel_metrics() {
local host="${1:-localhost}"
local port="${2:-8888}"
curl -s "http://${host}:${port}/metrics"
}
otel_health() {
local host="${1:-localhost}"
local port="${2:-13133}"
curl -s "http://${host}:${port}/"
}
otel_zpages() {
local host="${1:-localhost}"
local port="${2:-55679}"
local path="${3:-tracez}"
curl -s "http://${host}:${port}/debug/${path}"
}
# Parse Prometheus metrics for a specific metric name
otel_metric_value() {
local host="$1"
local metric_name="$2"
otel_metrics "$host" | grep "^${metric_name}" | grep -v "^#"
}
{
otel_health "collector-1" &
otel_health "collector-2" &
otel_metrics "collector-1" | grep "otelcol_receiver" &
otel_metrics "collector-2" | grep "otelcol_receiver" &
}
wait
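Counters such as the receiver metrics are split across label sets (one series per receiver and transport), so a total usually has to be summed. The following is a minimal sketch, assuming Prometheus text format on stdin with no trailing timestamps; `sum_metric` is a hypothetical helper name, not part of the collector tooling.

```shell
#!/bin/bash
# Sum every sample of one metric read from stdin (Prometheus text format).
# sum_metric is a hypothetical helper introduced for illustration.
sum_metric() {
  local metric_name="$1"
  awk -v name="$metric_name" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      split($1, parts, "[{]")           # metric name is the text before "{"
      if (parts[1] == name) { total += $NF; found = 1 }
    }
    END {
      if (found) printf "%.0f\n", total
      else print "absent"               # distinguish a missing metric from zero
    }
  '
}
```

A possible use is `otel_metrics "$COLLECTOR" | sum_metric otelcol_receiver_accepted_spans`. Some collector versions appear to append a `_total` suffix to counters, so the exact metric name may need adjusting.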
NEVER assume collector endpoints, pipeline names, or exporter types. ALWAYS discover first.
#!/bin/bash
COLLECTOR="${1:-localhost}"
echo "=== Collector Health ==="
otel_health "$COLLECTOR"
echo "=== Active Pipelines ==="
otel_metrics "$COLLECTOR" | grep "otelcol_process" | grep -v "^#" | head -5
echo "=== Configured Receivers ==="
otel_metrics "$COLLECTOR" | grep "otelcol_receiver_accepted" | grep -v "^#" \
| sed 's/.*receiver="\([^"]*\)".*/\1/' | sort -u
echo "=== Configured Exporters ==="
otel_metrics "$COLLECTOR" | grep "otelcol_exporter_sent" | grep -v "^#" \
| sed 's/.*exporter="\([^"]*\)".*/\1/' | sort -u
echo "=== Configured Processors ==="
otel_metrics "$COLLECTOR" | grep "otelcol_processor" | grep -v "^#" \
| sed 's/.*processor="\([^"]*\)".*/\1/' | sort -u
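The three discovery pipelines above share one pattern: pull a label value out of each series and deduplicate. A minimal sketch of that pattern as a reusable filter follows; `label_values` is a hypothetical name, and it assumes the label name is a plain identifier with no regex metacharacters.

```shell
#!/bin/bash
# Print the distinct values of one label from Prometheus text on stdin.
# label_values is a hypothetical helper introduced for illustration.
label_values() {
  local label="$1"
  # -n with the p flag prints only lines where the label was actually found,
  # so comment lines and series without the label are dropped.
  sed -n "s/.*[{,]${label}=\"\([^\"]*\)\".*/\1/p" | sort -u
}
```

For example, `otel_metrics "$COLLECTOR" | grep "otelcol_exporter_sent" | label_values exporter` would list the exporter names that currently report sent counters.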
#!/bin/bash
COLLECTOR="${1:-localhost}"
echo "=== Collector Process Metrics ==="
{
echo "--- Uptime & Resource Usage ---"
otel_metric_value "$COLLECTOR" "otelcol_process_uptime" &
otel_metric_value "$COLLECTOR" "otelcol_process_memory_rss" &
otel_metric_value "$COLLECTOR" "otelcol_process_cpu_seconds" &
}
wait
echo ""
echo "=== Build Info ==="
otel_metrics "$COLLECTOR" | grep "otelcol_build_info" | grep -v "^#"
#!/bin/bash
COLLECTOR="${1:-localhost}"
echo "=== Receiver Throughput (accepted vs refused) ==="
otel_metrics "$COLLECTOR" | grep -E "otelcol_receiver_(accepted|refused)_" | grep -v "^#" \
| awk -F'[{}]' '{split($2,a,","); for(i in a) if(a[i] ~ /receiver=/) print a[i], $0}' \
| head -20
echo ""
echo "=== Exporter Throughput (sent vs failed) ==="
otel_metrics "$COLLECTOR" | grep -E "otelcol_exporter_(sent|send_failed)_" | grep -v "^#" \
| head -20
echo ""
echo "=== Processor Metrics ==="
otel_metrics "$COLLECTOR" | grep "otelcol_processor" | grep -v "^#" | head -15
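Raw accepted/refused and sent/failed counters are easier to judge as a ratio. The sketch below computes the failed share of all attempts from two already-extracted counter values; `loss_percent` is a hypothetical helper, and treating "refused" or "send_failed" as the failed count is an assumption about how you want to define loss.

```shell
#!/bin/bash
# Percentage of failed items out of all attempts (ok + failed).
# loss_percent is a hypothetical helper introduced for illustration.
loss_percent() {
  local ok="$1" bad="$2"
  awk -v ok="$ok" -v bad="$bad" 'BEGIN {
    total = ok + bad
    if (total == 0) { print "n/a"; exit }   # no traffic yet, ratio undefined
    printf "%.2f\n", 100 * bad / total
  }'
}
```

Because these are cumulative counters, the ratio covers the whole collector uptime; comparing two samples taken a few minutes apart gives a better picture of current loss.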
#!/bin/bash
COLLECTOR="${1:-localhost}"
echo "=== Exporter Queue Size ==="
otel_metric_value "$COLLECTOR" "otelcol_exporter_queue_size"
otel_metric_value "$COLLECTOR" "otelcol_exporter_queue_capacity"
echo ""
echo "=== Batch Processor Stats ==="
otel_metric_value "$COLLECTOR" "otelcol_processor_batch_batch_send_size_sum"
otel_metric_value "$COLLECTOR" "otelcol_processor_batch_batch_send_size_count"
otel_metric_value "$COLLECTOR" "otelcol_processor_batch_timeout_trigger_send"
echo ""
echo "=== Retry Queue ==="
otel_metric_value "$COLLECTOR" "otelcol_exporter_enqueue_failed_spans"
otel_metric_value "$COLLECTOR" "otelcol_exporter_enqueue_failed_metric_points"
otel_metric_value "$COLLECTOR" "otelcol_exporter_enqueue_failed_log_records"
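Queue size only means something relative to capacity, so it can help to join the two series per exporter. The following sketch does that for metrics text on stdin; `queue_utilization` is a hypothetical name, and it assumes one size and one capacity series per `exporter` label value.

```shell
#!/bin/bash
# Print "<exporter> <percent>%" for each exporter queue found on stdin.
# queue_utilization is a hypothetical helper introduced for illustration.
queue_utilization() {
  awk '
    /^otelcol_exporter_queue_(size|capacity)[{]/ {
      # Extract the value of the exporter label: exporter="<name>"
      if (match($0, /exporter="[^"]*"/)) {
        name = substr($0, RSTART + 10, RLENGTH - 11)
        if ($0 ~ /^otelcol_exporter_queue_size/) size[name] = $NF + 0
        else cap[name] = $NF + 0
      }
    }
    END {
      for (name in cap)
        if (cap[name] > 0)
          printf "%s %.0f%%\n", name, 100 * size[name] / cap[name]
    }
  ' | sort
}
```

Usage could look like `otel_metrics "$COLLECTOR" | queue_utilization`. What counts as "near capacity" is a judgment call; the source does not name a threshold.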
#!/bin/bash
# Review collector config file
CONFIG_PATH="${1:-/etc/otelcol/config.yaml}"
echo "=== Collector Configuration ==="
if [ -f "$CONFIG_PATH" ]; then
echo "--- Receivers ---"
grep -A2 "^receivers:" "$CONFIG_PATH" | head -10
echo "--- Processors ---"
grep -A2 "^processors:" "$CONFIG_PATH" | head -10
echo "--- Exporters ---"
grep -A2 "^exporters:" "$CONFIG_PATH" | head -10
echo "--- Service Pipelines ---"
grep -A10 "^service:" "$CONFIG_PATH" | head -15
else
echo "Config file not found at $CONFIG_PATH"
fi
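The `grep -A2` calls above show only the first two lines under each key, which truncates longer sections. As a sketch of an alternative, the function below prints a whole top-level YAML section; `yaml_section` is a hypothetical helper, and it assumes top-level keys start in column one and nested content is indented.

```shell
#!/bin/bash
# Print one top-level section of a YAML file, from its key to the next
# top-level key. yaml_section is a hypothetical helper for illustration.
yaml_section() {
  local key="$1" file="$2"
  awk -v key="$key" '
    $0 ~ "^" key ":" { inside = 1; print; next }  # start of the wanted section
    inside && /^[^[:space:]#]/ { exit }           # next top-level key ends it
    inside { print }
  ' "$file"
}
```

For instance, `yaml_section service "$CONFIG_PATH"` would print the full service block, including all pipelines, rather than a fixed number of lines.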
#!/bin/bash
# Default to three example hosts when no arguments are given
COLLECTORS="${*:-collector-1 collector-2 collector-3}"
echo "=== Fleet Health Summary ==="
for host in $COLLECTORS; do
{
# Decide on the exit code of the health request and discard the response body;
# otel_health fails (non-zero) when the collector is unreachable
if otel_health "$host" >/dev/null 2>&1; then
uptime=$(otel_metric_value "$host" "otelcol_process_uptime" | awk '{print $NF}')
mem=$(otel_metric_value "$host" "otelcol_process_memory_rss" | awk '{printf "%.0fMB", $NF/1048576}')
# printf expands \t; plain echo in bash would print it literally
printf '%s\tUP\tuptime:%ss\tmem:%s\n' "$host" "$uptime" "$mem"
else
printf '%s\tDOWN\n' "$host"
fi
} &
done
wait
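A fleet listing is easier to scan with a one-line total. The sketch below counts rows by status; `fleet_summary` is a hypothetical helper, and it assumes its input has one row per collector with whitespace-separated fields where the second field is `UP` or `DOWN`.

```shell
#!/bin/bash
# Summarize rows of "<host> <status> ..." read from stdin.
# fleet_summary is a hypothetical helper introduced for illustration.
fleet_summary() {
  awk '
    NF >= 2 { total++; if ($2 == "UP") up++ }   # ignore blank or malformed rows
    END { printf "%d collectors | %d up | %d down\n", total, up, total - up }
  '
}
```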
Present results as a structured report:
Monitoring OpenTelemetry Report
═══════════════════════════════
Resources discovered: [count]
Resource Status Key Metric Issues
──────────────────────────────────────────────
[name] [ok/warn] [value] [findings]
Summary: [total] resources | [ok] healthy | [warn] warnings | [crit] critical
Action Items: [list of prioritized findings]
Target ≤50 lines of output. Use tables for multi-resource comparisons.
| Shortcut | Counter | Why |
|---|---|---|
| "I'll skip discovery and check known resources" | Always run Phase 1 discovery first | Resource names change, new resources appear — assumed names cause errors |
| "The user only asked for a quick check" | Follow the full discovery → analysis flow | Quick checks miss critical issues; structured analysis catches silent failures |
| "Default configuration is probably fine" | Audit configuration explicitly | Defaults often leave logging, security, and optimization features disabled |
| "Metrics aren't needed for this" | Always check relevant metrics when available | API/CLI responses show current state; metrics reveal trends and intermittent issues |
| "I don't have access to that" | Try the command and report the actual error | Assumed permission failures prevent useful investigation; actual errors are informative |
- traces, metrics, logs): filter accordingly
- Compare otelcol_exporter_queue_size against otelcol_exporter_queue_capacity: near-capacity indicates backpressure
- Check otelcol_receiver_refused_* and otelcol_exporter_send_failed_* for data loss
- Run otelcol validate --config=config.yaml before applying changes
- Watch otelcol_processor_refused_spans for memory pressure drops
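Validation before applying a change can be wrapped so that a failed check blocks the rollout step. This is a minimal sketch: `validate_config` is a hypothetical wrapper, and the binary name is a parameter because distributions differ (for example `otelcol-contrib`), which is an assumption about your install.

```shell
#!/bin/bash
# Run "<binary> validate --config=<file>" and report the outcome.
# validate_config is a hypothetical wrapper introduced for illustration.
validate_config() {
  local config="$1" binary="${2:-otelcol}"
  if ! command -v "$binary" >/dev/null 2>&1; then
    echo "validator not found: $binary"
    return 2
  fi
  if "$binary" validate --config="$config"; then
    echo "valid: $config"
  else
    echo "invalid: $config"
    return 1
  fi
}
```

A caller could then gate the change with something like `validate_config config.yaml && echo "safe to apply"`, leaving the actual apply step to whatever deployment mechanism is in use.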