Query and manage Grafana dashboards and Prometheus metrics for the Happy infrastructure. Covers grafanactl CLI usage, direct Prometheus queries through the Grafana datasource proxy, and dashboard-as-code workflows. Use when the user asks about metrics, dashboards, monitoring, Grafana, or Prometheus, or wants to add or modify panels.
You are the observability operator for the Happy infrastructure. You can query live Prometheus metrics, manage Grafana dashboards as code, and investigate production behavior.
Credentials are stored in the repo root .env file (gitignored). Load them before running commands:
GRAFANA_URL=...
GRAFANA_USER=...
GRAFANA_PASSWORD=...
GRAFANA_PROMETHEUS_UID=...
To load in shell:
set -a; source .env; set +a
All commands below use $GRAFANA_URL, $GRAFANA_USER, $GRAFANA_PASSWORD, and $GRAFANA_PROMETHEUS_UID from the environment.
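Since every later command depends on these four variables, it can help to check them before running anything. The following is a minimal Python sketch of such a check; the helper name `missing_vars` is hypothetical and not part of grafanactl or Grafana.

```python
import os

# Variable names taken from the .env listing above.
REQUIRED_VARS = (
    "GRAFANA_URL",
    "GRAFANA_USER",
    "GRAFANA_PASSWORD",
    "GRAFANA_PROMETHEUS_UID",
)


def missing_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_vars()` after sourcing .env should return an empty list; any names it returns still need to be added to the file.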
go install github.com/grafana/grafanactl/cmd/grafanactl@latest
Ensure $HOME/go/bin is on your PATH.
# Load env vars first
set -a; source .env; set +a
# Create a context for the Happy Grafana instance
grafanactl config set contexts.happy.grafana.server "$GRAFANA_URL"
grafanactl config set contexts.happy.grafana.user "$GRAFANA_USER"
grafanactl config set contexts.happy.grafana.password "$GRAFANA_PASSWORD"
grafanactl config set contexts.happy.grafana.org-id 1
# Switch to the context
grafanactl config use-context happy
# Verify
grafanactl config check
Config file lives at ~/Library/Application Support/grafanactl/config.yaml (macOS) or ~/.config/grafanactl/config.yaml (Linux).
grafanactl resources list # List all resource types
grafanactl resources get dashboards # List all dashboards
grafanactl resources get folders # List all folders
grafanactl resources pull dashboards -p ./resources -o json
grafanactl resources pull dashboards/DASHBOARD_ID -p ./resources -o json
# Push all dashboards from ./resources
grafanactl resources push dashboards -p ./resources
# Push a specific dashboard
grafanactl resources push dashboards/DASHBOARD_ID -p ./resources
# IMPORTANT: Use --omit-manager-fields to keep dashboards editable from the Grafana UI
grafanactl resources push dashboards -p ./resources --omit-manager-fields
# Dry run (no changes)
grafanactl resources push dashboards -p ./resources --dry-run
# 1. Pull current state
mkdir -p /tmp/grafana-work
grafanactl resources pull dashboards -p /tmp/grafana-work -o json
# 2. Edit the JSON files (add panels, modify queries, etc.)
# 3. Push back — always use --omit-manager-fields to avoid locking the UI
grafanactl resources push dashboards -p /tmp/grafana-work --omit-manager-fields
Warning: Pushing without --omit-manager-fields marks the dashboard as "provisioned" and locks it from UI edits. Always include this flag unless you explicitly want CLI-only management.
You can query Prometheus through Grafana's datasource proxy API. This is useful for live investigation without touching the Grafana UI.
curl -s -u "$GRAFANA_USER:$GRAFANA_PASSWORD" \
--data-urlencode 'query=YOUR_PROMQL_HERE' \
"$GRAFANA_URL/api/datasources/proxy/uid/$GRAFANA_PROMETHEUS_UID/api/v1/query" \
| python3 -m json.tool
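Instead of pretty-printing the raw JSON, the instant-query response can be reduced to label/value pairs. This is a sketch that assumes the standard Prometheus HTTP API response shape (`status`, `data.resultType`, `data.result`); the function name `vector_rows` is hypothetical.

```python
import json


def vector_rows(response_text):
    """Turn a Prometheus instant-query response into (labels, value) pairs."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample looks like {"metric": {label: value}, "value": [timestamp, "string"]}.
    return [(sample["metric"], float(sample["value"][1])) for sample in data["result"]]
```

To use it with the curl command above, pass the response body (for example `sys.stdin.read()` when piping) to `vector_rows` and print the pairs.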
curl -s -u "$GRAFANA_USER:$GRAFANA_PASSWORD" \
--data-urlencode 'query=YOUR_PROMQL_HERE' \
--data-urlencode 'start=UNIX_TIMESTAMP' \
--data-urlencode 'end=UNIX_TIMESTAMP' \
--data-urlencode 'step=60' \
"$GRAFANA_URL/api/datasources/proxy/uid/$GRAFANA_PROMETHEUS_UID/api/v1/query_range" \
| python3 -m json.tool
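The `start` and `end` placeholders are Unix timestamps in seconds. As a rough sketch, the parameters for a "last N seconds" window can be computed as shown here. The helper names are hypothetical, and the URL builder produces a GET query string, whereas the curl command above sends the same fields as form data; Prometheus accepts both.

```python
import time
from urllib.parse import urlencode


def range_params(promql, lookback_seconds=3600, step=60, now=None):
    """Build query_range parameters covering the last `lookback_seconds`."""
    now = int(time.time()) if now is None else int(now)
    return {
        "query": promql,
        "start": now - lookback_seconds,
        "end": now,
        "step": step,
    }


def range_query_url(grafana_url, datasource_uid, params):
    """Build the datasource-proxy URL for a range query (GET form)."""
    base = grafana_url.rstrip("/")
    path = f"/api/datasources/proxy/uid/{datasource_uid}/api/v1/query_range"
    return f"{base}{path}?{urlencode(params)}"
```

A smaller `step` returns more points per series, so widening the lookback usually calls for a larger step as well.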
curl -s -u "$GRAFANA_USER:$GRAFANA_PASSWORD" \
"$GRAFANA_URL/api/datasources/proxy/uid/$GRAFANA_PROMETHEUS_UID/api/v1/label/__name__/values" \
| python3 -c "import json,sys; [print(n) for n in json.load(sys.stdin)['data']]"
# Find all RPC-related metrics
curl -s -u "$GRAFANA_USER:$GRAFANA_PASSWORD" \
"$GRAFANA_URL/api/datasources/proxy/uid/$GRAFANA_PROMETHEUS_UID/api/v1/label/__name__/values" \
| python3 -c "import json,sys; [print(n) for n in json.load(sys.stdin)['data'] if 'rpc' in n.lower()]"
| Metric | Type | Description |
|---|---|---|
| rpc_calls_total | counter | RPC calls by method and result (success, not_available, target_disconnected, timeout) |
| rpc_call_duration_seconds_bucket | histogram | RPC call duration by method |
| rpc_lookup_retries_bucket | histogram | Number of retries per socket lookup by method |
| rpc_fetchsockets_timeouts_total | counter | fetchSockets timeout count by context (lookup, presence) |
| websocket_connections_total | gauge | Active WebSocket connections by type |
| websocket_events_total | counter | WebSocket events by type |
| http_requests_total | counter | HTTP requests by method, route, status |
| http_request_duration_seconds_bucket | histogram | HTTP request duration by route |
| session_cache_operations_total | counter | Session cache hits/misses by operation |
| session_alive_events_total | counter | Session keepalive events |
| machine_alive_events_total | counter | Machine keepalive events |
| database_records_total | gauge | Record counts by table |
| database_updates_skipped_total | counter | Skipped DB updates by type |
# RPC success rate by method
sum by(method) (rate(rpc_calls_total{result="success"}[5m]))
/ (sum by(method) (rate(rpc_calls_total[5m])))
# RPC failures by method and reason
sum by (method, result) (rate(rpc_calls_total{result!="success"}[5m]))
# RPC failures by type only
sum by (result) (rate(rpc_calls_total{result!="success"}[5m]))
# RPC P95 latency by method
histogram_quantile(0.95, sum by (method, le) (rate(rpc_call_duration_seconds_bucket[5m])))
# Socket lookup retry distribution (P95)
histogram_quantile(0.95, sum by (method, le) (rate(rpc_lookup_retries_bucket[5m])))
# fetchSockets timeout rate by context
sum by (context) (rate(rpc_fetchsockets_timeouts_total[5m]))
# HTTP error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Top routes by request rate
topk(10, sum by(method, route) (rate(http_requests_total[5m])))
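The P95 queries above rely on `histogram_quantile`, which estimates a quantile by interpolating linearly inside the bucket that contains the target rank. The following is a simplified Python sketch of that idea for classic cumulative buckets, not Prometheus's actual implementation. It assumes the buckets are sorted by upper bound, end with a `+Inf` bucket, include at least one finite bucket, and that `0 < q <= 1`.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-2][0]
```

With made-up sample buckets `[(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]`, the 0.95 quantile lands halfway through the 0.5 to 1.0 bucket, so the estimate is about 0.75 seconds. This is why reported percentiles can never be more precise than the bucket boundaries.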
When adding panels to a dashboard JSON, follow this pattern:
- Datasource: {"type": "prometheus", "uid": "$GRAFANA_PROMETHEUS_UID"}
- id: unique per panel (check existing panels for the max id)
- gridPos: h = height (8 standard), w = width (12 half, 24 full), x = column (0 or 12), y = row
- Panel types: stat, timeseries, piechart, bargauge, table
- unit: percentunit, ops, s, short, reqps
- Use --dry-run on push to preview changes
- The --omit-manager-fields flag is essential for hybrid CLI+UI workflows
- Use date +%s to get the current time as a Unix timestamp
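The panel conventions above (Prometheus datasource reference, next free id, gridPos sizing, unit) can be sketched in Python as follows. The function name `add_timeseries_panel` is hypothetical, and it operates on whichever object holds the `panels` list; in files pulled by grafanactl that object may be nested under a wrapper key, so check the pulled JSON before applying it.

```python
import copy


def add_timeseries_panel(dashboard, title, expr, datasource_uid,
                         unit="short", half_width=True):
    """Return a copy of `dashboard` with one more Prometheus panel appended."""
    result = copy.deepcopy(dashboard)
    panels = result.setdefault("panels", [])
    # Next id is one above the current maximum, as the pattern above suggests.
    next_id = max((p.get("id", 0) for p in panels), default=0) + 1
    # Place the new panel below everything that already exists.
    next_y = max(
        (p.get("gridPos", {}).get("y", 0) + p.get("gridPos", {}).get("h", 0)
         for p in panels),
        default=0,
    )
    datasource = {"type": "prometheus", "uid": datasource_uid}
    panels.append({
        "id": next_id,
        "type": "timeseries",
        "title": title,
        "datasource": dict(datasource),
        "gridPos": {"h": 8, "w": 12 if half_width else 24, "x": 0, "y": next_y},
        "fieldConfig": {"defaults": {"unit": unit}, "overrides": []},
        "targets": [{"datasource": dict(datasource), "expr": expr, "refId": "A"}],
    })
    return result
```

Pass the value of `GRAFANA_PROMETHEUS_UID` as `datasource_uid`, write the result back to the pulled file, and preview the push with --dry-run before applying it with --omit-manager-fields.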