Metrics & Grafana

You are the observability operator for the Happy infrastructure. You can query live Prometheus metrics, manage Grafana dashboards as code, and investigate production behavior.

Environment Variables

Credentials are stored in the .env file at the repo root (gitignored), which defines the variables below. Load them before running any command:

GRAFANA_URL=...
GRAFANA_USER=...
GRAFANA_PASSWORD=...
GRAFANA_PROMETHEUS_UID=...

To load in shell:

set -a; source .env; set +a

All commands below use $GRAFANA_URL, $GRAFANA_USER, $GRAFANA_PASSWORD, and $GRAFANA_PROMETHEUS_UID from the environment.
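Since every command depends on these four variables, it can help to check the .env file before running anything. The following is a minimal sketch, not a full shell-compatible parser: `parse_env` and `missing_vars` are hypothetical helper names, and the parser only handles plain `KEY=VALUE` lines, `#` comments, an optional `export ` prefix, and one pair of surrounding quotes.

```python
REQUIRED = (
    "GRAFANA_URL",
    "GRAFANA_USER",
    "GRAFANA_PASSWORD",
    "GRAFANA_PROMETHEUS_UID",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skips blank lines and # comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment, ignore
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        env[key.strip()] = value
    return env


def missing_vars(env: dict[str, str]) -> list[str]:
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED if not env.get(name)]
```

For example, `missing_vars(parse_env(open(".env").read()))` returns an empty list when all four variables are set.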

Prerequisites

Install grafanactl

go install github.com/grafana/grafanactl/cmd/grafanactl@latest

Configure grafanactl

# Load env vars first
set -a; source .env; set +a

# Create a context for the Happy Grafana instance
grafanactl config set contexts.happy.grafana.server "$GRAFANA_URL"
grafanactl config set contexts.happy.grafana.user "$GRAFANA_USER"
grafanactl config set contexts.happy.grafana.password "$GRAFANA_PASSWORD"
grafanactl config set contexts.happy.grafana.org-id 1

# Switch to the context
grafanactl config use-context happy

# Verify
grafanactl config check

grafanactl CLI Reference

List resources

grafanactl resources list                    # List all resource types
grafanactl resources get dashboards          # List all dashboards
grafanactl resources get folders             # List all folders

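Pull dashboards (export to disk)

# Pull all dashboards into ./resources as JSON
grafanactl resources pull dashboards -p ./resources -o json

# Pull a specific dashboard
grafanactl resources pull dashboards/DASHBOARD_ID -p ./resources -o json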

Push dashboards (deploy from disk)

# Push all dashboards from ./resources
grafanactl resources push dashboards -p ./resources

# Push a specific dashboard
grafanactl resources push dashboards/DASHBOARD_ID -p ./resources

# IMPORTANT: Use --omit-manager-fields to keep dashboards editable from the Grafana UI
grafanactl resources push dashboards -p ./resources --omit-manager-fields

# Dry run (no changes)
grafanactl resources push dashboards -p ./resources --dry-run

Workflow: Edit a dashboard

# 1. Pull current state
mkdir -p /tmp/grafana-work
grafanactl resources pull dashboards -p /tmp/grafana-work -o json

# 2. Edit the JSON files (add panels, modify queries, etc.)

# 3. Push back — always use --omit-manager-fields to avoid locking the UI
grafanactl resources push dashboards -p /tmp/grafana-work --omit-manager-fields
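Step 2 of the workflow above is a manual edit of the pulled JSON. As a rough sketch, appending a time series panel could look like the following. `add_timeseries_panel` is a hypothetical helper, not part of grafanactl. It assumes you pass in the dashboard model itself (the object that holds `panels`); where that object sits inside a pulled file is not confirmed here, so inspect the file first. Panels nested inside collapsed rows are ignored.

```python
import copy


def add_timeseries_panel(dashboard: dict, title: str, expr: str,
                         datasource_uid: str, width: int = 12,
                         height: int = 8) -> dict:
    """Return a copy of the dashboard with one Prometheus panel appended."""
    updated = copy.deepcopy(dashboard)
    panels = updated.setdefault("panels", [])

    # Pick an id above every existing top-level panel id.
    next_id = max((p.get("id", 0) for p in panels), default=0) + 1

    # Place the new panel below the lowest existing panel.
    bottom = max(
        (p["gridPos"]["y"] + p["gridPos"]["h"] for p in panels if "gridPos" in p),
        default=0,
    )

    panels.append({
        "id": next_id,
        "type": "timeseries",
        "title": title,
        "datasource": {"type": "prometheus", "uid": datasource_uid},
        "gridPos": {"x": 0, "y": bottom, "w": width, "h": height},
        "targets": [{"refId": "A", "expr": expr}],
    })
    return updated
```

The function returns a modified copy rather than editing in place, so the pulled state stays available for comparison before pushing with --omit-manager-fields.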

Querying Prometheus Directly

Instant query (current value)

curl -s -u "$GRAFANA_USER:$GRAFANA_PASSWORD" \
  --data-urlencode 'query=YOUR_PROMQL_HERE' \
  "$GRAFANA_URL/api/datasources/proxy/uid/$GRAFANA_PROMETHEUS_UID/api/v1/query" \
  | python3 -m json.tool
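`json.tool` only pretty-prints the response. To pull out the numbers, a small parser along these lines may be more convenient. It relies on the standard Prometheus instant-query response shape (`status`, `data.resultType`, and `data.result[].value` as a `[timestamp, "value"]` pair); `instant_values` is a hypothetical helper name.

```python
import json


def instant_values(payload: str) -> list[tuple[dict, float]]:
    """Turn an instant-query response into (labels, value) pairs."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(body.get("error", "query failed"))

    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")

    # Each sample is {"metric": {...labels...}, "value": [ts, "string value"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]
```

Sample values arrive as strings, so the conversion to `float` is what makes them usable for sorting or thresholds.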

Range query (time series)

curl -s -u "$GRAFANA_USER:$GRAFANA_PASSWORD" \
  --data-urlencode 'query=YOUR_PROMQL_HERE' \
  --data-urlencode 'start=UNIX_TIMESTAMP' \
  --data-urlencode 'end=UNIX_TIMESTAMP' \
  --data-urlencode 'step=60' \
  "$GRAFANA_URL/api/datasources/proxy/uid/$GRAFANA_PROMETHEUS_UID/api/v1/query_range" \
  | python3 -m json.tool

List all metric names

curl -s -u "$GRAFANA_USER:$GRAFANA_PASSWORD" \
  "$GRAFANA_URL/api/datasources/proxy/uid/$GRAFANA_PROMETHEUS_UID/api/v1/label/__name__/values" \
  | python3 -c "import json,sys; [print(n) for n in json.load(sys.stdin)['data']]"

Filter metric names

# Find all RPC-related metrics
curl -s -u "$GRAFANA_USER:$GRAFANA_PASSWORD" \
  "$GRAFANA_URL/api/datasources/proxy/uid/$GRAFANA_PROMETHEUS_UID/api/v1/label/__name__/values" \
  | python3 -c "import json,sys; [print(n) for n in json.load(sys.stdin)['data'] if 'rpc' in n.lower()]"
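The one-liners above can be expanded into a small script when you want to explore an unfamiliar metric namespace. This sketch assumes the standard label-values response (`{"status": ..., "data": [names]}`); `filter_names` and `group_by_prefix` are hypothetical helper names, and grouping on the text before the first underscore is only a rough heuristic.

```python
import json
from collections import defaultdict


def metric_names(payload: str) -> list[str]:
    """Extract the metric names from a label-values response."""
    return json.loads(payload)["data"]


def filter_names(names: list[str], needle: str) -> list[str]:
    """Case-insensitive substring filter, like the 'rpc' one-liner."""
    return [n for n in names if needle.lower() in n.lower()]


def group_by_prefix(names: list[str]) -> dict[str, list[str]]:
    """Group names by the part before the first underscore."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in sorted(names):
        groups[name.split("_", 1)[0]].append(name)
    return dict(groups)
```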

Key Metrics

Application metrics (handy-server)

Metric	Type	Description
`rpc_calls_total`	counter	RPC calls by method and result (success, not_available, target_disconnected, timeout)
`rpc_call_duration_seconds_bucket`	histogram	RPC call duration by method
`rpc_lookup_retries_bucket`	histogram	Number of retries per socket lookup by method
`rpc_fetchsockets_timeouts_total`	counter	fetchSockets timeout count by context (lookup, presence)
`websocket_connections_total`	gauge	Active WebSocket connections by type
`websocket_events_total`	counter	WebSocket events by type
`http_requests_total`	counter	HTTP requests by method, route, status
`http_request_duration_seconds_bucket`	histogram	HTTP request duration by route
`session_cache_operations_total`	counter	Session cache hits/misses by operation
`session_alive_events_total`	counter	Session keepalive events
`machine_alive_events_total`	counter	Machine keepalive events
`database_records_total`	gauge	Record counts by table
`database_updates_skipped_total`	counter	Skipped DB updates by type

Useful PromQL queries

# RPC success rate by method
sum by(method) (rate(rpc_calls_total{result="success"}[5m]))
/ (sum by(method) (rate(rpc_calls_total[5m])))

# RPC failures by method and reason
sum by (method, result) (rate(rpc_calls_total{result!="success"}[5m]))

# RPC failures by type only
sum by (result) (rate(rpc_calls_total{result!="success"}[5m]))

# RPC P95 latency by method
histogram_quantile(0.95, sum by (method, le) (rate(rpc_call_duration_seconds_bucket[5m])))

# Socket lookup retry distribution (P95)
histogram_quantile(0.95, sum by (method, le) (rate(rpc_lookup_retries_bucket[5m])))

# fetchSockets timeout rate by context
sum by (context) (rate(rpc_fetchsockets_timeouts_total[5m]))

# HTTP error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Top routes by request rate
topk(10, sum by(method, route) (rate(http_requests_total[5m])))
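The P95 queries above return estimates, not exact values: `histogram_quantile` finds the bucket containing the target rank and interpolates linearly inside it. The sketch below reproduces that idea for classic histograms so you can sanity-check a result by hand. It is a simplified illustration, not Prometheus's implementation (no native histograms and no handling of non-monotonic buckets), and the bucket boundaries in the usage note are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) buckets.

    The list must include the +Inf bucket, whose count is the total.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations

    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break

    if i == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan

    # The first bucket is assumed to start at 0.
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]`, the 95th percentile lands in the 0.5 to 1.0 bucket, so the estimate can never be more precise than that bucket's width. A P95 that sits exactly on a bucket boundary is often a sign that the real value lies beyond the highest finite bucket.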
