Logs, metrics, tracing, and SLOs for reliable operation.
Build a complete observability stack with proactive monitoring that implements the three pillars (logs, metrics, and traces) for reliable operation.
Reference: docs/06-arquitetura/arquitetura.md
System: [DESCRIBE]
Stack: [TECHNOLOGIES]
Environments: [dev, staging, prod]
Define the logging strategy:
1. Log levels per environment:
- Dev: debug
- Staging: info
- Prod: info (debug enabled temporarily for targeted troubleshooting)
2. Structured format (JSON):
- timestamp
- level
- message
- correlationId/traceId
- userId (if authenticated)
- contextual metadata
3. What to log:
- HTTP requests (incoming/outgoing)
- Errors with stack traces
- Important business events
- System decisions
4. What to NEVER log:
- Unmasked PII
- Secrets/tokens
- Very large bodies
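The rules above can be sketched as a small helper that picks the level per environment and masks sensitive fields before they reach the logger. This is a minimal sketch, not a complete PII filter: the key list, the size cap, and the names `logLevelFor` and `redact` are assumptions made for illustration, and the input is assumed to be plain JSON-like data without cycles.

```javascript
// Hypothetical list of keys to mask; extend it for your own payloads.
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'secret', 'email']);
// Assumed cap so that very large bodies never end up in a log line.
const MAX_STRING_LENGTH = 1024;

// Log level per environment: debug in dev, info in staging and prod.
function logLevelFor(env) {
  return env === 'dev' ? 'debug' : 'info';
}

// Returns a copy of `value` with sensitive keys masked and long strings truncated.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    const out = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redact(inner);
    }
    return out;
  }
  if (typeof value === 'string' && value.length > MAX_STRING_LENGTH) {
    return value.slice(0, MAX_STRING_LENGTH) + '...[truncated]';
  }
  return value;
}
```

A call such as `logger.info('User updated', redact(payload))` would then keep the structure of the metadata while hiding the masked values.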
// Example: Node.js with Winston
const winston = require('winston');

const logger = winston.createLogger({
  level: 'info', // default level; override per environment
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: { service: 'api-service' },
  // A logger needs at least one transport to emit output
  transports: [new winston.transports.Console()]
});

// Request logging middleware (Express-style)
const requestLogger = (req, res, next) => {
  const start = Date.now();
  const correlationId = req.headers['x-correlation-id'];

  logger.info('Request started', {
    method: req.method,
    url: req.url,
    userAgent: req.get('User-Agent'),
    correlationId
  });

  // 'finish' fires once the response has been handed off to the OS
  res.on('finish', () => {
    const duration = Date.now() - start; // milliseconds
    logger.info('Request completed', {
      method: req.method,
      url: req.url,
      statusCode: res.statusCode,
      duration,
      correlationId
    });
  });

  next();
};
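The request logger reads `x-correlation-id` from the incoming headers, so requests that arrive without that header are logged with an undefined ID. One possible way to close that gap, sketched here under the assumption of Express-style `req`/`res` objects, is a middleware that reuses the caller's ID or generates one. The function name and the `req.correlationId` property are hypothetical.

```javascript
const { randomUUID } = require('node:crypto');

// Reuses the caller's x-correlation-id or generates a new one, exposes it on
// the request, and echoes it back in the response headers.
function correlationIdMiddleware(req, res, next) {
  const incoming = req.headers['x-correlation-id'];
  const correlationId =
    typeof incoming === 'string' && incoming.length > 0 ? incoming : randomUUID();

  req.headers['x-correlation-id'] = correlationId; // visible to later middleware
  req.correlationId = correlationId;               // hypothetical convenience property
  res.setHeader('x-correlation-id', correlationId);
  next();
}
```

Registering this middleware before the request logger would let every log line of a request carry the same ID.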
Define metrics using RED/USE:
**RED (for services):**
- Rate: requests per second
- Errors: error rate
- Duration: latency (p50, p95, p99)
**USE (for resources):**
- Utilization: % in use
- Saturation: queue length/wait time
- Errors: resource errors
**Business Metrics:**
- [domain-specific metric]
- [user metrics]
- [business metrics]
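To make the RED definitions concrete, the sketch below computes rate, error rate, and latency percentiles from raw request samples over a time window. It is illustrative only: a metrics backend normally derives these values from counters and histograms. The record shape (`status`, `durationMs`), the choice of counting only 5xx responses as errors, and the nearest-rank percentile method are assumptions.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarizes hypothetical request records collected during `windowSeconds`.
function redSummary(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs);
  const errors = requests.filter((r) => r.status >= 500).length;
  return {
    rate: requests.length / windowSeconds,                              // Rate
    errorRate: requests.length === 0 ? 0 : errors / requests.length,    // Errors
    p50: percentile(durations, 50),                                     // Duration
    p95: percentile(durations, 95),
    p99: percentile(durations, 99)
  };
}
```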
const prometheus = require('prom-client');

// Latency histogram; bucket values are upper bounds in seconds
const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.5, 1, 2, 5, 10]
});

const httpRequestTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status']
});

// Metrics middleware (Express-style)
const metricsMiddleware = (req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000; // seconds
    // req.route is an object in Express, so label with its path pattern
    // (e.g. '/users/:id') to keep label cardinality bounded; requests that
    // matched no route share one fixed label value.
    const route = req.route ? String(req.route.path) : 'unmatched';
    httpRequestDuration
      .labels(req.method, route, res.statusCode)
      .observe(duration);
    httpRequestTotal
      .labels(req.method, route, res.statusCode)
      .inc();
  });
  next();
};
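Prometheus histograms are cumulative: each bucket counts every observation less than or equal to its upper bound (`le`), and a final `+Inf` bucket counts all of them. The sketch below reproduces that counting in plain JavaScript with the bucket boundaries used in the histogram definition; the function name `bucketCounts` is an assumption, and the code is an illustration of the model rather than how prom-client is implemented.

```javascript
// Bucket upper bounds in seconds, copied from the histogram definition.
const BUCKETS = [0.1, 0.5, 1, 2, 5, 10];

// Returns cumulative counts per bucket, plus the implicit +Inf bucket.
function bucketCounts(durations, buckets = BUCKETS) {
  const counts = buckets.map((le) => ({
    le: String(le),
    count: durations.filter((d) => d <= le).length
  }));
  counts.push({ le: '+Inf', count: durations.length });
  return counts;
}
```

Because percentiles are estimated from these boundaries, a latency target that falls between 0.1 s and 0.5 s cannot be resolved precisely with this bucket list; adding boundaries near the targets you care about is one option to consider.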
// Example with OpenTelemetry
// NodeSDK is exported by @opentelemetry/sdk-node; the tracer API by @opentelemetry/api
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { trace } = require('@opentelemetry/api');
// Assumes the 1.x releases of these two packages, which export the names below
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'api-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'production'
  })
});
sdk.start();

// Tracers are obtained from the global API, not from the SDK instance
const tracer = trace.getTracer('api-server');

// Tracing middleware (Express-style)
const tracingMiddleware = (req, res, next) => {
  const span = tracer.startSpan('http-request');
  span.setAttributes({
    'http.method': req.method,
    // http.url is the full URL; http.target is only the path and query
    'http.url': `${req.protocol}://${req.get('host')}${req.originalUrl}`,
    'http.target': req.originalUrl
  });
  res.on('finish', () => {
    span.setAttributes({
      'http.status_code': res.statusCode
    });
    span.end();
  });
  next();
};
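Traces span services only when the trace context travels with the request. In the W3C Trace Context format, that context is carried in a `traceparent` header shaped as `version-traceid-parentid-flags`. OpenTelemetry propagators normally handle this for you; the sketch below only illustrates the format, for example to copy the `traceId` into log metadata. The function name is an assumption, and the parser accepts version `00` style headers only.

```javascript
// version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex), lowercase
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the parsed fields, or null when the header is missing or invalid.
function parseTraceparent(header) {
  if (typeof header !== 'string') return null;
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff and all-zero IDs are invalid per the specification.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 1) === 1 // lowest bit is the sampled flag
  };
}
```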
Service Level Objectives (SLOs):
- **Availability:** 99.9% (about 43 min of downtime per 30-day month)
- **Latency:** p95 < 200ms, p99 < 500ms
- **Error Rate:** < 0.1%
- **Throughput:** > 1000 RPS
Service Level Indicators (SLIs):
- **Availability:** Uptime percentage
- **Latency:** Response time percentiles
- **Error Rate:** Error percentage
- **Throughput:** Requests per second
Error Budget Calculation:
- Target Availability: 99.9%
- Monthly Budget: 43.2 minutes
- Current Month: [minutes used]
- Budget Remaining: [minutes remaining]
- Alert Threshold: 80% of the budget
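The monthly budget follows from the availability target: a 30-day month has 30 × 24 × 60 = 43,200 minutes, and 0.1% of that is 43.2 minutes. The sketch below turns this arithmetic into a small helper; the function names, the 30-day default, and the shape of the returned object are assumptions.

```javascript
// Allowed downtime in minutes for an availability target (e.g. 0.999) over `days`.
function errorBudgetMinutes(target, days = 30) {
  return days * 24 * 60 * (1 - target);
}

// Reports budget consumption and whether the alert threshold has been reached.
function budgetStatus(target, usedMinutes, alertFraction = 0.8, days = 30) {
  const budgetMinutes = errorBudgetMinutes(target, days);
  const consumedFraction = usedMinutes / budgetMinutes;
  return {
    budgetMinutes,
    usedMinutes,
    remainingMinutes: budgetMinutes - usedMinutes,
    consumedFraction,
    alert: consumedFraction >= alertFraction
  };
}
```

With a 99.9% target, 36 minutes of downtime consumes about 83% of the budget and would cross the 80% threshold, while 20 minutes (about 46%) would not.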
**Critical (PagerDuty):**
- Service down
- Error rate > 5%
- Latency p99 > 1s
- Database connections exhausted
**Warning (Slack):**
- Error rate > 1%
- Latency p95 > 500ms
- Memory usage > 80%
- Queue depth > 100
**Info (Email):**
- Deployments
- Configuration changes
- Performance degradation
- New alerts created
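The critical and warning thresholds above can be expressed as a single classification function, which is convenient for unit-testing the policy before encoding it in an alerting system. This is a sketch under assumptions: the snapshot field names are hypothetical, rates and memory usage are fractions between 0 and 1, and latencies are in milliseconds.

```javascript
// Maps a hypothetical metrics snapshot to the severity tiers listed above.
function classifyAlert({
  up = true,
  errorRate = 0,
  p95Ms = 0,
  p99Ms = 0,
  memoryUsage = 0,
  queueDepth = 0,
  dbConnectionsExhausted = false
}) {
  // Critical: page the on-call engineer
  if (!up || errorRate > 0.05 || p99Ms > 1000 || dbConnectionsExhausted) {
    return 'critical';
  }
  // Warning: notify the team channel
  if (errorRate > 0.01 || p95Ms > 500 || memoryUsage > 0.8 || queueDepth > 100) {
    return 'warning';
  }
  return 'ok';
}
```

Checking the critical conditions first means a snapshot that breaches both tiers is reported once, at the higher severity.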
# Prometheus Alertmanager