Prometheus
Prometheus configuration and usage
Overview
Prometheus is the metrics monitoring system for all services.
URL: http://localhost:9090 (internal)
Configuration
Main File
/srv/prometheus/config/prometheus.yml
Structure
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

rule_files:
  - /srv/prometheus/rules/*.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'caddy'
    static_configs:
      - targets: ['localhost:2019']

Add a New Service
1. Expose metrics in the app
Python (with prometheus_client):
from prometheus_client import Counter, Histogram, generate_latest
from fastapi import FastAPI, Response

app = FastAPI()

# Request counter, labeled by method, endpoint, and status code
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Request latency histogram, labeled by endpoint
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'Request latency in seconds',
    ['endpoint']
)
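# --- Sketch (not part of the original example): one way to actually record
# these metrics, assuming FastAPI's standard HTTP middleware hook is acceptable.
import time

@app.middleware("http")
async def record_request_metrics(request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    elapsed = time.perf_counter() - start
    # Observe latency and count the request with its method/endpoint/status labels
    REQUEST_LATENCY.labels(endpoint=request.url.path).observe(elapsed)
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=str(response.status_code),
    ).inc()
    return response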
@app.get("/metrics")
def metrics():
return Response(
generate_latest(),
media_type="text/plain"
)2. Agregar a prometheus.yml
scrape_configs:
  # ... other jobs ...

  - job_name: 'mi-proyecto'
    static_configs:
      - targets: ['localhost:8105']
    metrics_path: /metrics

3. Reload Prometheus
# The /-/reload endpoint only works if Prometheus runs with --web.enable-lifecycle
curl -X POST http://localhost:9090/-/reload

4. Verify
# Check the target
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="mi-proyecto")'
# Check the metrics
curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="mi-proyecto"}'

PromQL Queries
Basics
# Uptime of a service
up{job="mi-proyecto"}
# All series of a job
{job="mi-proyecto"}
# Filter by label
http_requests_total{method="POST", endpoint="/api/users"}
Aggregations
# Requests per second (rate)
rate(http_requests_total[5m])
# Sum of requests per endpoint
sum by (endpoint) (rate(http_requests_total[5m]))
# Average latency (from the histogram's _sum and _count series)
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# 99th percentile latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
Alerts (in rules)
# Service down
up == 0
# High latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
# High error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
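The same queries can also be run programmatically against the HTTP API (/api/v1/query). A minimal sketch in Python, assuming the requests package is installed and Prometheus is reachable on localhost:9090; the instant_query helper is only illustrative:

import requests

PROMETHEUS = "http://localhost:9090"

def instant_query(expr: str):
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": expr}, timeout=5)
    resp.raise_for_status()
    body = resp.json()
    if body["status"] != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

# Example: print up/down state per job (value[1] is the sample value as a string)
for sample in instant_query("up"):
    print(sample["metric"].get("job", "?"), sample["value"][1])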
Alert Rules
Rules file
/srv/prometheus/rules/alerts.yml
Example
groups:
  - name: services
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} is down"
          description: "Service {{ $labels.job }} has not been responding for more than 2 minutes"

      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.job }}"

      - alert: HighErrorRate
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.job }}"
Useful Commands

# Prometheus status
systemctl status prometheus
# Logs
journalctl -u prometheus -f
# Validate config
promtool check config /srv/prometheus/config/prometheus.yml
# Validate rules
promtool check rules /srv/prometheus/rules/alerts.yml
# Reload config
curl -X POST http://localhost:9090/-/reload
# API: targets
curl http://localhost:9090/api/v1/targets
# API: query
curl 'http://localhost:9090/api/v1/query?query=up'

Troubleshooting
Target “down”
# Check that the service is running
systemctl status mi-proyecto.service
# Check the metrics endpoint
curl http://localhost:8105/metrics

Metrics not appearing
# Check the scrape interval (the config API returns the raw YAML in .data.yaml)
curl -s http://localhost:9090/api/v1/status/config | jq -r '.data.yaml' | grep scrape_interval
# See scrape errors on down targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down") | {job: .labels.job, lastError: .lastError}'
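For a quick overview of every target at once, the same /api/v1/targets endpoint can be scripted. A minimal sketch, assuming the requests package is installed:

import requests

# Print job, health, and last scrape error for every active target
resp = requests.get("http://localhost:9090/api/v1/targets", timeout=5)
resp.raise_for_status()
for target in resp.json()["data"]["activeTargets"]:
    job = target["labels"].get("job", "?")
    health = target["health"]            # "up", "down", or "unknown"
    error = target["lastError"] or "-"   # empty string when the last scrape succeeded
    print(f"{job:20} {health:8} {error}")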