# Monitoring
Made Open ships with an optional monitoring overlay that adds Prometheus, Grafana, and Loki/Promtail for metrics and log aggregation.
## Enabling Monitoring
Start the monitoring stack alongside the base services:

```bash
# Development
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d

# Production
docker compose -f docker-compose.prod.yml -f docker-compose.monitoring.yml up -d
```
To stop only the monitoring services:

```bash
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml down
```
## Accessing Grafana
- URL: http://localhost:3003
- Username: `admin`
- Password: `admin` (or the value of the `GRAFANA_PASSWORD` env var)
Two data sources are auto-provisioned on first start:
| Data Source | Type | URL |
|---|---|---|
| Prometheus | Metrics | http://prometheus:9090 |
| Loki | Logs | http://loki:3100 |
## Hub Metrics
The hub exposes two metrics endpoints:
```
GET http://localhost:4101/metrics             # JSON snapshot (MetricsService)
GET http://localhost:4101/metrics/prometheus  # Prometheus text exposition format
```
Prometheus scrapes `/metrics/prometheus` (see `monitoring/prometheus.yml`). The snapshot includes request counts, latencies, event bus throughput, job queue depth, cache hits/misses, and service health counters.
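The two endpoints expose the same counters in different formats. As an illustration of the relationship between them, here is a minimal sketch of rendering a JSON snapshot into Prometheus text exposition format; the field names and the `hub_` prefix are hypothetical, not the actual `MetricsService` shape:

```typescript
// Sketch: render a JSON metrics snapshot into Prometheus text exposition
// format. Field names and the "hub_" prefix are illustrative only.
type Snapshot = Record<string, number>;

function toPrometheus(snapshot: Snapshot, prefix = "hub_"): string {
  return (
    Object.entries(snapshot)
      .map(([name, value]) => {
        const metric = `${prefix}${name}`;
        // A TYPE line tells Prometheus how to interpret the series.
        return `# TYPE ${metric} counter\n${metric} ${value}`;
      })
      .join("\n") + "\n"
  );
}

const text = toPrometheus({ request_count: 1042, cache_hits: 980 });
// Each metric appears as "<name> <value>" on its own line.
```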
## NATS Monitoring
NATS exposes several monitoring endpoints on port 8222:
| Endpoint | Description |
|---|---|
| `/varz` | General server information and stats |
| `/connz` | Active connections |
| `/routez` | Route information |
| `/subsz` | Subscription details |
| `/jsz` | JetStream account and stream info |
Prometheus scrapes `/varz` every 15 seconds (global `scrape_interval: 15s` in `monitoring/prometheus.yml`).
## Meilisearch Stats
Meilisearch exposes index and performance statistics:
```
GET http://localhost:7700/stats
```
Prometheus scrapes this endpoint every 15 seconds.
## Structured Logging
The hub uses pino for structured JSON logging. When the monitoring stack is running:
- Promtail reads container logs via the Docker socket
- JSON log lines are forwarded to Loki
- Query logs in Grafana using LogQL:

```logql
{service="hub"} | json | level="error"
{service="hub"} | json | eventType=~"data\\..*"
{container=~".*nats.*"}
```
Filter by service name, log level, event type, or any field in the structured log output.
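Conceptually, the `| json | level="error"` pipeline parses each log line as JSON and keeps entries whose `level` field matches. The sketch below mimics that filtering in TypeScript, assuming the hub configures pino to emit string levels (pino's default is numeric levels unless a level formatter is set):

```typescript
// Sketch: what the LogQL pipeline `| json | level="error"` does conceptually.
// Assumes pino is configured to emit string levels ("info", "error", ...).
const lines = [
  '{"level":"info","service":"hub","msg":"started"}',
  '{"level":"error","service":"hub","eventType":"data.sync","msg":"sync failed"}',
];

const errors = lines
  .map((line) => JSON.parse(line) as Record<string, string>)
  // Keep only entries whose structured `level` field is "error".
  .filter((entry) => entry.level === "error");
```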
## Application-Level Metrics
Beyond infrastructure metrics, the hub exposes application-specific metrics via its services:
### Rate Limiter Metrics (`RateLimiterService`)
The rate limiter tracks per-endpoint request windows:
| Metric | Description |
|---|---|
| `rate_limit_rejections` | Counter: requests rejected by the rate limiter |
Per-policy stats (`policy`, `bucketCount`, `remaining`, `resetsAt`) are returned by `GET /health/detailed` under `rateLimiter`. Configured windows (see `apps/hub/src/main.ts`): `api` (100/min), `webhook` (500/min), `agent` (20/min), `search` (200/min), `admin` (30/min), `auth` (10 per 5 min), `oauth-token` (20/min).
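A windowed limiter like the policies above can be sketched as a fixed-window counter; this is an illustrative model using the `auth` policy's numbers (10 requests per 5 minutes), not the `RateLimiterService` API:

```typescript
// Sketch of a fixed-window rate limiter, modeled on the per-endpoint
// policies above. Names are illustrative, not the RateLimiterService API.
class FixedWindow {
  private windowStart = 0;
  private count = 0;
  constructor(private limit: number, private windowMs: number) {}

  allow(now: number): boolean {
    if (now - this.windowStart >= this.windowMs) {
      // A new window begins: reset the counter.
      this.windowStart = now;
      this.count = 0;
    }
    if (this.count >= this.limit) return false; // would bump rate_limit_rejections
    this.count++;
    return true;
  }
}

// The `auth` policy: 10 requests per 5-minute window.
const auth = new FixedWindow(10, 5 * 60_000);
const now = Date.now();
const results = Array.from({ length: 12 }, () => auth.allow(now));
// First 10 requests pass; the 11th and 12th are rejected.
```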
### Circuit Breaker Status (`CircuitBreakerService`)
Three circuit breakers are registered by default:
| Breaker | Failure Threshold | Timeout | Protects |
|---|---|---|---|
| `nats` | 5 failures | 30s | NATS JetStream connections |
| `meilisearch` | 3 failures | 15s | Search indexing and queries |
| `openrouter` | 5 failures | 60s | LLM API calls |
States: `closed` (healthy), `open` (rejecting), `half_open` (testing recovery). Per-breaker stats are included in the response of `GET /health/detailed` under `circuitBreakers`.
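The three-state transition can be sketched as follows, using the `meilisearch` breaker's numbers (3 failures, 15s timeout); this is a minimal model, not the actual `CircuitBreakerService` implementation:

```typescript
// Minimal circuit breaker sketch matching the states above.
// Thresholds mirror the `meilisearch` breaker (3 failures, 15s timeout).
type State = "closed" | "open" | "half_open";

class Breaker {
  state: State = "closed";
  private failures = 0;
  private openedAt = 0;
  constructor(private threshold: number, private timeoutMs: number) {}

  canCall(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.timeoutMs) {
      this.state = "half_open"; // timeout elapsed: allow one trial request
    }
    return this.state !== "open";
  }

  onFailure(now: number): void {
    this.failures++;
    if (this.state === "half_open" || this.failures >= this.threshold) {
      this.state = "open"; // trip (or re-trip after a failed trial)
      this.openedAt = now;
    }
  }

  onSuccess(): void {
    this.failures = 0;
    this.state = "closed"; // recovery confirmed
  }
}

const breaker = new Breaker(3, 15_000);
[0, 1, 2].forEach((t) => breaker.onFailure(t)); // three failures trip it
const rejectedEarly = !breaker.canCall(5_000);  // still open before timeout
const trialAllowed = breaker.canCall(20_000);   // half_open after 15s elapse
```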
### Cache Hit Rates (`CacheService`)
The in-memory LRU cache (TTL: 5 min, max: 2000 entries) exposes:
| Metric | Description |
|---|---|
| `cache_hits` | Counter: requests served from cache |
| `cache_misses` | Counter: cache misses requiring fresh computation |
Live size and eviction counts are returned by `CacheService.getStats()` and surfaced via `GET /health/detailed` under `cache`.
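How the two counters accumulate can be illustrated with a minimal TTL cache; this sketch uses the configured limits (5-minute TTL, 2000 entries) but is not the actual `CacheService`:

```typescript
// Sketch of an in-memory cache with TTL and hit/miss counters, showing how
// cache_hits / cache_misses accumulate. Not the actual CacheService.
class TtlCache<V> {
  hits = 0;
  misses = 0;
  private store = new Map<string, { value: V; expires: number }>();
  constructor(private ttlMs: number, private maxEntries: number) {}

  get(key: string, now: number): V | undefined {
    const entry = this.store.get(key);
    if (!entry || entry.expires <= now) {
      this.misses++; // absent or expired -> miss
      return undefined;
    }
    this.hits++;
    return entry.value;
  }

  set(key: string, value: V, now: number): void {
    if (this.store.size >= this.maxEntries) {
      // Evict the oldest insertion (Map preserves insertion order).
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.set(key, { value, expires: now + this.ttlMs });
  }
}

const cache = new TtlCache<string>(5 * 60_000, 2000);
cache.set("user:1", "alice", 0);
cache.get("user:1", 1_000);       // hit: within the 5-minute TTL
cache.get("user:1", 6 * 60_000);  // miss: entry expired
```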
## NATS Consumer Lag
Monitor JetStream consumer lag to detect processing bottlenecks:
```bash
# Check consumer lag via the NATS CLI
nats consumer info platform-events hub-consumer

# Or via the NATS monitoring endpoint (quote the URL so the shell
# does not treat "?" as a glob character)
curl "http://localhost:8222/jsz?consumers=true"
```
Key metrics:

- `num_pending`: messages waiting to be delivered
- `num_ack_pending`: delivered but not yet acknowledged
- `num_redelivered`: messages redelivered after timeout
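A simple lag figure can be derived from these fields; the sketch below trims the `/jsz` consumer payload to just the three metrics above, and the backlog formula (pending plus unacked) is one reasonable convention, not an official NATS definition:

```typescript
// Sketch: derive a backlog figure from the JetStream consumer fields above.
// The interface is trimmed to the three key metrics; real /jsz output has more.
interface ConsumerStats {
  num_pending: number;      // not yet delivered
  num_ack_pending: number;  // delivered, awaiting ack
  num_redelivered: number;  // redelivered after ack timeout (a health signal)
}

function backlog(c: ConsumerStats): number {
  // Pending plus unacked messages approximate the consumer's lag.
  return c.num_pending + c.num_ack_pending;
}

const lag = backlog({ num_pending: 120, num_ack_pending: 8, num_redelivered: 2 });
```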
## pg-boss Job Queue Depth
Monitor job queue health:
| Queue | Latency Target | What to Watch |
|---|---|---|
| `realtime` | < 100 ms | Queue depth > 10 = alert |
| `interactive` | < 5 s | Queue depth > 50 = alert |
| `background` | Minutes | Queue depth > 500 = investigate |
pg-boss maintains its own monitoring tables in PostgreSQL. Query directly:

```sql
SELECT name, state, count(*)
FROM pgboss.job
GROUP BY name, state
ORDER BY count(*) DESC;
```
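The thresholds in the table above can be applied mechanically to per-queue depths (for example, the counts returned by the query). This is a hypothetical helper, not part of the hub's codebase:

```typescript
// Sketch: apply the alert thresholds from the table above to per-queue
// depths. The thresholds map mirrors the documented values; the helper
// itself is illustrative, not part of the hub.
const thresholds: Record<string, number> = {
  realtime: 10,
  interactive: 50,
  background: 500,
};

function queuesToAlert(depths: Record<string, number>): string[] {
  return Object.entries(depths)
    // Unknown queues have no threshold and never alert.
    .filter(([queue, depth]) => depth > (thresholds[queue] ?? Infinity))
    .map(([queue]) => queue);
}

const alerts = queuesToAlert({ realtime: 25, interactive: 12, background: 90 });
// Only `realtime` exceeds its threshold (25 > 10).
```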
## ErrorTrackingService
The error tracking service aggregates errors with fingerprinting:
- Dashboard: errors grouped by fingerprint with occurrence count
- Alert on new error types or spikes in existing ones
- Status flow: `open` → `acknowledged` → `resolved`
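Fingerprinting groups occurrences by a stable key derived from the error, typically with variable parts of the message normalized away. The sketch below shows the idea; the actual `ErrorTrackingService` fingerprint scheme may differ:

```typescript
// Sketch of error fingerprinting: derive a stable key from the error type
// and a message with variable parts stripped, so repeated occurrences of
// the same logical error collapse into one group. Illustrative only.
function fingerprint(name: string, message: string): string {
  const normalized = message
    .replace(/\d+/g, "N")          // "timeout after 30s" -> "timeout after Ns"
    .replace(/"[^"]*"/g, '"..."'); // collapse quoted values
  return `${name}:${normalized}`;
}

const counts = new Map<string, number>();
for (const [name, msg] of [
  ["TimeoutError", "timeout after 30s"],
  ["TimeoutError", "timeout after 60s"],
  ["TypeError", "x is undefined"],
] as const) {
  const fp = fingerprint(name, msg);
  counts.set(fp, (counts.get(fp) ?? 0) + 1);
}
// The two timeouts share a fingerprint; the TypeError is its own group.
```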