# Monitoring
Made Open ships with an optional monitoring overlay that adds Prometheus, Grafana, and Loki/Promtail for metrics and log aggregation.
## Enabling Monitoring
Start the monitoring stack alongside the base services:

```bash
# Development
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d

# Production
docker compose -f docker-compose.prod.yml -f docker-compose.monitoring.yml up -d
```
To stop only the monitoring services:

```bash
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml down
```
## Accessing Grafana
- URL: http://localhost:3003
- Username: `admin`
- Password: `admin` (or the value of the `GRAFANA_PASSWORD` env var)
Two data sources are auto-provisioned on first start:
| Data Source | Type | URL |
|---|---|---|
| Prometheus | Metrics | http://prometheus:9090 |
| Loki | Logs | http://loki:3100 |
## Hub Metrics
The hub exposes two metrics endpoints:
```
GET http://localhost:4101/metrics             # JSON snapshot (MetricsService)
GET http://localhost:4101/metrics/prometheus  # Prometheus text exposition format
```
Prometheus scrapes `/metrics/prometheus` (see `monitoring/prometheus.yml`). The snapshot includes request counts, latencies, event bus throughput, job queue depth, cache hits/misses, and service health counters.
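The two endpoints expose the same counters in different formats. As an illustration of the relationship between them, here is a minimal sketch of rendering a JSON snapshot into Prometheus text exposition format; the field names and the `hub_` prefix are hypothetical, not the actual `MetricsService` shape:

```typescript
// Sketch: render a JSON metrics snapshot into Prometheus text exposition
// format. Field names and the "hub_" prefix are illustrative only.
type Snapshot = Record<string, number>;

function toPrometheus(snapshot: Snapshot, prefix = "hub_"): string {
  return (
    Object.entries(snapshot)
      .map(([name, value]) => {
        const metric = `${prefix}${name}`;
        // A TYPE line tells Prometheus how to interpret the series.
        return `# TYPE ${metric} counter\n${metric} ${value}`;
      })
      .join("\n") + "\n"
  );
}

const text = toPrometheus({ request_count: 1042, cache_hits: 980 });
// Each metric appears as "<name> <value>" on its own line.
```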
## NATS Monitoring
NATS exposes several monitoring endpoints on port 8222:
| Endpoint | Description |
|---|---|
| `/varz` | General server information and stats |
| `/connz` | Active connections |
| `/routez` | Route information |
| `/subsz` | Subscription details |
| `/jsz` | JetStream account and stream info |
Prometheus scrapes `/varz` every 15 seconds (global `scrape_interval: 15s` in `monitoring/prometheus.yml`).
## Meilisearch Stats
Meilisearch exposes index and performance statistics:
```
GET http://localhost:7700/stats
```
Prometheus scrapes this endpoint every 15 seconds.
## Structured Logging
The hub uses pino for structured JSON logging. When the monitoring stack is running:
- Promtail reads container logs via the Docker socket
- JSON log lines are forwarded to Loki
- Query logs in Grafana using LogQL:

```logql
{service="hub"} | json | level="error"
{service="hub"} | json | eventType=~"data\\..*"
{container=~".*nats.*"}
```
Filter by service name, log level, event type, or any field in the structured log output.
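Conceptually, the `| json | level="error"` pipeline parses each log line as JSON and keeps entries whose `level` field matches. The sketch below mimics that filtering in TypeScript, assuming the hub configures pino to emit string levels (pino's default is numeric levels unless a level formatter is set):

```typescript
// Sketch: what the LogQL pipeline `| json | level="error"` does conceptually.
// Assumes pino is configured to emit string levels ("info", "error", ...).
const lines = [
  '{"level":"info","service":"hub","msg":"started"}',
  '{"level":"error","service":"hub","eventType":"data.sync","msg":"sync failed"}',
];

const errors = lines
  .map((line) => JSON.parse(line) as Record<string, string>)
  // Keep only entries whose structured `level` field is "error".
  .filter((entry) => entry.level === "error");
```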
## Application-Level Metrics
Beyond infrastructure metrics, the hub exposes application-specific metrics via its services:
### Rate Limiter Metrics (`RateLimiterService`)
The rate limiter tracks per-endpoint request windows:
| Metric | Description |
|---|---|
| `rate_limit_rejections` | Counter: requests rejected by the rate limiter |
Per-policy stats (`policy`, `bucketCount`, `remaining`, `resetsAt`) are returned by `GET /health/detailed` under `rateLimiter`. Configured windows (see `apps/hub/src/main.ts`): `api` (100/min), `webhook` (500/min), `agent` (20/min), `search` (200/min), `admin` (30/min), `auth` (10 per 5 min), `oauth-token` (20/min).
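A windowed limiter like the policies above can be sketched as a fixed-window counter; this is an illustrative model using the `auth` policy's numbers (10 requests per 5 minutes), not the `RateLimiterService` API:

```typescript
// Sketch of a fixed-window rate limiter, modeled on the per-endpoint
// policies above. Names are illustrative, not the RateLimiterService API.
class FixedWindow {
  private windowStart = 0;
  private count = 0;
  constructor(private limit: number, private windowMs: number) {}

  allow(now: number): boolean {
    if (now - this.windowStart >= this.windowMs) {
      // A new window begins: reset the counter.
      this.windowStart = now;
      this.count = 0;
    }
    if (this.count >= this.limit) return false; // would bump rate_limit_rejections
    this.count++;
    return true;
  }
}

// The `auth` policy: 10 requests per 5-minute window.
const auth = new FixedWindow(10, 5 * 60_000);
const now = Date.now();
const results = Array.from({ length: 12 }, () => auth.allow(now));
// First 10 requests pass; the 11th and 12th are rejected.
```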
### Circuit Breaker Status (`CircuitBreakerService`)
Three circuit breakers are registered by default:
| Breaker | Failure Threshold | Timeout | Protects |
|---|---|---|---|
| `nats` | 5 failures | 30s | NATS JetStream connections |
| `meilisearch` | 3 failures | 15s | Search indexing and queries |
| `openrouter` | 5 failures | 60s | LLM API calls |
States: `closed` (healthy), `open` (rejecting), `half_open` (testing recovery). Per-breaker stats are included in the response of `GET /health/detailed` under `circuitBreakers`.
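The three-state transition can be sketched as follows, using the `meilisearch` breaker's numbers (3 failures, 15s timeout); this is a minimal model, not the actual `CircuitBreakerService` implementation:

```typescript
// Minimal circuit breaker sketch matching the states above.
// Thresholds mirror the `meilisearch` breaker (3 failures, 15s timeout).
type State = "closed" | "open" | "half_open";

class Breaker {
  state: State = "closed";
  private failures = 0;
  private openedAt = 0;
  constructor(private threshold: number, private timeoutMs: number) {}

  canCall(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.timeoutMs) {
      this.state = "half_open"; // timeout elapsed: allow one trial request
    }
    return this.state !== "open";
  }

  onFailure(now: number): void {
    this.failures++;
    if (this.state === "half_open" || this.failures >= this.threshold) {
      this.state = "open"; // trip (or re-trip after a failed trial)
      this.openedAt = now;
    }
  }

  onSuccess(): void {
    this.failures = 0;
    this.state = "closed"; // recovery confirmed
  }
}

const breaker = new Breaker(3, 15_000);
[0, 1, 2].forEach((t) => breaker.onFailure(t)); // three failures trip it
const rejectedEarly = !breaker.canCall(5_000);  // still open before timeout
const trialAllowed = breaker.canCall(20_000);   // half_open after 15s elapse
```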
### Cache Hit Rates (`CacheService`)
The in-memory LRU cache (TTL: 5 min, max: 2000 entries) exposes:
| Metric | Description |
|---|---|
| `cache_hits` | Counter: requests served from cache |
| `cache_misses` | Counter: cache misses requiring fresh computation |
Live size and eviction counts are returned by `CacheService.getStats()` and surfaced via `GET /health/detailed` under `cache`.
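How the two counters accumulate can be illustrated with a minimal TTL cache; this sketch uses the configured limits (5-minute TTL, 2000 entries) but is not the actual `CacheService`:

```typescript
// Sketch of an in-memory cache with TTL and hit/miss counters, showing how
// cache_hits / cache_misses accumulate. Not the actual CacheService.
class TtlCache<V> {
  hits = 0;
  misses = 0;
  private store = new Map<string, { value: V; expires: number }>();
  constructor(private ttlMs: number, private maxEntries: number) {}

  get(key: string, now: number): V | undefined {
    const entry = this.store.get(key);
    if (!entry || entry.expires <= now) {
      this.misses++; // absent or expired -> miss
      return undefined;
    }
    this.hits++;
    return entry.value;
  }

  set(key: string, value: V, now: number): void {
    if (this.store.size >= this.maxEntries) {
      // Evict the oldest insertion (Map preserves insertion order).
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.set(key, { value, expires: now + this.ttlMs });
  }
}

const cache = new TtlCache<string>(5 * 60_000, 2000);
cache.set("user:1", "alice", 0);
cache.get("user:1", 1_000);       // hit: within the 5-minute TTL
cache.get("user:1", 6 * 60_000);  // miss: entry expired
```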
## NATS Consumer Lag
Monitor JetStream consumer lag to detect processing bottlenecks:
```bash
# Check consumer lag via the NATS CLI
nats consumer info platform-events hub-consumer

# Or via the NATS monitoring endpoint (quote the URL so the shell
# does not treat "?" as a glob character)
curl "http://localhost:8222/jsz?consumers=true"
```
Key metrics:

- `num_pending`: messages waiting to be delivered
- `num_ack_pending`: delivered but not yet acknowledged
- `num_redelivered`: messages redelivered after timeout
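A simple lag figure can be derived from these fields; the sketch below trims the `/jsz` consumer payload to just the three metrics above, and the backlog formula (pending plus unacked) is one reasonable convention, not an official NATS definition:

```typescript
// Sketch: derive a backlog figure from the JetStream consumer fields above.
// The interface is trimmed to the three key metrics; real /jsz output has more.
interface ConsumerStats {
  num_pending: number;      // not yet delivered
  num_ack_pending: number;  // delivered, awaiting ack
  num_redelivered: number;  // redelivered after ack timeout (a health signal)
}

function backlog(c: ConsumerStats): number {
  // Pending plus unacked messages approximate the consumer's lag.
  return c.num_pending + c.num_ack_pending;
}

const lag = backlog({ num_pending: 120, num_ack_pending: 8, num_redelivered: 2 });
```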
## pg-boss Job Queue Depth
Monitor job queue health:
| Queue | Latency Target | What to Watch |
|---|---|---|
| `realtime` | < 100 ms | Queue depth > 10 = alert |
| `interactive` | < 5 s | Queue depth > 50 = alert |
| `background` | Minutes | Queue depth > 500 = investigate |
pg-boss maintains its own monitoring tables in PostgreSQL. Query directly:

```sql
SELECT name, state, count(*)
FROM pgboss.job
GROUP BY name, state
ORDER BY count(*) DESC;
```
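The thresholds in the table above can be applied mechanically to per-queue depths (for example, the counts returned by the query). This is a hypothetical helper, not part of the hub's codebase:

```typescript
// Sketch: apply the alert thresholds from the table above to per-queue
// depths. The thresholds map mirrors the documented values; the helper
// itself is illustrative, not part of the hub.
const thresholds: Record<string, number> = {
  realtime: 10,
  interactive: 50,
  background: 500,
};

function queuesToAlert(depths: Record<string, number>): string[] {
  return Object.entries(depths)
    // Unknown queues have no threshold and never alert.
    .filter(([queue, depth]) => depth > (thresholds[queue] ?? Infinity))
    .map(([queue]) => queue);
}

const alerts = queuesToAlert({ realtime: 25, interactive: 12, background: 90 });
// Only `realtime` exceeds its threshold (25 > 10).
```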
## ErrorTrackingService
The error tracking service aggregates errors with fingerprinting:
- Dashboard: errors grouped by fingerprint with occurrence count
- Alert on new error types or spikes in existing ones
- Status flow: `open` → `acknowledged` → `resolved`
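Fingerprinting groups occurrences by a stable key derived from the error, typically with variable parts of the message normalized away. The sketch below shows the idea; the actual `ErrorTrackingService` fingerprint scheme may differ:

```typescript
// Sketch of error fingerprinting: derive a stable key from the error type
// and a message with variable parts stripped, so repeated occurrences of
// the same logical error collapse into one group. Illustrative only.
function fingerprint(name: string, message: string): string {
  const normalized = message
    .replace(/\d+/g, "N")          // "timeout after 30s" -> "timeout after Ns"
    .replace(/"[^"]*"/g, '"..."'); // collapse quoted values
  return `${name}:${normalized}`;
}

const counts = new Map<string, number>();
for (const [name, msg] of [
  ["TimeoutError", "timeout after 30s"],
  ["TimeoutError", "timeout after 60s"],
  ["TypeError", "x is undefined"],
] as const) {
  const fp = fingerprint(name, msg);
  counts.set(fp, (counts.get(fp) ?? 0) + 1);
}
// The two timeouts share a fingerprint; the TypeError is its own group.
```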