Made Open

Monitoring

Made Open ships with an optional monitoring overlay that adds Prometheus, Grafana, and Loki/Promtail for metrics and log aggregation.

Enabling Monitoring

Start the monitoring stack alongside the base services:

# Development
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d

# Production
docker compose -f docker-compose.prod.yml -f docker-compose.monitoring.yml up -d

To stop only the monitoring services:

docker compose -f docker-compose.yml -f docker-compose.monitoring.yml down

Accessing Grafana

Two data sources are auto-provisioned on first start:

Data Source  Type     URL
Prometheus   Metrics  http://prometheus:9090
Loki         Logs     http://loki:3100

Hub Metrics

The hub exposes two metrics endpoints:

GET http://localhost:4101/metrics              # JSON snapshot (MetricsService)
GET http://localhost:4101/metrics/prometheus   # Prometheus text exposition format

Prometheus scrapes /metrics/prometheus (see monitoring/prometheus.yml). The snapshot includes request counts, latencies, event bus throughput, job queue depth, cache hits/misses, and service health counters.
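The text exposition format is line-oriented and easy to inspect by hand. A minimal sketch of parsing it, using hypothetical metric names (the hub's actual metric names are not shown here):

```python
# Minimal parser for the Prometheus text exposition format, as served by
# /metrics/prometheus. Comment lines (# HELP, # TYPE) are skipped; each
# remaining line is "metric_name{labels} value".
def parse_prometheus_text(text: str) -> dict[str, float]:
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")  # split off the trailing value
        samples[name] = float(value)
    return samples

# Hand-written sample; metric names here are illustrative, not the hub's.
example = """\
# HELP http_requests_total Total HTTP requests (hypothetical)
# TYPE http_requests_total counter
http_requests_total{route="/search"} 1042
cache_hits 317
"""
metrics = parse_prometheus_text(example)
```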

NATS Monitoring

NATS exposes several monitoring endpoints on port 8222:

Endpoint  Description
/varz     General server information and stats
/connz    Active connections
/routez   Route information
/subsz    Subscription details
/jsz      JetStream account and stream info

Prometheus scrapes /varz every 15 seconds (global scrape_interval: 15s in monitoring/prometheus.yml).
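A quick way to sanity-check the server outside Grafana is to summarize a /varz payload directly. The sample below is a trimmed, hand-written stand-in for what http://localhost:8222/varz returns; the field names (connections, in_msgs, out_msgs) match NATS's documented varz output:

```python
# Summarize a /varz-style payload from the NATS monitoring port (8222).
import json

def summarize_varz(varz: dict) -> str:
    return (f"{varz['connections']} connections, "
            f"{varz['in_msgs']} msgs in / {varz['out_msgs']} msgs out")

# Trimmed sample payload; a real /varz response has many more fields.
sample = json.loads(
    '{"server_name": "nats-1", "connections": 4, "in_msgs": 1200, "out_msgs": 1180}'
)
summary = summarize_varz(sample)
```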

Meilisearch Stats

Meilisearch exposes index and performance statistics:

GET http://localhost:7700/stats

Prometheus scrapes this endpoint every 15 seconds.

Structured Logging

The hub uses pino for structured JSON logging. When the monitoring stack is running:

  1. Promtail reads container logs via the Docker socket
  2. JSON log lines are forwarded to Loki
  3. Query logs in Grafana using LogQL:
{service="hub"} | json | level="error"
{service="hub"} | json | eventType=~"data\\..*"
{container=~".*nats.*"}

Filter by service name, log level, event type, or any field in the structured log output.
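Outside Grafana, the same level filter can be applied to raw pino output. Note that pino encodes levels numerically (30 = info, 50 = error) unless configured otherwise, so a sketch of the `level="error"` filter looks like:

```python
# Filter pino-style JSON log lines by level, mirroring the LogQL query
# {service="hub"} | json | level="error". Pino's default numeric levels:
import json

LEVELS = {10: "trace", 20: "debug", 30: "info", 40: "warn", 50: "error", 60: "fatal"}

def errors_only(lines: list[str]) -> list[dict]:
    out = []
    for line in lines:
        record = json.loads(line)
        if LEVELS.get(record.get("level")) == "error":
            out.append(record)
    return out

# Hand-written sample log lines in pino's JSON shape.
logs = [
    '{"level": 30, "msg": "request handled", "service": "hub"}',
    '{"level": 50, "msg": "index update failed", "service": "hub"}',
]
errors = errors_only(logs)
```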

Application-Level Metrics

Beyond infrastructure metrics, the hub exposes application-specific metrics via its services:

Rate Limiter Metrics (RateLimiterService)

The rate limiter tracks per-endpoint request windows:

Metric                 Description
rate_limit_rejections  Counter: requests rejected by the rate limiter

Per-policy stats (policy, bucketCount, remaining, resetsAt) are returned by GET /health/detailed under rateLimiter. Configured windows (see apps/hub/src/main.ts): api (100/min), webhook (500/min), agent (20/min), search (200/min), admin (30/min), auth (10 per 5min), oauth-token (20/min).

Circuit Breaker Status (CircuitBreakerService)

Three circuit breakers are registered by default:

Breaker      Failure Threshold  Timeout  Protects
nats         5 failures         30s      NATS JetStream connections
meilisearch  3 failures         15s      Search indexing and queries
openrouter   5 failures         60s      LLM API calls

States: closed (healthy), open (rejecting), half_open (testing recovery). Per-breaker stats are included in the response of GET /health/detailed under circuitBreakers.
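The three-state machine can be sketched as below, using the nats breaker's numbers (5 failures, 30s timeout); the real CircuitBreakerService internals are not shown here:

```python
# Minimal circuit breaker: closed -> open after N failures, open -> half_open
# once the timeout elapses, half_open -> closed on success or back to open
# on failure.
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failures = 0
        self.opened_at = None
        self.state = "closed"

    def allow(self, now: float) -> bool:
        if self.state == "open":
            if now - self.opened_at >= self.timeout:
                self.state = "half_open"  # let one trial request through
                return True
            return False
        return True

    def record_success(self) -> None:
        self.state = "closed"
        self.failures = 0

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now

cb = CircuitBreaker()
for t in range(5):
    cb.record_failure(now=float(t))  # 5 failures trip the breaker
tripped = cb.state
blocked = cb.allow(now=10.0)         # still inside the 30s timeout
half_open = cb.allow(now=40.0) and cb.state == "half_open"
```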

Cache Hit Rates (CacheService)

The in-memory LRU cache (TTL: 5min, max: 2000 entries) exposes:

Metric        Description
cache_hits    Counter: requests served from cache
cache_misses  Counter: cache misses requiring fresh computation

Live size and eviction counts are returned by CacheService.getStats() and surfaced via GET /health/detailed under cache.
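An LRU cache with TTL and hit/miss/eviction counters has roughly this shape; this is an illustrative sketch, not the hub's CacheService implementation:

```python
# Tiny LRU + TTL cache mirroring the documented limits (300s TTL,
# 2000 entries) with the counters surfaced by getStats().
from collections import OrderedDict

class TTLCache:
    def __init__(self, max_entries: int = 2000, ttl: float = 300.0):
        self.max_entries, self.ttl = max_entries, ttl
        self.data = OrderedDict()  # key -> (value, stored_at)
        self.hits = self.misses = self.evictions = 0

    def get(self, key, now: float):
        entry = self.data.get(key)
        if entry is None or now - entry[1] > self.ttl:
            self.misses += 1
            return None
        self.data.move_to_end(key)  # mark as recently used
        self.hits += 1
        return entry[0]

    def set(self, key, value, now: float) -> None:
        self.data[key] = (value, now)
        self.data.move_to_end(key)
        if len(self.data) > self.max_entries:
            self.data.popitem(last=False)  # evict least recently used
            self.evictions += 1

cache = TTLCache()
cache.set("a", 1, now=0.0)
hit = cache.get("a", now=10.0)       # within TTL: counts as a hit
expired = cache.get("a", now=400.0)  # past the 5-minute TTL: a miss
```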

NATS Consumer Lag

Monitor JetStream consumer lag to detect processing bottlenecks:

# Check consumer lag via NATS CLI
nats consumer info platform-events hub-consumer

# Or via NATS monitoring endpoint
curl http://localhost:8222/jsz?consumers=true

Key metrics:

  • num_pending — Messages waiting to be delivered
  • num_ack_pending — Delivered but not yet acknowledged
  • num_redelivered — Messages redelivered after timeout

pg-boss Job Queue Depth

Monitor job queue health:

Queue        Latency Target  What to Watch
realtime     < 100ms         Queue depth > 10 = alert
interactive  < 5s            Queue depth > 50 = alert
background   Minutes         Queue depth > 500 = investigate

pg-boss maintains its own monitoring tables in PostgreSQL. Query directly:

SELECT name, state, count(*) AS count
FROM pgboss.job
GROUP BY name, state
ORDER BY count DESC;
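The thresholds from the table above can be applied to the query results in a small alerting check; the queue depths below are hypothetical sample values:

```python
# Flag queues whose depth exceeds the documented alert thresholds.
THRESHOLDS = {"realtime": 10, "interactive": 50, "background": 500}

def queues_to_flag(depths: dict[str, int]) -> list[str]:
    return [queue for queue, depth in depths.items()
            if depth > THRESHOLDS.get(queue, float("inf"))]

# Sample depths as might come back from the pgboss.job query.
flagged = queues_to_flag({"realtime": 4, "interactive": 72, "background": 120})
```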

ErrorTrackingService

The error tracking service aggregates errors with fingerprinting:

  • Dashboard: errors grouped by fingerprint with occurrence count
  • Alert on new error types or spike in existing ones
  • Status flow: open → acknowledged → resolved
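Fingerprinting collapses errors that differ only in variable details (IDs, counts) into one group. A sketch of the idea, not the ErrorTrackingService's actual algorithm:

```python
# Group errors by a fingerprint of (type, normalized message), so the same
# failure with different request IDs counts as one error type.
import hashlib
import re
from collections import Counter

def fingerprint(error_type: str, message: str) -> str:
    normalized = re.sub(r"\d+", "<n>", message)  # strip variable numbers
    digest = hashlib.sha256(f"{error_type}:{normalized}".encode())
    return digest.hexdigest()[:12]

# Hypothetical errors: the two timeouts share a fingerprint.
errors = [
    ("TimeoutError", "request 42 timed out"),
    ("TimeoutError", "request 97 timed out"),
    ("ValueError", "bad payload"),
]
counts = Counter(fingerprint(t, m) for t, m in errors)
distinct_groups = len(counts)
```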