Monitoring & Observability

What metrics do you collect and how is the platform monitored?

Every node in the cluster runs a Prometheus node exporter and a Datadog agent. The kube-prometheus-stack continuously collects:

  • Node-level: CPU, memory, disk I/O, network throughput
  • Kubernetes: pod health, deployment status, PVC usage, API server latency, kubelet metrics
  • Application (per dyno): CPU and memory usage as a percentage of allocated limits, averaged over configurable rolling windows
  • Addon-level:
    • PostgreSQL: active connections, transaction commits, cache hit ratio, database size
    • Elasticsearch: query rate, fetch time, indexing time, JVM memory used, cluster health, document count
    • Redis: memory usage, connections
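
As an illustration of the per-dyno figures above, here is a minimal Python sketch of usage measured as a percentage of an allocated limit over a rolling window. It is not the platform's implementation. The class name `RollingUtilization`, the sample-count window, and the use of a simple mean are all assumptions made for the example.

```python
from collections import deque


class RollingUtilization:
    """Mean usage as a percentage of an allocated limit over the last N samples.

    Hypothetical helper: the platform's real windowing and aggregation
    are not documented here.
    """

    def __init__(self, limit, window_size):
        if limit <= 0 or window_size <= 0:
            raise ValueError("limit and window_size must be positive")
        self.limit = limit
        # A deque with maxlen silently drops the oldest sample when full.
        self.samples = deque(maxlen=window_size)

    def add(self, usage):
        """Record one usage sample, in the same unit as the limit."""
        self.samples.append(usage)

    def percent(self):
        """Return mean usage in the window as a percentage of the limit."""
        if not self.samples:
            return 0.0
        mean_usage = sum(self.samples) / len(self.samples)
        return 100.0 * mean_usage / self.limit


# Example: a dyno with a 512 MB memory limit and a 3-sample window.
mem = RollingUtilization(limit=512, window_size=3)
for sample_mb in (128, 256, 384, 512):
    mem.add(sample_mb)
print(mem.percent())  # 75.0 (the window holds 256, 384, 512)
```

A longer window smooths out short spikes, while a shorter one reacts faster to sustained load.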

Grafana dashboards provide real-time visibility into all of the above.

How does alerting work?

Honeybadger monitors application errors and infrastructure health. Alerts are routed to PagerDuty (for on-call paging) and Slack (for team visibility). The Deploy team is on-call for infrastructure incidents.
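
The routing described above can be pictured with a short Python sketch. The rule it uses, that every alert goes to Slack and only critical ones also page through PagerDuty, is an assumption made for illustration. The `Alert` type, the severity values, and the `route` function are likewise hypothetical and are not part of any platform API.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    source: str    # e.g. "honeybadger"
    summary: str
    severity: str  # assumed values: "info", "warning", "critical"


def route(alert):
    """Return the destinations for an alert under the assumed rule."""
    destinations = ["slack"]              # team visibility for every alert
    if alert.severity == "critical":
        destinations.append("pagerduty")  # page the on-call engineer
    return destinations


print(route(Alert("honeybadger", "Elevated 5xx rate", "warning")))
print(route(Alert("honeybadger", "Node not ready", "critical")))
```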

Users can also set their own consumption alerts, with configurable CPU, memory, and storage thresholds per app or addon. Notifications are sent by email, Slack, or PagerDuty at frequencies from every 5 minutes to weekly.
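
To make the behaviour of a consumption alert concrete, here is a hedged Python sketch of one alert rule. It assumes thresholds are expressed as percentages and that the frequency acts as a minimum gap between notifications. The `ConsumptionAlert` class and its field names are hypothetical and do not reflect the platform's actual configuration format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MIN_FREQUENCY = timedelta(minutes=5)   # "every 5 minutes"
MAX_FREQUENCY = timedelta(weeks=1)     # "weekly"
RESOURCES = {"cpu", "memory", "storage"}
CHANNELS = {"email", "slack", "pagerduty"}


@dataclass
class ConsumptionAlert:
    target: str                # hypothetical app or addon name
    resource: str              # one of RESOURCES
    threshold_percent: float   # assumption: threshold is a percentage
    channel: str               # one of CHANNELS
    frequency: timedelta       # minimum gap between notifications
    last_sent: Optional[datetime] = None

    def __post_init__(self):
        if self.resource not in RESOURCES:
            raise ValueError(f"unknown resource: {self.resource}")
        if self.channel not in CHANNELS:
            raise ValueError(f"unknown channel: {self.channel}")
        if not MIN_FREQUENCY <= self.frequency <= MAX_FREQUENCY:
            raise ValueError("frequency must be between 5 minutes and 1 week")

    def should_notify(self, usage_percent, now):
        """Return True (and record the send time) if a notification is due."""
        if usage_percent < self.threshold_percent:
            return False
        if self.last_sent is not None and now - self.last_sent < self.frequency:
            return False  # still inside the quiet period
        self.last_sent = now
        return True


alert = ConsumptionAlert("my-app", "memory", 80.0, "slack", timedelta(minutes=5))
start = datetime(2024, 1, 1, 12, 0)
print(alert.should_notify(85.0, start))                         # True
print(alert.should_notify(90.0, start + timedelta(minutes=2)))  # False
```

In this sketch, a shorter frequency gives faster feedback on sustained overuse, while a longer one reduces notification noise.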