Monitoring

Uptime

Monitors and notifies us when a component no longer appears to be working.

We define Pingdom uptime alerts in our Cloud Platform namespace. These alerts are routed to #interventions.
We deploy uptime-kuma via our operational scripts. Its alerts are routed to #interventions-alerts. To read the login credentials, run: cloud-platform decode-secret -s uptime-monitor-login -n hmpps-interventions-prod

Captures and notifies us about unexpected errors in the running applications.

Infrastructure metrics monitor resource usage: CPU, memory, bandwidth, latency, storage space, etc.

We use Cloud Platform’s observability stack.

Prometheus allows us to query live metrics directly.
Thanos allows us to query historical metrics.
We have our own Grafana dashboard, defined here.
Alert definitions allow us to configure AlertManager alerts on any metric available in Prometheus. These alerts go to #interventions-alerts.
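As a rough illustration of querying live metrics directly, the sketch below builds a Prometheus HTTP API instant query (GET /api/v1/query) and parses the standard JSON response. The base URL and the sample response are placeholders, and the choice of the pod label is an assumption made for illustration, not something taken from our configuration.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL: replace with the Prometheus address for the cluster.
PROMETHEUS_URL = "https://prometheus.example.invalid"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(response_body: str) -> dict[str, float]:
    """Map each series' pod label to its sample value from an instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each series carries its labels in "metric" and a [timestamp, "value"] pair.
    return {
        series["metric"].get("pod", "<no pod label>"): float(series["value"][1])
        for series in data["result"]
    }


# Made-up response body in the shape the Prometheus HTTP API returns.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"pod": "api-1"}, "value": [1700000000, "0.25"]}],
    },
})

url = instant_query_url(PROMETHEUS_URL, "up")
values = parse_vector(SAMPLE_RESPONSE)
```

In practice the URL would be fetched with an authenticated HTTP client and the response body passed to parse_vector. The same query expression can be reused in an alert definition once it behaves as expected.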

Monitoring the health of services that we do not operate.

Request access to AppInsights here.

Azure Application Insights supports cross-application request tracing.
uptime-kuma monitors the availability of our dependencies.
The Grafana dashboard has panels for the 99th and 100th percentile (maximum) response times of calls from the API to other services.
The dependency fail AppInsights chart tracks how often we fail to receive a response from a dependency.
The dependency spread AppInsights query tracks which endpoint fails with which dependency.
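To make the percentile panels mentioned above concrete, here is a minimal sketch of what the 99th and 100th percentile mean for a set of response times. It uses the nearest-rank definition on made-up sample data; the dashboard's own calculation may differ (for example, it may estimate percentiles from histogram buckets).

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 gives the 1-based rank without float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical response times in milliseconds for 200 calls to a dependency.
response_times_ms = [20.0] * 197 + [150.0, 400.0, 2500.0]

p99 = percentile(response_times_ms, 99)    # 150.0: 99% of calls took this long or less
p100 = percentile(response_times_ms, 100)  # 2500.0: the single slowest call
```

The gap between the two values is the point of showing both: the 99th percentile describes what almost every call experiences, while the 100th percentile exposes rare outliers such as timeouts.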

Application metrics focus on throughput and performance, while business metrics focus on significant user behaviour (e.g. pivotal events).

Spring Boot can expose business metrics via the micrometer library (example). These will be available in both Azure Application Insights and Prometheus.

Request access to AppInsights here.

Warning: hmpps-interventions-ui does not (yet) expose metrics to Prometheus.

Azure Application Insights also ingests all Node and Java/Kotlin application metrics.
Prometheus allows us to query live metrics directly.
Thanos allows us to query historical metrics.
The Grafana dashboard has panels for waiting database transactions, the slowest API requests, and garbage collection time.
Alert definitions allow us to configure AlertManager on application metric anomalies.

All our applications log to stdout in containers, which are then centrally collected.

Kibana pod logs contain logs from applications, one-off jobs, and everything that runs in pods.
Kibana ingress logs contain all HTTP requests and responses received by the ingress in front of our applications. These logs also include modsecurity (Web Application Firewall) entries, which help when investigating blocked requests (HTTP 423 status).

This page was last reviewed on 7 September 2024. It needs to be reviewed again on 7 March 2025 by the page owner #interventions-dev.
