
007 - Use Prometheus and Grafana for metrics and alerting

Date: 2020-07-07

Status

✅ Accepted

Context

We need to ensure that PTTP systems (e.g. DHCP and DNS) and the networking equipment carrying those services are functioning and healthy.

Metrics and alerts for these services are available, but they are spread across multiple consoles, making it hard for support engineers to get an overall view of the health of PTTP.

As more services from PTTP come online, we need a flexible monitoring solution which can consolidate the metric data.

Update October 2021: Amazon Managed Service for Prometheus and Grafana are available. A ticket has been created to investigate here.

Update 7th January 2022: The IMA infrastructure can be moved to Cloud Platform, but this requires the following issue (see issue here) to be resolved.

Decision

Use Prometheus for metrics and Grafana for visualisation and alerting.

- Aligned with wider MoJ teams.
- Prometheus is lightweight, uses pull rather than push, can be containerised and can be run from a development machine.
- Prometheus Exporters allow collection of metrics from network devices using SNMP, as well as from the many applications that expose metrics natively.
- Grafana can visualise a wide variety of data sources.
- Grafana can send notifications when a custom metric crosses a threshold. It can be easily integrated with Slack (and with ServiceNow when available).
- Can be deployed through our existing CI/CD pipelines used for DHCP/DNS.
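To illustrate the pull model mentioned in the decision, here is a minimal sketch of an exporter-style endpoint written with only the Python standard library. It is not the PTTP implementation: the metric name `pttp_service_up`, the `service` label, the hard-coded sample values and the port are all hypothetical, and a real exporter would more likely use an existing Prometheus client library or an off-the-shelf exporter. The sketch only shows the shape of the interaction: the service exposes its current values in the Prometheus text format, and Prometheus fetches them on its own schedule.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render {(metric_name, service): value} in the Prometheus text format."""
    lines = []
    declared = set()
    for (name, service), value in sorted(samples.items()):
        if name not in declared:
            # Declare the metric type once per metric name.
            lines.append(f"# TYPE {name} gauge")
            declared.add(name)
        lines.append(f'{name}{{service="{service}"}} {value}')
    return "\n".join(lines) + "\n"


def collect_samples():
    """Hypothetical health values; a real exporter would probe the services."""
    return {
        ("pttp_service_up", "dhcp"): 1,
        ("pttp_service_up", "dns"): 1,
    }


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes this path; nothing is pushed by the service.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect_samples()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 8000 is an arbitrary choice for local experimentation.
    HTTPServer(("127.0.0.1", 8000), MetricsHandler).serve_forever()
```

Running the script and requesting `http://127.0.0.1:8000/metrics` returns the current samples as plain text; a Prometheus server configured to scrape that address would collect the same output at each scrape interval.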

This page was last reviewed on 15 April 2024. It needs to be reviewed again on 15 October 2024 by the page owner #nvvs-devops.