Daily ops tasks

This page explains how a day on the ops rotation works. In this role, our responsibility is to:

Responsibility	Why
Maintain the service’s integrity	Data must not be corrupted. (“Service” means the entire “Refer and monitor an intervention” service, not just the API.)
Secure the service	Security issues must be fixed immediately. This includes any build failures because of security vulnerabilities.
Keep the service performant	Tolerances (CPU, memory, database storage, query performance) must be monitored. The service must not exceed its pod limits or run out of storage space or network bandwidth.
Reduce alert noise	Only actionable, real problems should create alerts. Everything else should be a dashboard.

Check Grafana for any irregularities.
Check #interventions-alerts for any AlertManager (Kubernetes) errors.
Check #interventions-alerts for any exceptions.
Run kubectl get pods --namespace=hmpps-interventions-prod and check the output for any errors, backoffs, or excessive restarts.
Check the logs for any errors or problems.
🙋 Extend this documentation with more specific tips.
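
The pod check above can be partly automated. The following is a minimal sketch, not an established team tool: it assumes the pod list has been exported as JSON with kubectl get pods --namespace=hmpps-interventions-prod -o json. The function name find_unhealthy_pods and the restart threshold are hypothetical, since this page does not define what counts as "excessive".

```python
import json

# Assumption: this page does not define "excessive restarts", so 5 is a placeholder.
MAX_RESTARTS = 5

# Waiting reasons reported by Kubernetes that usually indicate a backoff or error.
PROBLEM_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}


def find_unhealthy_pods(pods_json: str, max_restarts: int = MAX_RESTARTS) -> list[tuple[str, str]]:
    """Return (pod name, reason) pairs for pods that need a closer look."""
    findings = []
    for pod in json.loads(pods_json).get("items", []):
        name = pod["metadata"]["name"]
        # Pending pods may not have containerStatuses yet, hence the defaults.
        for status in pod.get("status", {}).get("containerStatuses", []):
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") in PROBLEM_REASONS:
                findings.append((name, waiting["reason"]))
            restarts = status.get("restartCount", 0)
            if restarts > max_restarts:
                findings.append((name, f"{restarts} restarts"))
    return findings
```

An empty result only means that none of these particular signals fired; it does not replace checking Grafana, the alerts channel, or the logs.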

If you find any problems, look into why they happened. Use the five whys technique.
Communicate throughout in #interventions-dev: say when you start looking and what you are looking into.

Resolve the issue with one of the outcomes below. Update the thread with your conclusion so that others can find it later.

Decision	Action
Not enough information	Consider what extra information we need to collect. Write new tickets if needed. Prioritise them if urgent.
Intermittent issue	Nothing to worry about right now. Escalate if it happens again.
Serious issue	Establish how users are affected and how many of them are affected. If this is an outage, announce an incident (see below).
Noise	This is noise, not a real issue. Eliminate it by improving the code and integrations. Ensure the problem is visible on a dashboard. Write tickets if necessary.
Undetermined	Collaborate with others to determine an outcome. Extend this guide if necessary.

To announce an incident, post in #interventions that we are looking into it, using this template:

📟 Degraded service

From: {the time it started}

User impact: {explain the user impact in terms of what they cannot do and how many of them}

What: {explain what is happening}

Investigation in progress.
Raise an incident on the status page by selecting “Edit status page”, then “Create incident”.
Proceed to resolve it, and keep the thread and the incident updated with your progress.

When finished, please add the following lines to the original message:

✅ Resolved

Until: {the time it resolved}
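
The two templates above could be filled in by a small helper so the wording stays consistent between incidents. This is only a sketch: the function names degraded_service_message and mark_resolved are hypothetical, and posting the result to #interventions is left out because this page does not describe any tooling for it.

```python
def degraded_service_message(started: str, user_impact: str, what: str) -> str:
    """Build the initial announcement from the template on this page."""
    return "\n\n".join([
        "📟 Degraded service",
        f"From: {started}",
        f"User impact: {user_impact}",
        f"What: {what}",
        "Investigation in progress.",
    ])


def mark_resolved(message: str, resolved_at: str) -> str:
    """Append the resolution lines to the original announcement."""
    return "\n\n".join([message, "✅ Resolved", f"Until: {resolved_at}"])
```

For example, mark_resolved(degraded_service_message("09:40", "...", "..."), "10:15") returns the original announcement with the resolution lines added at the end, matching the instruction to edit the original message rather than post a new one.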

This page was last reviewed on 7 September 2024. It needs to be reviewed again on 7 March 2025 by the page owner #interventions-dev.