Daily ops tasks
This page explains how a day on the ops rotation works. In this role, our responsibility is to:
Responsibility | Why |
---|---|
Keep the service’s integrity | Data cannot be corrupted. (“Service” is the entire “Refer and monitor an intervention”, not just the API.) |
Secure the service | Security issues must be fixed immediately. This includes any build failures because of security vulnerabilities. |
Keep the service performant | Tolerances (CPU/memory/database storage/query performance) must be monitored. The service cannot exceed pod limits or run out of storage space or network bandwidth. |
Reduce alert noise | Only actionable, real problems should create alerts. Everything else should be a dashboard. |
Checklist
Check Grafana for any irregularities.
Check #interventions-alerts for any AlertManager (Kubernetes) errors.
Check #interventions-alerts for any exceptions.
Check `kubectl get pods --namespace=hmpps-interventions-prod` for any errors, backoffs, or excessive restarts.
Check the logs for any errors or problems.
🙋 Extend this documentation with more specific tips.
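The pod check in the list above can be partly automated. The following is a minimal sketch, not an agreed team tool: the function names, the restart threshold of 5, and the set of waiting reasons are assumptions chosen for illustration, since this page does not define what counts as "excessive". It reads the JSON that `kubectl get pods -o json` produces and lists pods that look unhealthy.

```python
import json
import subprocess

# Assumed threshold: this page only says "excessive restarts".
RESTART_THRESHOLD = 5

# Assumed set of waiting reasons that usually mean a pod is stuck in a failure loop.
BAD_WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}


def fetch_pods_json(namespace="hmpps-interventions-prod"):
    """Run kubectl and return the raw JSON pod list for the namespace."""
    result = subprocess.run(
        ["kubectl", "get", "pods", f"--namespace={namespace}", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def find_problem_pods(pod_list_json, restart_threshold=RESTART_THRESHOLD):
    """Return (pod_name, problem) pairs for pods that need a closer look."""
    problems = []
    for pod in json.loads(pod_list_json).get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})

        # A pod-level phase of Failed or Unknown is always worth checking.
        phase = status.get("phase")
        if phase in ("Failed", "Unknown"):
            problems.append((name, f"phase is {phase}"))

        for container in status.get("containerStatuses", []):
            # Backoffs show up as a "waiting" state with a reason.
            waiting = container.get("state", {}).get("waiting") or {}
            reason = waiting.get("reason")
            if reason in BAD_WAITING_REASONS:
                problems.append((name, f"{container['name']} is waiting: {reason}"))

            # Restart counts at or above the threshold are reported.
            restarts = container.get("restartCount", 0)
            if restarts >= restart_threshold:
                problems.append((name, f"{container['name']} restarted {restarts} times"))
    return problems
```

To use it, call `find_problem_pods(fetch_pods_json())` and print each pair. An empty list only means none of these specific checks fired, so the logs still need a look.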
When we find anomalies
Look into why they happened. Use five whys.
Communicate throughout in #interventions-dev: say when you start looking and what you are looking into.
Resolve the issue with one of the outcomes below. Please update the thread with your conclusion so others can find it later.
Decision | Action |
---|---|
Not enough information | Consider what extra information we need to collect. Write new tickets if needed. Prioritise them if urgent. |
Intermittent issue | Nothing to worry about right now. Escalate if it occurs again. |
Serious issue | Establish how users are impacted and how many of them. If this is an outage, announce an incident (see below). |
Noise | Not an issue, noise. Eliminate noise by improving the code and integrations. Ensure the problem is present on a dashboard. Write tickets if necessary. |
Undetermined | Collaborate with others to determine an outcome. Extend this guide if necessary. |
When we have a user-impacting serious incident
Announce that we are looking into it in #interventions with this template:
📟 Degraded service
From: {the time it started}
User impact: {explain the user impact in terms of what they cannot do and how many of them}
What: {explain what is happening}
Investigation in progress.
Raise an incident on the status page by selecting “Edit status page”, then “Create incident”.
Proceed to solve it and please keep the thread and incident updated with the progress.
When finished, please add the following lines to the original message:
✅ Resolved
Until: {the time it resolved}
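If you would rather generate the announcement than copy it by hand, the template above can be filled in with a small helper. This is only a sketch: the function name `degraded_service_message` and its parameter names are hypothetical, and it does nothing beyond substituting values into the template text from this page.

```python
def degraded_service_message(started, user_impact, what, resolved=None):
    """Build the incident announcement; pass `resolved` to append the resolution lines."""
    lines = [
        "📟 Degraded service",
        f"From: {started}",
        f"User impact: {user_impact}",
        f"What: {what}",
        "Investigation in progress.",
    ]
    # The resolution lines are added to the original message once the incident is over.
    if resolved is not None:
        lines += ["✅ Resolved", f"Until: {resolved}"]
    return "\n".join(lines)
```

Calling it without `resolved` gives the opening announcement. Calling it again with `resolved` set gives the text to paste over the original message when the incident is closed.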