Daily ops tasks

This page explains how a day on the ops rotation works. In this role, our responsibilities are:

Responsibility | Why
--- | ---
Keep the service’s integrity | Data must not be corrupted. (“Service” means the entire “Refer and monitor an intervention” service, not just the API.)
Secure the service | Security issues must be fixed immediately. This includes any build failures caused by security vulnerabilities.
Keep the service performant | Tolerances (CPU, memory, database storage, query performance) must be monitored. The service must not hit its pod limits or run out of storage space or network bandwidth.
Reduce alert noise | Only real, actionable problems should create alerts. Everything else belongs on a dashboard.

Checklist

  1. Check Grafana for any irregularities.

  2. Check #interventions-alerts for any AlertManager (Kubernetes) errors.

  3. Check #interventions-alerts for any exceptions.

  4. Run kubectl get pods --namespace=hmpps-interventions-prod and check the output for any errors, backoffs, or excessive restarts.

  5. Check the logs for any errors or problems.

  6. 🙋 Extend this documentation with more specific tips.
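Step 4 can be partly scripted. The following is a minimal sketch, not the team's tooling: flag_unhealthy_pods is a hypothetical helper that reads the default table output of kubectl get pods on standard input and prints pods whose status is neither Running nor Completed, or whose restart count exceeds a threshold. The default threshold of 5 is an assumption, since this page does not define "excessive". The helper only looks at the STATUS and RESTARTS columns, so it does not replace reading the full output.

```shell
#!/bin/sh
# Hypothetical helper (not part of the team's tooling).
# Reads `kubectl get pods` table output on stdin and prints pods worth a look.
# Assumes the default columns (NAME READY STATUS RESTARTS AGE) and a header
# row; do not combine it with --no-headers, or the first pod is skipped.
flag_unhealthy_pods() {
  # $1: restart threshold. The default of 5 is an assumption, not a team rule.
  awk -v max="${1:-5}" '
    NR == 1 { next }                                  # skip the header row
    $3 != "Running" && $3 != "Completed" {            # e.g. CrashLoopBackOff
      print $1, "status=" $3
      next
    }
    $4 + 0 > max + 0 { print $1, "restarts=" $4 }     # too many restarts
  '
}

# Usage with the namespace named in the checklist:
# kubectl get pods --namespace=hmpps-interventions-prod | flag_unhealthy_pods 5
```

An empty result means no pod matched these two rules. It does not mean the service is healthy, so the remaining checklist steps still apply.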

When we find anomalies

  1. Look into why they happened. Use five whys.

  2. Communicate throughout in #interventions-dev: say when you start looking and what you are looking into.

  3. Resolve the issue with one of the outcomes below. Please update the thread with your conclusion so others can find it later.

    Decision | Action
    --- | ---
    Not enough information | Consider what extra information we need to collect. Write new tickets if needed. Prioritise them if urgent.
    Intermittent issue | Nothing to worry about right now. Escalate if it happens again.
    Serious issue | Establish how users are impacted and how many of them. If this is an outage, announce an incident (see below).
    Noise | Not a real issue. Eliminate the noise by improving the code and integrations. Ensure the problem is visible on a dashboard. Write tickets if necessary.
    Undetermined | Collaborate with others to determine an outcome. Extend this guide if necessary.

When we have a user-impacting serious incident

  1. Announce that we are looking into it in #interventions with this template:

    📟 Degraded service

    From: {the time it started}

    User impact: {explain the user impact in terms of what they cannot do and how many of them}

    What: {explain what is happening}

    Investigation in progress.

  2. Raise an incident on the status page by selecting “Edit status page”, then “Create incident”.

  3. Work on resolving it, and keep the thread and the incident updated with your progress.

    When finished, please add the following lines to the original message:

    ✅ Resolved

    Until: {the time it resolved}
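The two message templates above can be filled in from the command line to avoid retyping them under pressure. This is a minimal sketch under stated assumptions: degraded_service_message and resolved_footer are hypothetical helper names, not existing tooling, and they only print text for you to paste into #interventions.

```shell
#!/bin/sh
# Hypothetical helper: fills in the "Degraded service" template from this page.
# Arguments: $1 = time it started, $2 = user impact, $3 = what is happening.
degraded_service_message() {
  printf '📟 Degraded service\n\nFrom: %s\n\nUser impact: %s\n\nWhat: %s\n\nInvestigation in progress.\n' \
    "$1" "$2" "$3"
}

# Hypothetical helper: the lines to add to the original message once resolved.
# Argument: $1 = time the incident resolved.
resolved_footer() {
  printf '✅ Resolved\n\nUntil: %s\n' "$1"
}

# Usage (all values are illustrative placeholders, not real incident data):
# degraded_service_message '09:15' 'About 40 users cannot submit referrals' 'API returns 500s'
# resolved_footer '10:02'
```

Passing the values as arguments rather than embedding them in the format string keeps any % characters in the impact description from being interpreted by printf.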

This page was last reviewed on 7 September 2024. It needs to be reviewed again on 7 March 2025 by the page owner #interventions-dev.