
ALZ Incident Management Approach

This page sets out a general process for incident management in the ALZ team, and defines SLAs and process flows so that incidents are managed consistently.

Service Level Agreement

The ALZ Team expects that in the course of normal working, incidents will be resolved on the following timescales:

Major/extensive incidents: Resolution or escalation within 1 day, updates at minimum every 4 working hours/half a day

Significant incidents: Resolution or escalation within 3 days, updates at minimum daily

All other incidents: Resolution or escalation within 5 days, updates at minimum every 2 working days
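As an illustration only, the timescales above could be held as a small lookup table, for example to drive reminders. This is a sketch rather than an agreed ALZ tool: the names `Sla`, `SLAS` and `sla_for` are hypothetical, and update intervals are converted to working hours on the basis that half a day is 4 working hours, as stated above.

```python
from dataclasses import dataclass

# Taken from the SLA list above: "4 working hours/half a day" implies an
# 8-working-hour day.
WORKING_HOURS_PER_DAY = 8


@dataclass(frozen=True)
class Sla:
    resolve_or_escalate_days: int  # days allowed before resolution or escalation
    update_interval_hours: int     # maximum working hours between updates


# Hypothetical keys; the values mirror the three SLA lines above.
SLAS = {
    "major/extensive": Sla(1, 4),
    "significant": Sla(3, 1 * WORKING_HOURS_PER_DAY),
    "other": Sla(5, 2 * WORKING_HOURS_PER_DAY),
}


def sla_for(severity: str) -> Sla:
    """Return the SLA for a severity, falling back to 'all other incidents'."""
    return SLAS.get(severity, SLAS["other"])
```

For example, `sla_for("significant")` gives a 3 day resolution target with an update due every 8 working hours, and any unrecognised severity is treated as "all other incidents".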

Incident types

The ALZ team may develop their own incident management process over time, but for now we will use the PTTP definitions.

A more ‘plain English’ description of how the ALZ team will define incidents is as follows:

Major/extensive

Major/extensive incidents would usually be any incident where:

  • There is a risk of unauthorised non-MoJ access to sensitive data

  • There is unresolved unscheduled downtime in a system used by more than ~100 users/day

  • There is a large potential cost implication

When resolving major/extensive incidents, the incident comms manager should update the reporting party and any relevant stakeholders at minimum every 4 working hours/half a day.

Significant

Significant incidents would usually be any incident where:

  • There is a risk of unauthorised access by users within the MoJ to sensitive data

  • There is unresolved unscheduled downtime in a system used by more than ~50 users/day

  • There is a medium potential cost implication

When resolving significant incidents, the incident comms manager should update the reporting party and any relevant stakeholders at minimum every working day.

Moderate

Moderate incidents would usually be any incident where:

  • A resource is unavailable or behaving anomalously

  • Other general incidents

When resolving moderate incidents, the incident comms manager should update the reporting party and any relevant stakeholders at minimum every 3 working days.
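The three definitions above can be read as a simple decision order: check the major/extensive criteria first, then the significant ones, and treat everything else as moderate. The sketch below shows that order. It is an assumption-laden illustration, not a replacement for judgement: the `IncidentFacts` fields and the "low"/"medium"/"large" cost scale are hypothetical, and the user thresholds are approximate in the original (~100 and ~50).

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    # All field names are hypothetical, chosen to mirror the bullets above.
    external_data_access_risk: bool = False  # unauthorised non-MoJ access to sensitive data
    internal_data_access_risk: bool = False  # unauthorised access by users within the MoJ
    unresolved_downtime: bool = False        # unresolved unscheduled downtime
    daily_users: int = 0                     # users/day of the affected system
    cost_impact: str = "low"                 # assumed scale: "low", "medium", "large"


def classify(facts: IncidentFacts) -> str:
    """Return the incident type, checking the most serious criteria first."""
    if (
        facts.external_data_access_risk
        or (facts.unresolved_downtime and facts.daily_users > 100)
        or facts.cost_impact == "large"
    ):
        return "major/extensive"
    if (
        facts.internal_data_access_risk
        or (facts.unresolved_downtime and facts.daily_users > 50)
        or facts.cost_impact == "medium"
    ):
        return "significant"
    # A resource that is unavailable or behaving anomalously, or any other
    # general incident.
    return "moderate"
```

For example, unresolved downtime on a system with around 60 users a day would come out as "significant", while the same downtime on a system with 10 users a day would be "moderate". Because the thresholds are approximate, borderline cases should still be discussed with the team.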

Incident management process

The ALZ team is aiming for a simplified process that meets the team’s current needs and can be expanded as required.

When something that might be an incident occurs, or we are told about a problem, we…

Determine whether it’s an incident

If we think it’s an incident, we will create a dedicated ALZ Incident channel, named after the incident, in the existing Azure Landing Zone Teams. If unsure, consult other team members and take into account the view of any available stakeholders.

Log the incident

If it’s not an incident, handle the problem another way. If it has been reported to ALZ as an incident but we don’t think it is one, log the non-incident in the Teams channel for visibility and clarity.

If it is an incident, log it in the Teams channel. Use your judgement on what information is most relevant, but it’s always worth logging:

  • When the incident was reported/discovered

  • The discoverer/reporter’s contact details

  • Steps to replicate, if known

The post in Teams about the incident should begin a thread about the incident to which further updates can be added.
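One way to keep the opening post consistent is to fill in the same fields each time. The sketch below builds the text of such a post from the three items listed above. It only formats text and does not call Teams; the `IncidentLogEntry` class, its field names and the layout of the post are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class IncidentLogEntry:
    # Hypothetical structure covering the "always worth logging" items above.
    title: str
    reported_at: datetime          # when the incident was reported/discovered
    reporter_contact: str          # the discoverer/reporter's contact details
    steps_to_replicate: List[str] = field(default_factory=list)

    def to_post(self) -> str:
        """Build the plain-text body of the first post in the incident thread."""
        lines = [
            f"Incident: {self.title}",
            f"Reported/discovered: {self.reported_at:%Y-%m-%d %H:%M}",
            f"Reporter contact: {self.reporter_contact}",
        ]
        if self.steps_to_replicate:
            lines.append("Steps to replicate:")
            lines += [
                f"  {number}. {step}"
                for number, step in enumerate(self.steps_to_replicate, start=1)
            ]
        else:
            lines.append("Steps to replicate: not yet known")
        return "\n".join(lines)
```

The resulting text can be pasted into the Teams channel as the first post, with later updates added as replies in the same thread.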

Triage the incident

Estimate how large the incident is, who might be involved in resolving it, and whether the team need to do any initial communication about it.

Decide who will work on the incident

This might be any member of the team, but is most likely the person who declared and/or logged the incident. For larger incidents, this may be more than one person.

The primary tasks in handling the incident are communication and resolution - both are usually necessary to prevent the problem happening again. Somebody should be responsible for each of these tasks.

For a smaller incident, this might typically be one person. For a larger incident a coordinator should ensure that communication happens both between the people working on resolving the incident and to any necessary stakeholders. Somebody, typically whoever is leading on communications, should post these details in the incident thread.

Resolve the incident

The incident handling cycle should include resolution and communication.

If you are handling the incident, ensure you are frequently communicating with the team and/or stakeholders as appropriate. If you have to step away from the incident there should be sufficient detail in your notes that somebody else could pick it up in your place.

Typical aspects of resolution include:

  • Note down steps taken: If you are working on resolving an incident, keep a running log of the steps you have taken. This isn’t for the customer, but for your fellow engineers should you need to step away or should this need to be replicated.

  • Communication: If you are managing communication about an incident, check in regularly enough to make sure you know what everyone working on the incident is doing. Find out if there are stakeholders you need to notify about the incident, and keep them informed regularly.

  • Resolution: Fix the problem!

  • Escalation: If there are problems in resolving the incident, ask for help within the wider ALZ team. If the problem needs to be escalated, make sure this happens. It’s sensible to check in about the need to escalate at least once a day.

Post-incident review

There should always be a post-incident review. This doesn’t need to be formal, and it’s important that it’s never about blame. This is an opportunity to review what happened and see if the team can make any changes to prevent further incidents.

This review should:

  • Take place within a week of the incident (so it’s fresh in people’s minds!)

  • Involve everyone who worked on the incident

  • Take everyone’s input as an equal

  • Last approximately 30 minutes for the initial review, though less is fine

  • Be written down after the meeting
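If the team wants a reminder for the review, the "within a week" guideline can be turned into a due date. This is a minimal sketch that assumes "a week" means 7 calendar days from the date of the incident; the function names are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "within a week" is read as 7 calendar days.
REVIEW_WINDOW = timedelta(days=7)


def review_due_by(incident_date: date) -> date:
    """Latest date on which the post-incident review should take place."""
    return incident_date + REVIEW_WINDOW


def review_is_overdue(incident_date: date, today: date) -> bool:
    """True once the review window has passed without a review."""
    return today > review_due_by(incident_date)
```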

This page was last reviewed on 3 February 2023. It needs to be reviewed again on 3 May 2023.
This page was set to be reviewed before 3 May 2023. This might mean the content is out of date.