Disaster recovery

This runbook outlines the steps to take in the event of a major incident and links to the recovery procedures for individual products.

General incident scenarios

Types of incident that would be considered major are:

  • Complete loss of a product
  • Loss of an AWS account
  • Loss of networking

Other failures which will not affect the availability of an application, but will impact a team's ability to deploy:

  • Loss of Terraform state
  • Loss of access
  • Loss of AWS CodePipelines and GitHub Actions
  • Loss of GitHub
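
The two groups above can be captured as a small lookup, for example to tag incident records consistently. This is only a sketch: the `Severity` enum, the `SCENARIOS` mapping and the `classify` helper are hypothetical names used for illustration and are not part of any existing tooling.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    MAJOR = "major"                 # affects the availability of a product
    DEPLOYMENT_ONLY = "deployment"  # affects the ability to deploy only


# Scenario names are copied from the two lists above; the keys are illustrative.
SCENARIOS = {
    "complete loss of a product": Severity.MAJOR,
    "loss of an aws account": Severity.MAJOR,
    "loss of networking": Severity.MAJOR,
    "loss of terraform state": Severity.DEPLOYMENT_ONLY,
    "loss of access": Severity.DEPLOYMENT_ONLY,
    "loss of aws codepipelines and github actions": Severity.DEPLOYMENT_ONLY,
    "loss of github": Severity.DEPLOYMENT_ONLY,
}


def classify(scenario: str) -> Optional[Severity]:
    """Return the category for a listed scenario, or None if it is not listed."""
    return SCENARIOS.get(scenario.strip().lower())
```

An unlisted scenario returns `None` rather than a guess, so the person handling the incident still decides how to treat it.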

Identification of an incident

An incident may be identified from various sources, such as users, alerts or other channels.

Gather information

Gather all the information on the incident that you can and record it.

  • Who is affected?
  • What is affected?
  • What is the impact?
  • When did this start?
  • Were there any changes recently? (link to pull request or commit)
  • Are there any AWS Health Dashboard issues?
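
One way to keep the answers to these questions together is a small record that also reports which questions are still unanswered. The sketch below is an assumption about how that could look; `IncidentRecord` and all of its field and method names are hypothetical and do not refer to an existing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class IncidentRecord:
    """Answers to the questions above. All names here are illustrative."""
    who_is_affected: str = ""
    what_is_affected: str = ""
    impact: str = ""
    started_at: Optional[datetime] = None  # None until the start time is known
    recent_changes: List[str] = field(default_factory=list)     # PR or commit links
    aws_health_issues: List[str] = field(default_factory=list)  # dashboard findings

    def missing(self) -> List[str]:
        """Return the names of the questions that have no answer yet."""
        gaps = [
            name
            for name in ("who_is_affected", "what_is_affected", "impact")
            if not getattr(self, name).strip()
        ]
        if self.started_at is None:
            gaps.append("started_at")
        return gaps

    def summary(self) -> str:
        """Render the record as plain text, one question per line."""
        started = self.started_at.isoformat() if self.started_at else "unknown"
        lines = [
            f"Who is affected: {self.who_is_affected or 'unknown'}",
            f"What is affected: {self.what_is_affected or 'unknown'}",
            f"Impact: {self.impact or 'unknown'}",
            f"Started: {started}",
            f"Recent changes: {', '.join(self.recent_changes) or 'none recorded'}",
            f"AWS health issues: {', '.join(self.aws_health_issues) or 'none recorded'}",
        ]
        return "\n".join(lines)
```

Empty lists for recent changes and AWS health issues are not reported as missing, because "none found" is a valid answer to those two questions.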

Communicate with the team

Post in the #ask-nvvs-devops channel using @nvvs-devops-team to make all team members aware.

Communicate regularly with users to keep them aware of progress, even if there is no resolution yet.

If the incident is a security incident, report it to the security team.

Log a support ticket

If the incident cannot be resolved within the team, or if the issue lies with a third party, log a support ticket with that third party. For AWS support, log a support case in the affected AWS account.

3rd Party | How to log a support ticket | Escalation process
AWS       | Creating a support case     | On the case, or post in #ext-awssupport

Resolution

  • When the incident is resolved, let users know.
  • Arrange a time for an incident retro so that any lessons from the incident can be learned.

Product recovery

This section includes information specific to recovering each of our products.

DHCP DNS recovery

This repo contains an interactive script which can be used to roll back a corrupt config file for the DNS or DHCP services.

Network Access Control recovery

This repo contains an interactive script which can be used to roll back a corrupt config or container version for the Network Access Control service.

SMTP Relay recovery

This repo contains information to recover the SMTP relay service.

This page was last reviewed on 18 April 2024. It needs to be reviewed again on 18 October 2024 by the page owner #nvvs-devops.