Disaster recovery
This runbook outlines the steps to take in the event of a major incident and links to the recovery documentation for individual products.
General incident scenarios
Types of incident that would be considered major are:
- Complete loss of a product
- Loss of an AWS account
- Loss of networking
Other failures will not affect the availability of an application, but will impact a team's ability to deploy:
- Loss of Terraform state
- Loss of access
- Loss of AWS CodePipelines and GitHub Actions
- Loss of GitHub
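If incidents are tagged in tooling, the two groups of scenarios above can be captured as a simple lookup. The following is a minimal sketch in Python; the `Category` enum, the `SCENARIOS` mapping and the `classify` helper are hypothetical names used for illustration and are not part of any existing team tooling.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    MAJOR = "major"                  # the product itself is unavailable
    DEPLOY_IMPACT = "deploy-impact"  # product still available, deployments blocked


# Scenario names come from the lists above; the lower-cased keys are illustrative.
SCENARIOS = {
    "complete loss of a product": Category.MAJOR,
    "loss of an aws account": Category.MAJOR,
    "loss of networking": Category.MAJOR,
    "loss of terraform state": Category.DEPLOY_IMPACT,
    "loss of access": Category.DEPLOY_IMPACT,
    "loss of aws codepipelines and github actions": Category.DEPLOY_IMPACT,
    "loss of github": Category.DEPLOY_IMPACT,
}


def classify(scenario: str) -> Optional[Category]:
    """Return the category for a listed scenario, or None if it is not listed."""
    return SCENARIOS.get(scenario.strip().lower())
```

An unlisted scenario returns `None`, which leaves the decision to whoever is handling the incident rather than guessing a category.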
Identification of an incident
An incident may be identified from various sources, such as user reports, alerts or other channels.
Gather information
Gather all the information on the incident that you can and record it.
- Who is affected?
- What is affected?
- What is the impact?
- When did this start?
- Were there any changes recently? (link to pull request or commit)
- Are there any issues shown on the AWS Health Dashboard?
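One way to record the answers consistently is a small structured record with one field per question. This is a minimal sketch in Python, assuming nothing about existing tooling; the `IncidentRecord` class and its field names are hypothetical. A value of `None` means the question has not been answered yet.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import List, Optional


@dataclass
class IncidentRecord:
    # One field per question in the checklist above; None means "not yet known".
    who_is_affected: Optional[str] = None
    what_is_affected: Optional[str] = None
    impact: Optional[str] = None
    started_at: Optional[datetime] = None
    recent_changes: Optional[List[str]] = None  # links to pull requests or commits
    aws_health_issues: Optional[str] = None

    def unanswered(self) -> List[str]:
        """Return the names of the questions that still have no answer."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]
```

For example, a record created with only `who_is_affected` and `impact` filled in reports the other four fields from `unanswered()`, which gives a quick view of what still needs to be gathered. An empty list for `recent_changes` is treated as an answer ("no recent changes found"), not as a gap.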
Communicate with the team
Post in the #ask-nvvs-devops channel using @nvvs-devops-team to make all team members aware.
Communicate regularly with users to keep them aware of progress, even if there is no resolution yet.
If the incident is a security incident, report it to the security team.
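Regular updates are easier to follow when they share a consistent shape. The sketch below formats an update line in Python; it only builds the text and does not post anywhere. The `format_update` function and the message layout are assumptions for illustration, while the team handle is the one named in this runbook.

```python
from datetime import datetime, timezone

TEAM_HANDLE = "@nvvs-devops-team"  # handle named in this runbook


def format_update(summary: str, status: str, now: datetime, resolved: bool = False) -> str:
    """Build a one-line incident update with a UTC timestamp."""
    # `now` must be timezone-aware so the conversion to UTC is unambiguous.
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    state = "RESOLVED" if resolved else "ONGOING"
    return f"{TEAM_HANDLE} [{state}] {stamp}: {summary}. Current status: {status}"
```

Calling it with `resolved=True` produces the closing message for the same incident, so the first and last updates in the channel are easy to pair up.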
Log a support ticket
If the incident cannot be resolved within the team, or if the issue lies with a third party, log a support ticket with that third party. For AWS support, log a case in the affected AWS account.
| 3rd Party | How to log a support ticket | Escalation process |
|---|---|---|
| AWS | Creating a support case | On the case, or post in #ext-awssupport |
Resolution
- When the incident is resolved, let users know.
- Arrange a time for an incident retro so that any lessons from the incident can be learned.
Product recovery
This section includes information specific to recovering each of our products.
DHCP DNS recovery
This repo contains an interactive script which can be used to roll back a corrupt config file for the DNS or DHCP services.
Network Access Control recovery
This repo contains an interactive script which can be used to roll back a corrupt config or container version for the Network Access Control service.
SMTP Relay recovery
This repo contains information on recovering the SMTP relay service.