Alerts Quick Guide
Most of the information can be found in this Runbook
This is a quick guide about the common alerts you may come across in the #form-builder-alerts channel.
Alerts in this channel are high severity and affect live environment. Whereas alerts in the low severity channel concerns the test environment.
Application Related Alerts
Failed Delayed Jobs
This is something that needs to be investigated. This issue can occur for submitter or the editor.
A walk through of how to investigate for the submitter is found in this documentation.
For the editor, the guide can be found here.
Client Error Response
These usually resolve within 5 minutes of firing. If there are a number (say 5 or more) of these alerts in short succession, it could be a bot trying to get into the service.
Slow Responses
These usually resolve quickly after firing with no intervention needed.
Kubernetes Related Alerts
Kube Namespace Quota Nearing
We have an alert that lets us know when we hit 80%. If this goes off it could be because the pods are creating and draining, which is normal. To check this run:
kubectl get pods -n <namespace>
This will give you all the pods in a namespace. There should be 2 for each form, if there are some that are restarting then you may have more than 2 - which could be why the quota has been reached. If this is the case, then there is no issue. If there are no pods restarting in the list, the best to chat with the team about whether we need to increase the number of pods.
Kube Pod Crash Looping
Each form has 2 pods, so the form should still be running despite a pod being down. Check the form, then check the logs. A way to fix this could be to restart the problem pod.
Documentationon checking the logs and restarting the pods.
Namespace Missing
To check whether the namespace is missing:
kubectl get namespace | grep < namespace >
If the namespace is there, then there isn’t much to worry about. If it doesn’t resolve itself and the namespace is there, it might be worth flagging with cloud platform. If the namespace is in fact not there, then we have a problem and will need to let the other devs know.
Kube Pod Not Ready
There are always 2 pods for every form, as such this alert is generally nothing to worry about. This should resolve itself once the pod is up and running.
Testable Editors
Sometime we will see a ‘Kube Pod Not Ready’ error for a testable editor pod.
When an editor branch has a prefix of testable
every push on that branch creates a web and worker pod.
If a testable editor branch is merged into main, the deployment pipeline will trigger a deletion of the testable branch and the associated pods.
However, sometimes testable branches may not need to be merged in and are deleted without being merged, which leaves us with testable editor pods that may trigger alerts.
If a testable editor pod is causing alerts and the testable branch is stale or has been deleted, we will need to manually remove the deployment resource.
To check what deployment resources are running:
kubectl get deployments -n <namespace>
For each testable editor branch, there will be an associated web and worker deployment, for example:
NAME READY UP-TO-DATE AVAILABLE AGE
testable-my-branch-web-test 2/2 2 2 5d
testable-my-branch-workers-test 2/2 2 2 5d
Once we are happy that the testable editor is no longer being used, we can remove both the web and worker deployment resources.
kubectl delete deployments -n <namespace> <deployment>
For example:
kubectl delete deployments -n formbuilder-saas-test testable-my-branch-web-test
Sentry related alerts
All Sentry alerts are found on the Sentry dashboard. Please ensure the alert is resolved once you have completed your investigation, otherwise we will not be alerted for the unresolved error in future.
Faraday::Unprocessable Entity Error
This is the metadata-api returning a 422, this happens when either:
- The user has tried to create a page with a URL that already exists
- The editor is trying to send metadata to the API that it shouldn’t be
If the issue is the former, then that is not anything we need to investigate.
However, if it is due to the Editor sending metadata it should not be sending, we will need to investigate the issue further.
Action View Template Error
This could be due to a form owner interacting with a form that was made prior to breaking changes were introduced. This is not a serious issue, we may need to ask the form owner to re-publish a form to get the latest changes.
Net Read Timeout
This usually occurs on the HMCTS complaints adapter. Generally it is a communication error with Optics, it is best to keep an eye out for any ‘Failed Delay Jobs’ alerts as this will point to the job not being processed. We can also check if there any delayed jobs on the api pods of the hmcts-complaints-formbuilder-adapter-production
namespace by running Delayed::Job.count
in a rails console.
There is a Sentry alert - how do I find out which form is causing the error?
When a Sentry alert is triggered, general information regarding the alert can be accessed via the Sentry dashboard. There will be a URL that is provided in the Sentry Issue, this will have the service ID in the path.
We can search for this service ID in the Admin feature of the Editor. Once you have identified the service in the Admin dashboard you will be able to pinpoint the form and user that relate to the issue. If required, you can copy the form metadata to debug further.