Delayed Job Failures

Form Builder submissions are put onto a queue to be processed by the Submitter. Sometimes these jobs fail, which triggers a FailedDelayedJob alert that is sent to the #form-builder-alerts channel.

Find the correct environment

Sometimes the environment the alert occurred in is not output to Slack. You can check the Form Builder Grafana dashboard to see where it occurred. Failed Submitter Delayed Jobs is the top-left panel. Toggle platform_env and deployment_env to move between environments.

Locate a Submitter API pod

Once you know which environment you need to check, list the pods in that namespace to find the Submitter API pod:

kubectl get pods -n formbuilder-platform-<platform_env>-<deployment_env>

For example:

kubectl get pods -n formbuilder-platform-test-dev

Check the logs

Get the Submitter API pod name from the list generated by the above command, then take a look at the logs from around the time of the alert:

kubectl logs -n formbuilder-platform-live-dev fb-submitter-api-live-dev-<identifier>

The identifier is generated when the pod is deployed, so it is different each time. The logs should hopefully give you an idea of what happened.

Check failure message in the Delayed Job

You can also log into one of the Submitter API pods and take a look at the error message attached to the failed Delayed Job:

kubectl exec -ti -n formbuilder-platform-test-dev fb-submitter-api-test-dev-<identifier> -- bash

Then run a Rails console:

bundle exec rails c

You can see the jobs that are currently stuck on the queue:

Delayed::Job.all

or, to narrow it down to just today's failed jobs:

Delayed::Job.where(created_at: Time.zone.now.beginning_of_day..Time.zone.now.end_of_day)

The output will show you the number of retries and also show the error message generated when the job failed.

You can also see the stack trace of the error:

Delayed::Job.last.last_error
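If you want a quick summary of today's failures from the same console, something like the following should work (a hedged sketch; it relies only on the standard delayed_job columns attempts and last_error):

Delayed::Job.where(created_at: Time.zone.now.beginning_of_day..Time.zone.now.end_of_day).find_each do |job|
  puts "#{job.id}: #{job.attempts} attempts"
  puts job.last_error&.lines&.first # first line of the error, if any
end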

Try this first

There have been some instances where Kubernetes has not successfully drained certain pods automatically. A common side effect is that a submission fails because the Submitter is unable to retrieve attachments from the filestore, which responds with a 403 error.

If that is the error you are seeing on the failed job, it is worth restarting the pods first to see whether the queue drains naturally. You can get a list of the deployments for a given namespace by running:

kubectl get deployments -n formbuilder-platform-test-dev

Then, to gracefully restart the pods, pass the name of the deployment. For example, for the deployment fb-user-filestore-test-dev:

kubectl rollout restart deployments -n formbuilder-platform-test-dev fb-user-filestore-test-dev

It is at least worth rolling the filestore pods, and it may be worth rolling the Submitter pods too. If that still doesn't solve the problem then it might need a manual replay and/or further investigation.

Removing a problem attachment

Sometimes an attachment prevents a submission from being processed; most commonly this is because the attachment has a MIME type that is not recognised by our platform.

Unfortunately, we will need to remove the problem attachment before the submission can be replayed, and inform the form owner that we have done so.

Before removing the attachment, ensure you have the following information:

  • Some way to identify the submission (Submission ID or Case Number)
  • File name for the failed attachment
  • The form/service that was used to submit the form

Once we have this information, we can remove the attachment from the submission.

Save the decrypted submission to another variable:

updated_submission = Submission.find(submission_id).decrypted_submission

Find the problematic attachment in the ‘attachments’ key:

updated_submission['attachments']

Remove the attachment accordingly.
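For example, if the problem file was reported as problem.pdf, something along these lines should work (a hedged sketch: the filename is hypothetical and the exact key inside each attachment hash may differ, so check it against the output of the previous step first):

updated_submission['attachments'].reject! { |attachment| attachment['filename'] == 'problem.pdf' }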

You can double check the attachments:

updated_submission['attachments']

Once you are happy that you have removed the problem attachment, you will need to update and re-encrypt the submission (see Updating and Re-encrypting the submission below).

Then you can replay the job (see Replaying failed jobs below).

Once the job has been replayed successfully, we will need to inform the form owners so they can contact the end user about their attachment.

Updating and Re-encrypting the submission

Once you are happy that it is ok to replay the failed job you will need to re-encrypt the payload of the Submission related to the Delayed Job.

For example if the Delayed Job you need to replay is the last one:

First find the submission_id of the last Delayed Job. The following will show the handler:

Delayed::Job.last.handler

You should see a key value pair in the handler that looks something like:

submission_id: 0c1bf45b-1e13-4cc4-ab0a-5bba20eb1b0f\n

Save the submission_id, not including the \n:

submission_id = "0c1bf45b-1e13-4cc4-ab0a-5bba20eb1b0f"
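Alternatively, you could extract it from the handler programmatically rather than copying it by hand (a hedged sketch, assuming the submission_id appears in the handler exactly as in the example above):

submission_id = Delayed::Job.last.handler[/submission_id: ([0-9a-f-]+)/, 1]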

Find the submission and save this:

submission = Submission.find(submission_id)

The payload is encrypted, so you will need to decrypt it before you can re-encrypt it. If you need to make any changes to the payload because it is invalid, this is the time to do it.

submission.update(payload: SubmissionEncryption.new.encrypt(submission.decrypted_submission))
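If you edited the payload earlier (for example the updated_submission variable from the attachment section above), encrypt the edited hash instead of the freshly decrypted one, using the same calls:

submission.update(payload: SubmissionEncryption.new.encrypt(updated_submission))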

Replaying failed jobs

There are 3 rake tasks that can be run to replay Delayed Jobs. These need to be run from outside the Rails console, but still on the Submitter container, in the normal way. One replays submissions without any attachments. For example, for a submission with the ID '12345':

bundle exec rake replay_submission:process['12345']

Another is for replaying submissions which have attachments. The filestore only validates JWTs that were created 60 seconds or less ago, so to replay submissions with attachments after that window we need to override the leeway using the rake task. To override with an 8000-second leeway/timeout:

bundle exec rake replay_submission:with_attachments['12345',8000]

If the rake task fails, try increasing the leeway/timeout. The leeway needs to take into account the time of the submission; a leeway of 87000 seconds is enough to cover a 24-hour window.

All being well, you will no longer see that Delayed Job in the queue. Great success!

If for some reason you are running these rake tasks using zsh, you need to wrap the rake command in quotes:

bundle exec rake 'replay_submission:process[12345]'

Replaying failed HMCTS jobs

The HMCTS Adapter also has a queue where failed jobs sit. The adapter sits in its own namespace:

kubectl get pods -n hmcts-complaints-formbuilder-adapter-staging

Connect to a pod in the same way as for the Submitter:

kubectl exec -ti -n hmcts-complaints-formbuilder-adapter-staging hmcts-complaints-formbuilder-adapter-api-staging-<identifier> -- bash

You can then run a rake task to list the Submission IDs of the failed jobs:

bundle exec rake list_failed_submission_ids

These Submission IDs are the payload submission IDs that are generated by the Runner and stored in the JSON blob in the Submitter database. However, the Submitter treats a Submission ID as the ID of the row in its own database table.

Therefore, to replay submissions from the Submitter, the task below matches the IDs passed in against the IDs inside the JSON payload:

bundle exec rake replay_hmcts_adapter_submission:process[12345]

This re-submits each Submission as a brand new Job to the HMCTS Adapter, in which case you should also delete the old ones from the Adapter's Delayed::Job queue. The list_failed_submission_ids task also prints out the Job ID, so you can use that to run Delayed::Job.delete(job_id) from inside the Rails console. Carefully, of course!
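Before deleting anything it is worth double-checking that you have the right job, for example (a hedged sketch using standard ActiveRecord calls):

job = Delayed::Job.find(job_id)
puts job.handler # confirm it references the submission you have just replayed
job.destroy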
