Re-run stuck jobs
Observed behaviour
We receive a KubeJobFailed
alert from AlertManager:
Checking the logs, we see a recoverable error:
$ kubectl logs --selector=job-name=db-refresh-job-27744960 --tail=-1
{snip}
pg_dump: [archiver (db)] connection to database "snip" failed: ⏎
could not translate host name "snip" to address: ⏎
Temporary failure in name resolution
Desired behaviour
We decide the error is intermittent, e.g. hostname resolution problems, and want to re-run the job.
Actions
First, get the job name:
$ kubectl get jobs --show-kind
NAME COMPLETIONS DURATION AGE
job.batch/db-refresh-job-27744960 0/1 7h9m 7h9m
Then, re-create the job with the same specs:
$ kubectl get job.batch/db-refresh-job-27744960 -o json | \
jq 'del(.spec.selector)' | \
jq 'del(.spec.template.metadata.labels)' | \
kubectl replace --force -f -
job.batch "db-refresh-job-27744960" deleted
job.batch/db-refresh-job-27744960 replaced
This page was last reviewed on 5 March 2025.
It needs to be reviewed again on 5 March 2026
by the page owner #interventions-dev
.
This page was set to be reviewed before 5 March 2026
by the page owner #interventions-dev.
This might mean the content is out of date.