Stale Lock Reaper
When a worker crashes mid-job, the job stays in “running” status with no one to complete or fail it. The stale lock reaper is a background mechanism that detects these abandoned jobs and resets them so another worker can pick them up.
How Job Locking Works
Every time a worker dequeues a job, it acquires an exclusive lock on the job record in the database:
- The storage layer sets `LockedBy` to the worker's unique ID and `LockedUntil` to 45 minutes from now.
- While the job is executing, the worker sends a heartbeat every 2 minutes that extends `LockedUntil` by another 45 minutes. This keeps the lock fresh for long-running jobs without requiring an excessively long initial lock window.
- When the job completes or fails, the worker clears the lock fields.
If a worker crashes before it can clear the lock, the `LockedUntil` timestamp eventually lapses and the job becomes reclaimable.
The Stale Lock Reaper
Each worker starts a background goroutine called the stale lock reaper. It performs the following cycle:
- Tick – wake up on a configurable interval (default: every 5 minutes).
- Scan – query the database for jobs whose status is `running` and whose `LockedUntil` timestamp is older than the current time minus `StaleLockAge` (default: 45 minutes).
- Reset – set those jobs back to `pending` and clear `LockedBy` and `LockedUntil` so another worker can dequeue them.
- Log – if any jobs were reclaimed, emit a structured log line:

```
INFO released stale running jobs count=N
```

Because every worker runs its own reaper, the cluster self-heals even if only one worker remains online.
Configuration
Both tuning knobs are set through worker options:
```go
worker := queue.NewWorker(
    jobs.WithStaleLockInterval(5 * time.Minute), // How often to check (default: 5min)
    jobs.WithStaleLockAge(45 * time.Minute),     // Lock expiry threshold (default: 45min)
)
```

Disabling the Reaper
If you run a single worker and prefer to handle stale jobs through external monitoring, you can disable the reaper entirely:
```go
worker := queue.NewWorker(
    jobs.WithStaleLockInterval(0), // Disable the stale lock reaper
)
```

Choosing Values
| Parameter | Default | Guidance |
|---|---|---|
| `WithStaleLockInterval` | 5 min | Lower values detect stale jobs faster but add more database queries. For high-throughput clusters, 2–3 minutes is reasonable. |
| `WithStaleLockAge` | 45 min | Must be at least as large as the lock duration (45 min), so a lock is never reaped while it is still live and a heartbeat could still extend it. Increase this if your jobs routinely run for hours and you want extra safety margin. |
The default 45-minute age matches the lock duration set by the storage layer. During normal operation, heartbeats extend the lock every 2 minutes, so an active job's lock always sits well within the 45-minute window and is never reclaimed by the reaper.
When the Reaper Helps
The reaper is your safety net against several failure modes:
- Worker process crash or SIGKILL – the process is gone, no cleanup runs. The lock expires naturally and the reaper resets the job.
- Network partition between worker and database – the worker cannot send heartbeats, so the lock expires. Once the partition heals, the reaper (running on any healthy worker) reclaims the job.
- Long GC pause or resource starvation – if a worker is paused by the operating system long enough for the lock to expire, the reaper on a different worker can reclaim the job.
In all of these cases the job returns to pending status and will be retried by
the next available worker, preserving any checkpoints that were saved before the
failure.
Interaction with Heartbeats and Retries
The heartbeat, lock, and reaper work together as a layered reliability mechanism:
```
Heartbeat interval:  2 min  (keeps lock fresh)
Lock duration:      45 min  (initial window before heartbeat required)
Reaper age:         45 min  (how stale before reclaim)
Reaper interval:     5 min  (how often we check)
```

A job can only be reclaimed if all of the following are true:
- Its status is `running`.
- Its `LockedUntil` is at least `StaleLockAge` in the past.
- No heartbeat has extended the lock in that window.
When a reclaimed job is dequeued again, its `Attempt` counter increments normally. If it has already exhausted `MaxRetries`, the next failure will mark it as permanently failed.