Stalled Job Detection
What are Stalled Jobs?
Stalled jobs occur when a worker crashes or loses its connection while processing a job. Without detection and recovery, these jobs would remain stuck in the processing state indefinitely, blocking the queue.
How It Works
- Background Checker: Runs every stalledInterval (default: 30 seconds)
- Detection: Finds jobs in processing state past their deadline + grace period
- Recovery: Moves stalled jobs back to waiting state for retry
- Failure: After maxStalledCount stalls, jobs are permanently failed
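The steps above can be sketched as a pure function. This is a minimal illustration, not the library's implementation: the `ProcessingJob` shape, its `deadline` and `stalledCount` fields, and the `classifyStalled` name are hypothetical. The exact boundary is also an assumption: the sketch recovers a job until its stall count would exceed `maxStalledCount`, then fails it.

```typescript
// Hypothetical shapes for illustration only; not the library's internal types.
interface ProcessingJob {
  id: string
  deadline: number // epoch ms by which the job should have finished
  stalledCount: number // how many times this job has stalled before
}

interface StalledOptions {
  maxStalledCount: number
  stalledGracePeriod: number // ms
}

interface CheckResult {
  recovered: string[] // moved back to waiting
  failed: string[] // permanently failed
}

function classifyStalled(
  jobs: ProcessingJob[],
  checkTime: number,
  opts: StalledOptions,
): CheckResult {
  const result: CheckResult = { recovered: [], failed: [] }
  for (const job of jobs) {
    // Detection: still within deadline + grace period, so not stalled.
    if (checkTime <= job.deadline + opts.stalledGracePeriod) continue
    // Assumed boundary: fail once the stall count would exceed the limit.
    if (job.stalledCount + 1 > opts.maxStalledCount) {
      result.failed.push(job.id)
    } else {
      result.recovered.push(job.id)
    }
  }
  return result
}

const outcome = classifyStalled(
  [
    { id: 'a', deadline: 99000, stalledCount: 0 }, // past deadline, first stall
    { id: 'b', deadline: 99000, stalledCount: 1 }, // past deadline, stalled once already
    { id: 'c', deadline: 120000, stalledCount: 0 }, // still within its deadline
  ],
  100000,
  { maxStalledCount: 1, stalledGracePeriod: 0 },
)
console.log(outcome) // job 'a' is recovered, job 'b' is failed, job 'c' is untouched
```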
Configuration
const worker = new Worker({
  queue,
  handler: async (job) => { /* ... */ },
  // Stalled job detection
  stalledInterval: 30000, // Check every 30 seconds
  maxStalledCount: 1, // Fail after 1 stall
  stalledGracePeriod: 0, // No grace period (increase to tolerate clock skew)
})
Monitoring
Listen to the stalled event to monitor and alert on stalled jobs:
worker.on('stalled', (jobId, groupId) => {
  console.warn(`Job ${jobId} from group ${groupId} was stalled`)
  // Alert your monitoring system
  metrics.increment('jobs.stalled', { groupId })
  // Note: Job has already been recovered automatically
})
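To turn individual stalled events into an alert, one option is a sliding-window counter. The sketch below is independent of the library: `StalledRateTracker`, the window size, and the threshold are hypothetical names and values chosen for illustration, and timestamps are passed in explicitly so the logic is easy to test.

```typescript
// Hypothetical helper; not part of the worker API.
class StalledRateTracker {
  private timestamps: number[] = []
  private readonly windowMs: number
  private readonly threshold: number

  constructor(windowMs: number, threshold: number) {
    this.windowMs = windowMs
    this.threshold = threshold
  }

  // Record one stalled event; returns true when the alert threshold is reached.
  record(eventTime: number): boolean {
    this.timestamps.push(eventTime)
    // Drop events that have fallen out of the window.
    const cutoff = eventTime - this.windowMs
    this.timestamps = this.timestamps.filter((t) => t > cutoff)
    return this.timestamps.length >= this.threshold
  }
}

const tracker = new StalledRateTracker(60000, 3) // alert on 3 stalls within 60 s
const alerts = [
  tracker.record(0),
  tracker.record(10000),
  tracker.record(20000), // third stall inside the window
  tracker.record(90000), // the earlier stalls have expired
]
console.log(alerts) // [ false, false, true, false ]
```

Inside a stalled handler, this could be called as `tracker.record(Date.now())`, paging only when it returns true.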
When Jobs Become Stalled
Jobs become stalled in production when:
- Worker process crashes (SIGKILL, OOM, etc.)
- Worker loses Redis connection permanently
- Worker hangs indefinitely in user code
- Server/container is terminated abruptly
Configuration Recommendations
Default (Most Use Cases):
{
  stalledInterval: 30000, // 30 seconds
  maxStalledCount: 1, // Fail after 1 stall
  stalledGracePeriod: 0, // No grace period
}
High Reliability:
{
  stalledInterval: 15000, // Check more frequently
  maxStalledCount: 2, // Allow 2 stalls before failing
  stalledGracePeriod: 5000, // 5s grace for clock skew
}
Low Overhead:
{
  stalledInterval: 60000, // Check every minute
  maxStalledCount: 1, // Fail fast
  stalledGracePeriod: 0, // No grace period
}
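One way to compare these profiles is the worst-case delay between a job passing its deadline and the checker noticing it. Assuming the checker fires exactly every stalledInterval and the check itself takes negligible time (both simplifications), that delay is at most stalledGracePeriod + stalledInterval. The helper name below is hypothetical.

```typescript
interface StalledProfile {
  stalledInterval: number // ms
  stalledGracePeriod: number // ms
}

// Upper bound on the time from a job's deadline to its detection,
// under the simplifying assumptions stated above.
function worstCaseDetectionDelayMs(p: StalledProfile): number {
  return p.stalledGracePeriod + p.stalledInterval
}

const profiles = [
  { name: 'default', stalledInterval: 30000, stalledGracePeriod: 0 },
  { name: 'highReliability', stalledInterval: 15000, stalledGracePeriod: 5000 },
  { name: 'lowOverhead', stalledInterval: 60000, stalledGracePeriod: 0 },
]

const delays = profiles.map((p) => worstCaseDetectionDelayMs(p))
profiles.forEach((p, i) => console.log(`${p.name}: ${delays[i]} ms`))
// default: 30000 ms
// highReliability: 20000 ms
// lowOverhead: 60000 ms
```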
Manual API
You can also manually check for stalled jobs:
// Check for stalled jobs manually
const result = await queue.checkStalledJobs()
console.log(`Found ${result.stalled} stalled jobs`)
Best Practices
Production Checklist
✅ Enable stalled job detection with appropriate intervals
✅ Monitor the stalled event for alerting
✅ Keep maxStalledCount low (1-2) for faster failure detection
✅ Test worker crashes in staging to verify recovery
Error Handling Pattern
const worker = new Worker({
  queue,
  handler: async (job) => {
    try {
      // Your job logic
      return await processJob(job.data)
    } catch (error) {
      // Application errors will be retried according to maxAttempts
      throw error
    }
  },
  // Stalled job detection
  stalledInterval: 30000,
  maxStalledCount: 1,
  maxAttempts: 3,
})
// Monitor stalled jobs
worker.on('stalled', (jobId, groupId) => {
  logger.warn(`Job ${jobId} stalled`)
  metrics.increment('jobs.stalled')
})
worker.on('failed', (job) => {
  logger.error(`Job ${job.id} failed permanently`, {
    attempts: job.attemptsMade,
    reason: job.failedReason,
  })
})
Troubleshooting
Jobs Keep Stalling
Symptoms: High stalled event frequency
Possible Causes:
- Workers crashing frequently - check logs
- jobTimeoutMs too short - increase timeout
- Long-running jobs - optimize or increase timeout
- Memory issues - check for leaks
Solutions:
- Increase jobTimeoutMs for long operations
- Increase stalledGracePeriod if clock skew is an issue
- Fix memory leaks causing worker crashes
- Scale horizontally with more workers
Jobs Timeout During Processing
Symptoms: Jobs fail with a timeout even though they are still being processed
Possible Causes:
- Heartbeat not running (blocking code)
- jobTimeoutMs too short
- Network issues preventing heartbeat
Solutions:
- Avoid blocking the event loop in handler
- Increase jobTimeoutMs
- Optimize slow database queries
- Use async operations properly
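As an illustration of the first solution, a long CPU-bound loop can be split into chunks that yield back to the event loop between batches, giving timers (such as a heartbeat) a chance to run. This is a generic Node.js pattern rather than a library feature; `processInChunks`, `yieldToEventLoop`, the chunk size, and the stand-in work function are assumptions made for the example.

```typescript
// Yield to the event loop so pending timers and I/O callbacks can run.
function yieldToEventLoop(): Promise<void> {
  return new Promise<void>((resolve) => {
    setTimeout(resolve, 0)
  })
}

// Hypothetical helper: apply `work` to every item, pausing after each chunk.
async function processInChunks<T, R>(
  items: T[],
  chunkSize: number,
  work: (item: T) => R,
): Promise<R[]> {
  const results: R[] = []
  for (let i = 0; i < items.length; i += chunkSize) {
    for (const item of items.slice(i, i + chunkSize)) {
      results.push(work(item))
    }
    // Give timers (for example a heartbeat) a chance to fire.
    await yieldToEventLoop()
  }
  return results
}

// Usage: an interval stands in for the heartbeat while the work runs.
let heartbeats = 0
const heartbeat = setInterval(() => {
  heartbeats++
}, 1)

const done = processInChunks([1, 2, 3, 4, 5, 6], 2, (n) => n * n).then((squares) => {
  clearInterval(heartbeat)
  console.log(squares, `heartbeats observed: ${heartbeats}`)
  return squares
})
```

A smaller chunk size yields more often at the cost of throughput; work that cannot be chunked is usually better moved off the main thread.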