
improvement(scheduler): raise per-tick claim budget to drain backlog#4567

Merged
TheodoreSpeaks merged 2 commits into staging from investigate/schedule-fail
May 12, 2026

Conversation

TheodoreSpeaks (Collaborator) commented May 12, 2026

Summary

  • MAX_CRON_CLAIMS 20 → 200; reserved workflow/job slots 10/10 → 100/100.
  • The scheduler picker was capped at 20 due items per tick. Once steady-state demand crossed ~20 schedules/min, the picker fell behind and never caught up — we saw 18+ hours of sustained "Processing 20 due items" ticks with a chronic backlog. A workflow scheduled for midnight today fired only at 11:32 UTC, because that is when its row finally reached the head of the queue.
  • Per-item processing is ~500ms, so 100 items still fit comfortably inside one 60s cron tick (~50s).
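The budget split described above can be sketched as follows. The constant names (MAX_CRON_CLAIMS, RESERVED_WORKFLOW_CLAIMS, RESERVED_JOB_CLAIMS) come from this PR, but splitClaimBudget and the exact shape of the overflow reallocation are illustrative assumptions, not the actual route.ts code.

```typescript
// Final values after this PR's second commit (see commit messages below).
const MAX_CRON_CLAIMS = 200;
const RESERVED_WORKFLOW_CLAIMS = 100;
const RESERVED_JOB_CLAIMS = MAX_CRON_CLAIMS - RESERVED_WORKFLOW_CLAIMS;

// Hypothetical sketch of the overflow reallocation the reviews mention:
// if the workflow pool under-fills its reservation, the unused budget
// is handed to the job pool so the full per-tick budget stays usable.
function splitClaimBudget(dueWorkflows: number, dueJobs: number) {
  const workflowClaims = Math.min(dueWorkflows, RESERVED_WORKFLOW_CLAIMS);
  const jobBudget =
    RESERVED_JOB_CLAIMS + (RESERVED_WORKFLOW_CLAIMS - workflowClaims);
  const jobClaims = Math.min(dueJobs, jobBudget);
  return { workflowClaims, jobClaims };
}
```

With a quiet workflow pool (say 30 due workflows) and a deep job backlog, the job side can claim up to 170 items instead of stopping at its 100-slot reservation.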

Type of Change

  • Bug fix

Testing

  • bun run lint — clean
  • bun run check:api-validation:strict — passes
  • No behavior change other than the constant bump; existing claim/lock semantics unchanged.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

MAX_CRON_CLAIMS 20 -> 100; reserved workflow/job slots 10/10 -> 50/50.
Throughput was capped at 20 schedules/tick which created a 20+ hour
backlog when due work exceeded ~1 item per cron-second.

vercel Bot commented May 12, 2026

The latest updates on your projects.

1 Skipped Deployment
docs — Skipped — May 12, 2026 7:57pm (UTC)



cursor Bot commented May 12, 2026

PR Summary

Medium Risk
Low code risk (constant-only change), but raising the per-tick claim limits can increase DB locking/queue pressure and runtime per cron tick under heavy backlog.

Overview
Increases the scheduler execution endpoint’s per-cron claim budget by 10×, raising MAX_CRON_CLAIMS to 200 and expanding the reserved workflow/job claim split (workflow reservation to 100, with the remainder for jobs).

This allows each cron tick of apps/sim/app/api/schedules/execute/route.ts to dequeue and queue significantly more due work to help drain sustained scheduling backlogs, without changing the underlying claim/lock logic.

Reviewed by Cursor Bugbot for commit 4591c1b.

greptile-apps Bot (Contributor) commented May 12, 2026

Greptile Summary

This PR addresses a chronic scheduler backlog by raising MAX_CRON_CLAIMS from 20 to 100 and RESERVED_WORKFLOW_CLAIMS/RESERVED_JOB_CLAIMS from 10 to 50 each, so the picker can drain more than ~20 due items per tick.

  • MAX_CRON_CLAIMS 20 → 100; both reserved-slot constants scale proportionally (10 → 50 each), preserving the 50/50 split.
  • The overflow reallocation logic (lines 162–166), lock/claim semantics, and DB transaction patterns are all unchanged — only the ceiling values differ.
  • Workflow schedules fan out to the async job queue (fast path), but Mothership jobs are always executed inline; that pool now reaches 50 concurrent in-process executions per tick instead of 10.
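The claim path these bullets describe (row claims with a hard limit under lock) typically uses the "FOR UPDATE SKIP LOCKED" pattern shown in the sequence diagram below. Here it is as an illustrative query string; the table and column names are placeholders, not taken from the PR.

```typescript
// Illustrative claim query: competing cron ticks each lock a disjoint
// batch of due rows, skipping rows another tick already holds.
// "schedules" and "next_run_at" are hypothetical names.
function claimDueSchedules(limit: number): string {
  return `
    SELECT id FROM schedules
    WHERE next_run_at <= now()
    ORDER BY next_run_at
    LIMIT ${limit}
    FOR UPDATE SKIP LOCKED
  `;
}
```

Because SKIP LOCKED silently drops contended rows, raising the LIMIT changes only how many rows one tick drains, not the locking semantics — which matches the reviews' claim that only the ceiling values differ.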

Confidence Score: 4/5

Safe to merge for the workflow-schedule path; the inline-job path deserves a quick operational check at the new concurrency level before relying on it under full load.

The change touches two integer constants and leaves every lock, claim, transaction, and overflow-reallocation path intact. The only non-trivial risk is the 5x increase in concurrent inline job executions per tick — if those jobs are DB- or network-heavy, a busy tick could see noticeably more resource pressure than before, and there is no observability or back-pressure in place to catch a latency cliff.

apps/sim/app/api/schedules/execute/route.ts — specifically the inline job execution block (lines 316–338) and the executeJobInline implementation it calls.

Important Files Changed

Filename: apps/sim/app/api/schedules/execute/route.ts
Overview: Raises MAX_CRON_CLAIMS 20→100 and RESERVED_WORKFLOW_CLAIMS 10→50 (RESERVED_JOB_CLAIMS derived at 50); overflow budget reallocation logic and claim/lock semantics are unchanged; inline job concurrency scales 5× with no back-pressure guard.

Sequence Diagram

sequenceDiagram
    participant Cron as Cron Trigger (60s tick)
    participant Handler as /api/schedules/execute
    participant DB as Database (FOR UPDATE SKIP LOCKED)
    participant Queue as Async Job Queue
    participant Inline as executeJobInline

    Cron->>Handler: GET (authenticated)
    Handler->>DB: "claimWorkflowSchedules(limit=50)"
    DB-->>Handler: up to 50 workflow rows (locked)
    Handler->>DB: "claimJobSchedules(limit=50)"
    DB-->>Handler: up to 50 job rows (locked)
    Note over Handler: remainingClaimBudget = 100 - claimed
    par Workflow schedules up to 50
        Handler->>Queue: enqueue schedule-execution payload
        Queue-->>Handler: jobId
    and Mothership jobs up to 50 always inline
        Handler->>Inline: executeJobInline(payload) x up to 50 concurrent
        Inline-->>Handler: done
    end
    Handler-->>Cron: executedCount N

Comments Outside Diff (1)

  1. apps/sim/app/api/schedules/execute/route.ts, line 316-338 (link)

    P2 Inline job concurrency increase: 10 → 50

    Mothership jobs skip the queue and always execute inline via executeJobInline. With RESERVED_JOB_CLAIMS raised from 10 to 50, up to 50 of these can now fire concurrently inside a single tick via Promise.allSettled. If executeJobInline opens DB connections, issues HTTP calls, or holds locks, that 5× jump in concurrent in-process job work could create resource pressure — especially since the handler has a maxDuration of 3600s and there is no back-pressure mechanism here. Worth adding observability (e.g. timing the allSettled block) to catch any latency cliff before the next scale-up.

Reviews (1): Last reviewed commit: "improvement(scheduler): raise per-tick c..."

Bumps MAX_CRON_CLAIMS 100 -> 200 (workflow/job split 100/100). Pairs
with the fire-and-forget cron Lambda change so per-tick processing
time is no longer bounded by the Lambda's 50s HTTP timeout.

cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 4591c1b.

Comment thread: apps/sim/app/api/schedules/execute/route.ts
@TheodoreSpeaks TheodoreSpeaks merged commit 05892f7 into staging May 12, 2026
14 checks passed
@TheodoreSpeaks TheodoreSpeaks deleted the investigate/schedule-fail branch May 12, 2026 20:06
