improvement(scheduler): drain due schedules in chunks instead of a single capped claim #4578
TheodoreSpeaks wants to merge 2 commits into
Conversation
Replaces the fixed MAX_CRON_CLAIMS (200) with a chunked drain loop: claim WORKFLOW_CHUNK_SIZE + JOB_CHUNK_SIZE per iteration, process via Promise.allSettled, repeat until both claim queries return empty or MAX_TICK_DURATION_MS elapses. Throughput is no longer bounded by a static per-tick ceiling; it scales until DB or trigger.dev is the limit. Per-iteration chunk size still bounds row-lock set and fan-out concurrency. Extracts processScheduleItem and processJobItem so the loop body stays readable. Existing claim semantics (FOR UPDATE SKIP LOCKED, lastQueuedAt as the claim signal, staleness reclaim) are unchanged.
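As a rough illustration of the drain loop described above, here is a minimal TypeScript sketch. The helper and constant names come from the PR text; their signatures, the chunk-size values, and the surrounding wiring (requestId, return shapes) are simplified assumptions, not the route's real code.

    // Stubs standing in for the route's real helpers; names come from the PR text,
    // signatures are simplified assumptions for this sketch.
    type DueItem = { id: string }
    declare function claimWorkflowSchedules(queuedAt: Date, limit: number): Promise<DueItem[]>
    declare function claimJobSchedules(queuedAt: Date, limit: number): Promise<DueItem[]>
    declare function processScheduleItem(item: DueItem, queuedAt: Date, requestId: string): Promise<void>
    declare function processJobItem(item: DueItem, queuedAt: Date, requestId: string): Promise<void>

    const MAX_TICK_DURATION_MS = 3 * 60 * 1000
    const WORKFLOW_CHUNK_SIZE = 100 // illustrative; the real values live in the route
    const JOB_CHUNK_SIZE = 100 // illustrative

    async function drainDueSchedules(queuedAt: Date, requestId: string): Promise<void> {
      const tickStart = Date.now()

      while (Date.now() - tickStart < MAX_TICK_DURATION_MS) {
        // Claim one chunk per queue; the claim helpers rely on FOR UPDATE SKIP LOCKED,
        // so overlapping cron invocations cannot double-claim the same rows.
        const [dueSchedules, dueJobs] = await Promise.all([
          claimWorkflowSchedules(queuedAt, WORKFLOW_CHUNK_SIZE),
          claimJobSchedules(queuedAt, JOB_CHUNK_SIZE),
        ])

        if (dueSchedules.length === 0 && dueJobs.length === 0) break

        // Fan out one chunk; allSettled keeps a single failing item from aborting the rest.
        await Promise.allSettled([
          ...dueSchedules.map((s) => processScheduleItem(s, queuedAt, requestId)),
          ...dueJobs.map((j) => processJobItem(j, queuedAt, requestId)),
        ])
      }
    }

The per-iteration chunk sizes bound both the number of rows locked at once and the fan-out concurrency, which is why removing the per-tick cap does not remove backpressure.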
Cursor Bugbot PR Summary (Medium Risk). Overview: Refactors per-item handling into processScheduleItem and processJobItem. Reviewed by Cursor Bugbot for commit 1fb247b.
Greptile Summary: This PR replaces the previous fixed MAX_CRON_CLAIMS (200) cap with a chunked drain loop.
Confidence Score: 4/5. Safe to merge — the drain loop correctly replaces the static cap, claim semantics are unchanged, and concurrent invocations remain safe via SKIP LOCKED. The refactoring is logically sound and all existing tests pass. The two flagged items are optimization opportunities rather than correctness problems: the loop issues unnecessary DB calls once one queue is exhausted, and the workflowUtils non-null assertion relies on an implicit invariant that holds today but is not enforced by the type system. Important file changed: apps/sim/app/api/schedules/execute/route.ts, specifically the while-loop claim strategy and the workflowUtils! assertion.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[GET /api/schedules/execute] --> B[verifyCronAuth]
B -->|authError| Z1[Return 401]
B -->|ok| C[getJobQueue]
C --> D[tickStart = Date.now]
D --> E{Date.now - tickStart < MAX_TICK_DURATION_MS 3 min?}
E -->|No / timeout| LOG[Log summary & return 200]
E -->|Yes| F[Promise.all: claimWorkflowSchedules + claimJobSchedules]
F --> G{both empty?}
G -->|Yes| LOG
G -->|No| H[Lazy-load workflowUtils if needed]
H --> I[Promise.allSettled: processScheduleItem x N + processJobItem x M]
I --> E
subgraph processScheduleItem
S1[buildScheduleExecutionJobId] --> S2{existingJob pending/processing?}
S2 -->|Yes| S_skip[return early]
S2 -->|stale finished| S_rel[releaseScheduleLock → return]
S2 -->|null| S3[getWorkflowById → enqueue job]
S3 --> S4{just-enqueued job already finished?}
S4 -->|Yes| S_rel2[releaseScheduleLock → return]
S4 -->|No| S5{shouldExecuteInline?}
S5 -->|Yes| S6[startJob → executeScheduleJob → completeJob]
S5 -->|No| S_done[return]
S6 -->|error| S7[markJobFailed + releaseScheduleLock]
end
subgraph processJobItem
J1[build payload] --> J2[executeJobInline]
J2 -->|error| J3[releaseScheduleLock]
end
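The processScheduleItem branch of the flowchart can be read as roughly the following TypeScript. The function names appear in the diagram; the argument lists, job-status shapes, and the findJob/enqueueJob helpers are assumptions for this sketch and are simpler than the real route code.

    // Sketch of the processScheduleItem control flow shown in the flowchart above.
    // findJob and enqueueJob are invented stand-ins; everything is simplified.
    declare function buildScheduleExecutionJobId(scheduleId: string, queuedAt: Date): string
    declare function findJob(jobId: string): Promise<{ status: string } | null>
    declare function getWorkflowById(workflowId: string): Promise<unknown>
    declare function enqueueJob(jobId: string, payload: unknown): Promise<{ status: string }>
    declare function shouldExecuteInline(): boolean
    declare function startJob(jobId: string): Promise<void>
    declare function executeScheduleJob(payload: unknown): Promise<void>
    declare function completeJob(jobId: string): Promise<void>
    declare function markJobFailed(jobId: string, error: unknown): Promise<void>
    declare function releaseScheduleLock(scheduleId: string): Promise<void>

    async function processScheduleItem(schedule: { id: string; workflowId: string }, queuedAt: Date) {
      const jobId = buildScheduleExecutionJobId(schedule.id, queuedAt)
      const existingJob = await findJob(jobId)

      if (existingJob?.status === 'pending' || existingJob?.status === 'processing') {
        return // another invocation already owns this execution
      }
      if (existingJob) {
        await releaseScheduleLock(schedule.id) // stale finished job: free the claim and return
        return
      }

      const workflow = await getWorkflowById(schedule.workflowId)
      const job = await enqueueJob(jobId, { workflow, queuedAt })

      if (job.status !== 'pending' && job.status !== 'processing') {
        await releaseScheduleLock(schedule.id) // just-enqueued job already finished: release and return
        return
      }

      if (!shouldExecuteInline()) return // a background worker will pick it up

      try {
        await startJob(jobId)
        await executeScheduleJob({ workflow, queuedAt })
        await completeJob(jobId)
      } catch (error) {
        await markJobFailed(jobId, error)
        await releaseScheduleLock(schedule.id)
      }
    }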
Reviews (1): Last reviewed commit: "improvement(scheduler): drain in chunks ..."
    const [dueSchedules, dueJobs] = await Promise.all([
      claimWorkflowSchedules(queuedAt, WORKFLOW_CHUNK_SIZE),
      claimJobSchedules(queuedAt, JOB_CHUNK_SIZE),
    ])

    if (dueSchedules.length === 0 && dueJobs.length === 0) break
Unnecessary DB round-trips when one queue is drained
Both claim queries are always issued together, even after one queue has clearly been exhausted. When dueSchedules.length < WORKFLOW_CHUNK_SIZE (a partial fill), there are provably no more workflow schedules for this tick, yet the next iteration still fires claimWorkflowSchedules against the DB. In a scenario where thousands of jobs drain slowly but only a handful of workflow schedules exist, every iteration after the first makes a wasted claimWorkflowSchedules transaction. Tracking per-type exhaustion (e.g., let schedulesDrained = false) and skipping the corresponding claim would avoid these unnecessary row-lock round-trips.
Addressed in 1fb247b: each queue tracks its own exhaustion flag (set when a claim returns fewer than its CHUNK_SIZE), and subsequent iterations skip the claim query for that queue. Trade-off noted in commit: we may miss rows that become newly-due mid-tick, but they're picked up by the next 60s cron, which is fine.
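A sketch of the exhaustion-flag variant described in the reply, reusing the same assumed stub names and illustrative chunk sizes as the earlier loop sketch; only the flag logic is the point here.

    // Variant of the drain loop with per-queue exhaustion flags.
    type DueItem = { id: string }
    declare function claimWorkflowSchedules(queuedAt: Date, limit: number): Promise<DueItem[]>
    declare function claimJobSchedules(queuedAt: Date, limit: number): Promise<DueItem[]>
    declare function processChunk(schedules: DueItem[], jobs: DueItem[]): Promise<void>
    const WORKFLOW_CHUNK_SIZE = 100 // illustrative
    const JOB_CHUNK_SIZE = 100 // illustrative
    const MAX_TICK_DURATION_MS = 3 * 60 * 1000

    async function drainWithExhaustionFlags(queuedAt: Date) {
      const tickStart = Date.now()
      let schedulesDrained = false
      let jobsDrained = false

      while (Date.now() - tickStart < MAX_TICK_DURATION_MS) {
        const [dueSchedules, dueJobs] = await Promise.all([
          schedulesDrained ? Promise.resolve<DueItem[]>([]) : claimWorkflowSchedules(queuedAt, WORKFLOW_CHUNK_SIZE),
          jobsDrained ? Promise.resolve<DueItem[]>([]) : claimJobSchedules(queuedAt, JOB_CHUNK_SIZE),
        ])

        // A partial chunk means that queue has no more due rows for this tick; skip its
        // claim query from now on (rows that become due mid-tick wait for the next cron).
        if (dueSchedules.length < WORKFLOW_CHUNK_SIZE) schedulesDrained = true
        if (dueJobs.length < JOB_CHUNK_SIZE) jobsDrained = true

        if (dueSchedules.length === 0 && dueJobs.length === 0) break
        await processChunk(dueSchedules, dueJobs)
      }
    }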
    await Promise.allSettled([
      ...dueSchedules.map((schedule) =>
        processScheduleItem(schedule, queuedAt, requestId, jobQueue, workflowUtils!)
      ),
Non-null assertion on workflowUtils is an implicit invariant
workflowUtils is typed as WorkflowUtils | undefined but is asserted non-null with ! when passed to processScheduleItem. The invariant that workflowUtils is always set when dueSchedules.length > 0 holds at runtime (the await import(...) completes before Promise.allSettled is called), but TypeScript cannot verify it. If processScheduleItem is ever called from another call-site without this guarantee, it will silently receive undefined and throw at runtime. Consider narrowing the type inside the call or making workflowUtils a required argument to the loop block rather than a module-level let.
Addressed in 1fb247b: workflowUtils is captured into a local const inside the loop body and the schedule branch is only constructed when the import has completed. The non-null assertion is gone.
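One way the narrowing described in the reply can look in TypeScript: the lazily imported value is copied into a local const, and the schedule branch is only built when it is present, so no non-null assertion is needed. The WorkflowUtils shape, the jobQueue type, and the extra parameters are assumptions for this sketch.

    // Sketch of the fix: a local const lets TypeScript keep the narrowing
    // inside the closures below, and the schedule branch is only constructed
    // when the import has completed.
    type WorkflowUtils = { executeScheduleJob: (payload: unknown) => Promise<unknown> }
    declare const jobQueue: unknown
    declare function processScheduleItem(
      schedule: { id: string },
      queuedAt: Date,
      requestId: string,
      queue: unknown,
      utils: WorkflowUtils
    ): Promise<unknown>
    declare function processJobItem(job: { id: string }, queuedAt: Date, requestId: string): Promise<unknown>

    async function processChunk(
      dueSchedules: { id: string }[],
      dueJobs: { id: string }[],
      queuedAt: Date,
      requestId: string,
      workflowUtils: WorkflowUtils | undefined // set by the lazy import before any schedules are processed
    ) {
      const utils = workflowUtils // const: the truthiness check below narrows it for the closure
      await Promise.allSettled([
        ...(utils
          ? dueSchedules.map((s) => processScheduleItem(s, queuedAt, requestId, jobQueue, utils))
          : []),
        ...dueJobs.map((j) => processJobItem(j, queuedAt, requestId)),
      ])
    }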
… workflowUtils non-null assertion

Addresses Greptile review on PR #4578:
- track per-queue exhaustion when a claim returns fewer than CHUNK_SIZE rows; subsequent iterations skip the claim query for that queue. Saves one DB round-trip per iteration once one queue drains while the other is still working.
- narrow workflowUtils to a local const inside the loop body so the schedule processing branch only runs when the import has completed. Removes the misleading non-null assertion.
Summary
- Replaces the fixed MAX_CRON_CLAIMS (200) with a chunked drain loop: claim WORKFLOW_CHUNK_SIZE + JOB_CHUNK_SIZE per iteration, process via Promise.allSettled, repeat until both claim queries return empty or MAX_TICK_DURATION_MS (3 min) elapses.
- Extracts processScheduleItem and processJobItem so the loop body stays readable.
- Existing claim semantics (FOR UPDATE SKIP LOCKED, lastQueuedAt as the claim signal, staleness-window reclaim) are unchanged; see the sketch after this list. Successive cron invocations may overlap during a real drain — already safe because the lock model prevents double-claim.
- Follow-up to #4567 (the 20 → 200 bump): #4567 was a quick raise to the static cap; this PR removes the cap as a concept.
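To make the claim semantics in the list above concrete, here is one common shape of a SKIP LOCKED claim query, written against a generic node-postgres client. The table and column names are invented for this sketch (the real schema lives in the sim codebase); the only point is the pattern: lock a limited set of due rows, stamp lastQueuedAt as the claim signal, and let concurrent invocations skip rows another tick already holds.

    // Illustration of a FOR UPDATE SKIP LOCKED claim; schema names are assumptions.
    import { Pool } from 'pg'

    const pool = new Pool()

    async function claimWorkflowSchedules(queuedAt: Date, limit: number) {
      const { rows } = await pool.query(
        `UPDATE workflow_schedule
            SET last_queued_at = $1
          WHERE id IN (
                  SELECT id
                    FROM workflow_schedule
                   WHERE next_run_at <= $1
                     AND (last_queued_at IS NULL OR last_queued_at < next_run_at)
                   ORDER BY next_run_at
                   LIMIT $2
                     FOR UPDATE SKIP LOCKED
                )
          RETURNING *`,
        [queuedAt, limit]
      )
      return rows
    }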
Type of Change
Testing
- bunx vitest run app/api/schedules/execute/route.test.ts — 6/6 pass
- bun run lint — clean
- bun run check:api-validation:strict — passes
- Tests use mockResolvedValueOnce for the first iteration's claim queries; the second iteration sees the default empty mocks and the loop breaks, matching the expected executedCount.

Checklist