
improvement(scheduler): drain due schedules in chunks instead of a single capped claim#4578

Open
TheodoreSpeaks wants to merge 2 commits into staging from improvement/scheduler-chunked-pagination

Conversation

@TheodoreSpeaks
Collaborator

Summary

  • Replaces the fixed MAX_CRON_CLAIMS (200) with a chunked drain loop: claim WORKFLOW_CHUNK_SIZE + JOB_CHUNK_SIZE per iteration, process via Promise.allSettled, repeat until both claim queries return empty or MAX_TICK_DURATION_MS (3 min) elapses.
  • Per-tick throughput is no longer bounded by a static ceiling; it scales until the DB or trigger.dev is the limit. The per-iteration chunk size (100/100) still bounds the row-lock set and fan-out concurrency for each chunk.
  • Extracts processScheduleItem and processJobItem so the loop body stays readable.
  • Claim semantics (FOR UPDATE SKIP LOCKED, lastQueuedAt as the claim signal, staleness-window reclaim) are unchanged. Successive cron invocations may overlap during a real drain — already safe because the lock model prevents double-claim.

Follow-up to #4567 (the 20 → 200 bump): #4567 was a quick raise to the static cap; this PR removes the cap as a concept.
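The drain loop described above can be sketched as follows. This is a simplified illustration, not the actual route code: the claim and process functions are injected stand-ins for the real DB-backed helpers, and the constants mirror the values stated in this PR.

```typescript
// Simplified sketch of the chunked drain loop (claim/process functions are
// hypothetical stand-ins; the real ones issue FOR UPDATE SKIP LOCKED queries).
const WORKFLOW_CHUNK_SIZE = 100;
const JOB_CHUNK_SIZE = 100;
const MAX_TICK_DURATION_MS = 3 * 60 * 1000; // 3-minute wall-clock bound

type Claim<T> = (limit: number) => Promise<T[]>;

async function drainDueSchedules<S, J>(
  claimWorkflows: Claim<S>,
  claimJobs: Claim<J>,
  processSchedule: (s: S) => Promise<void>,
  processJob: (j: J) => Promise<void>
): Promise<number> {
  const tickStart = Date.now();
  let executedCount = 0;
  while (Date.now() - tickStart < MAX_TICK_DURATION_MS) {
    // Claim one chunk from each queue concurrently.
    const [dueSchedules, dueJobs] = await Promise.all([
      claimWorkflows(WORKFLOW_CHUNK_SIZE),
      claimJobs(JOB_CHUNK_SIZE),
    ]);
    // Both queues drained for this tick: stop.
    if (dueSchedules.length === 0 && dueJobs.length === 0) break;
    // allSettled isolates per-item failures from the rest of the chunk.
    await Promise.allSettled([
      ...dueSchedules.map(processSchedule),
      ...dueJobs.map(processJob),
    ]);
    executedCount += dueSchedules.length + dueJobs.length;
  }
  return executedCount;
}
```

Note how the per-iteration chunk sizes still bound the row-lock set and fan-out concurrency, while total per-tick throughput is limited only by the 3-minute wall clock.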

Type of Change

  • Improvement

Testing

  • bunx vitest run app/api/schedules/execute/route.test.ts — 6/6 pass
  • bun run lint — clean
  • bun run check:api-validation:strict — passes
  • Manual reasoning: tests use mockResolvedValueOnce for the first iteration's claim queries; second iteration sees default empty mocks and the loop breaks, matching the expected executedCount.
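The mock behavior that reasoning relies on can be illustrated with a minimal stand-in for vitest's once-queue semantics (this is an illustration, not the vitest API): one-shot values are consumed in order, after which every call resolves to the default.

```typescript
// Minimal stand-in for mockResolvedValueOnce semantics: queued one-shot
// values are returned in order; once exhausted, the default value applies.
function makeOnceMock<T>(defaultValue: T) {
  const onceQueue: T[] = [];
  const mock = {
    fn: async (): Promise<T> =>
      onceQueue.length > 0 ? (onceQueue.shift() as T) : defaultValue,
    mockResolvedValueOnce(value: T) {
      onceQueue.push(value);
      return mock;
    },
  };
  return mock;
}
```

With the claim queries mocked this way, the first loop iteration sees the queued rows, the second sees the empty default, and the loop breaks, which is exactly what the test's expected executedCount depends on.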

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Replaces the fixed MAX_CRON_CLAIMS (200) with a chunked drain loop:
claim WORKFLOW_CHUNK_SIZE + JOB_CHUNK_SIZE per iteration, process via
Promise.allSettled, repeat until both claim queries return empty or
MAX_TICK_DURATION_MS elapses. Throughput is no longer bounded by a
static per-tick ceiling; it scales until DB or trigger.dev is the
limit. Per-iteration chunk size still bounds row-lock set and fan-out
concurrency.

Extracts processScheduleItem and processJobItem so the loop body stays
readable. Existing claim semantics (FOR UPDATE SKIP LOCKED, lastQueuedAt
as the claim signal, staleness reclaim) are unchanged.
@vercel

vercel Bot commented May 13, 2026

The latest updates on your projects.

1 Skipped Deployment

  • docs: deployment Skipped (updated May 14, 2026 1:44am UTC)


@cursor

cursor Bot commented May 13, 2026

PR Summary

Medium Risk
Changes the cron execution handler from a single capped claim to a time-bounded draining loop, which can increase per-tick DB/queue load and concurrency. While claim semantics are unchanged, the new iterative behavior could surface performance or locking edge cases under heavy backlog.

Overview
Removes the single-run MAX_CRON_CLAIMS cap and replaces it with a chunked drain loop that repeatedly claims up to WORKFLOW_CHUNK_SIZE + JOB_CHUNK_SIZE items per iteration and processes each chunk via Promise.allSettled until no more work is found or MAX_TICK_DURATION_MS elapses.

Refactors per-item handling into processScheduleItem and processJobItem, keeping existing job de-dupe/stale-claim release behavior while improving logging/metrics to report per-iteration and total processed counts.

Reviewed by Cursor Bugbot for commit 1fb247b.

@greptile-apps
Contributor

greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR replaces the previous fixed MAX_CRON_CLAIMS ceiling (200 rows per invocation) with a chunked drain loop that repeatedly claims WORKFLOW_CHUNK_SIZE (100) workflow schedules and JOB_CHUNK_SIZE (100) job schedules per iteration, continuing until both queues are empty or MAX_TICK_DURATION_MS (3 minutes) elapses. The processing helpers processScheduleItem and processJobItem are extracted to keep the loop body readable, and all claim semantics (FOR UPDATE SKIP LOCKED, lastQueuedAt as the claim signal, staleness-window reclaim) are preserved.

  • Chunked drain loop: both claim queries now run concurrently via Promise.all each iteration, and results are processed with Promise.allSettled; successive cron invocations may overlap safely because SKIP LOCKED prevents double-claim.
  • Lazy workflowUtils import: the dynamic import is hoisted to a module-level let and loaded at most once per invocation, regardless of iteration count.
  • No cap on per-tick throughput: the loop drains until empty or the 3-minute wall-clock limit, so backlogs of any size are handled without touching the constants.

Confidence Score: 4/5

Safe to merge — the drain loop correctly replaces the static cap, claim semantics are unchanged, and concurrent invocations remain safe via SKIP LOCKED.

The refactoring is logically sound and all existing tests pass. The two flagged items are optimization opportunities rather than correctness problems: the loop issues unnecessary DB calls once one queue is exhausted, and the workflowUtils non-null assertion relies on an implicit invariant that holds today but is not enforced by the type system.

apps/sim/app/api/schedules/execute/route.ts — specifically the while-loop claim strategy and the workflowUtils! assertion.

Important Files Changed

  • apps/sim/app/api/schedules/execute/route.ts: Replaces the static MAX_CRON_CLAIMS cap with a chunked drain loop (WORKFLOW_CHUNK_SIZE=100, JOB_CHUNK_SIZE=100) bounded by MAX_TICK_DURATION_MS (3 min); extracts processScheduleItem and processJobItem helpers. Core claim semantics and error handling are preserved. Two style-level concerns: unnecessary DB round-trips when one queue drains before the other, and a non-null assertion on workflowUtils that relies on an implicit invariant TypeScript cannot verify.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[GET /api/schedules/execute] --> B[verifyCronAuth]
    B -->|authError| Z1[Return 401]
    B -->|ok| C[getJobQueue]
    C --> D[tickStart = Date.now]
    D --> E{Date.now - tickStart < MAX_TICK_DURATION_MS 3 min?}
    E -->|No / timeout| LOG[Log summary & return 200]
    E -->|Yes| F[Promise.all: claimWorkflowSchedules + claimJobSchedules]
    F --> G{both empty?}
    G -->|Yes| LOG
    G -->|No| H[Lazy-load workflowUtils if needed]
    H --> I[Promise.allSettled: processScheduleItem x N + processJobItem x M]
    I --> E

    subgraph processScheduleItem
        S1[buildScheduleExecutionJobId] --> S2{existingJob pending/processing?}
        S2 -->|Yes| S_skip[return early]
        S2 -->|stale finished| S_rel[releaseScheduleLock → return]
        S2 -->|null| S3[getWorkflowById → enqueue job]
        S3 --> S4{just-enqueued job already finished?}
        S4 -->|Yes| S_rel2[releaseScheduleLock → return]
        S4 -->|No| S5{shouldExecuteInline?}
        S5 -->|Yes| S6[startJob → executeScheduleJob → completeJob]
        S5 -->|No| S_done[return]
        S6 -->|error| S7[markJobFailed + releaseScheduleLock]
    end

    subgraph processJobItem
        J1[build payload] --> J2[executeJobInline]
        J2 -->|error| J3[releaseScheduleLock]
    end
```

Reviews (1): Last reviewed commit: "improvement(scheduler): drain in chunks ..."

Comment on lines +336 to +341

```typescript
const [dueSchedules, dueJobs] = await Promise.all([
  claimWorkflowSchedules(queuedAt, WORKFLOW_CHUNK_SIZE),
  claimJobSchedules(queuedAt, JOB_CHUNK_SIZE),
])

if (dueSchedules.length === 0 && dueJobs.length === 0) break
```
Contributor


P2 Unnecessary DB round-trips when one queue is drained

Both claim queries are always issued together, even after one queue has clearly been exhausted. When dueSchedules.length < WORKFLOW_CHUNK_SIZE (a partial fill), there are provably no more workflow schedules for this tick, yet the next iteration still fires claimWorkflowSchedules against the DB. In a scenario where thousands of jobs drain slowly but only a handful of workflow schedules exist, every iteration after the first makes a wasted claimWorkflowSchedules transaction. Tracking per-type exhaustion (e.g., let schedulesDrained = false) and skipping the corresponding claim would avoid these unnecessary row-lock round-trips.

Collaborator Author


Addressed in 1fb247b: each queue tracks its own exhaustion flag (set when a claim returns fewer than its CHUNK_SIZE), and subsequent iterations skip the claim query for that queue. Trade-off noted in commit: we may miss rows that become newly-due mid-tick, but they're picked up by the next 60s cron, which is fine.
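The per-queue exhaustion tracking described in that fix can be sketched like this. It is a simplified illustration (flag names and the omission of the wall-clock bound are assumptions, not the actual 1fb247b code): a claim that returns fewer rows than its chunk size proves the queue is empty for this tick, so later iterations skip that claim query.

```typescript
// Sketch of per-queue exhaustion flags: once a claim returns a partial fill,
// that queue's claim query is skipped for the rest of the tick.
// (The real loop is also bounded by MAX_TICK_DURATION_MS, omitted here.)
async function drainWithExhaustion(
  claimWorkflows: (limit: number) => Promise<unknown[]>,
  claimJobs: (limit: number) => Promise<unknown[]>,
  chunkSize = 100
): Promise<number> {
  let schedulesDrained = false;
  let jobsDrained = false;
  let processed = 0;
  while (!schedulesDrained || !jobsDrained) {
    const [dueSchedules, dueJobs] = await Promise.all([
      schedulesDrained ? Promise.resolve<unknown[]>([]) : claimWorkflows(chunkSize),
      jobsDrained ? Promise.resolve<unknown[]>([]) : claimJobs(chunkSize),
    ]);
    // A partial fill proves no more due rows exist in that queue this tick.
    if (dueSchedules.length < chunkSize) schedulesDrained = true;
    if (dueJobs.length < chunkSize) jobsDrained = true;
    processed += dueSchedules.length + dueJobs.length;
  }
  return processed;
}
```

With 10 workflow schedules and 250 jobs, the workflow claim runs exactly once while the job claim runs three times, instead of three wasted workflow claims under the original loop.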

Comment on lines +355 to +358

```typescript
await Promise.allSettled([
  ...dueSchedules.map((schedule) =>
    processScheduleItem(schedule, queuedAt, requestId, jobQueue, workflowUtils!)
  ),
```
Contributor


P2 Non-null assertion on workflowUtils is an implicit invariant

workflowUtils is typed as WorkflowUtils | undefined but is asserted non-null with ! when passed to processScheduleItem. The invariant that workflowUtils is always set when dueSchedules.length > 0 holds at runtime (the await import(...) completes before Promise.allSettled is called), but TypeScript cannot verify it. If processScheduleItem is ever called from another call-site without this guarantee, it will silently receive undefined and throw at runtime. Consider narrowing the type inside the call or making workflowUtils a required argument to the loop block rather than a module-level let.

Collaborator Author


Addressed in 1fb247b: workflowUtils is captured into a local const inside the loop body and the schedule branch is only constructed when the import has completed. The non-null assertion is gone.
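The narrowing pattern behind that fix can be illustrated in miniature. This is a generic sketch, not the route's actual code: the `WorkflowUtils` type and inline loader stand in for the real `await import(...)` of the workflow helpers.

```typescript
// Pattern: a module-level `let` holds the lazily loaded module; copying it
// into a local const lets TypeScript narrow the union, so no `!` is needed.
type WorkflowUtils = { getWorkflowById: (id: string) => string };

let workflowUtils: WorkflowUtils | undefined;

async function processTick(dueSchedules: string[]): Promise<string[]> {
  if (dueSchedules.length > 0 && !workflowUtils) {
    // Stand-in for the one-time dynamic import.
    workflowUtils = { getWorkflowById: (id) => `workflow:${id}` };
  }
  // Local const: after this guard, `utils` is WorkflowUtils, not `... | undefined`.
  const utils = workflowUtils;
  if (!utils) return [];
  const results = await Promise.allSettled(
    dueSchedules.map(async (id) => utils.getWorkflowById(id))
  );
  return results.map((r) => (r.status === "fulfilled" ? r.value : "error"));
}
```

Because the schedule branch closes over the narrowed `utils` const, any future call site that reaches it without the import completing fails the explicit guard instead of silently passing `undefined` through a `!` assertion.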

… workflowUtils non-null assertion

Addresses Greptile review on PR #4578:
- track per-queue exhaustion when a claim returns fewer than CHUNK_SIZE
  rows; subsequent iterations skip the claim query for that queue. Saves
  one DB round-trip per iteration once one queue drains while the other
  is still working.
- narrow workflowUtils to a local const inside the loop body so the
  schedule processing branch only runs when the import has completed.
  Removes the misleading non-null assertion.