
improvement(scheduler): drain due schedules in chunks instead of a single capped claim#4578

Open
TheodoreSpeaks wants to merge 2 commits into staging from improvement/scheduler-chunked-pagination

Conversation

@TheodoreSpeaks
Collaborator

Summary

  • Replaces the fixed MAX_CRON_CLAIMS (200) with a chunked drain loop: claim WORKFLOW_CHUNK_SIZE + JOB_CHUNK_SIZE per iteration, process via Promise.allSettled, repeat until both claim queries return empty or MAX_TICK_DURATION_MS (3 min) elapses.
  • Per-tick throughput is no longer bounded by a static ceiling; it scales until the DB or trigger.dev is the limit. The per-iteration chunk size (100/100) still bounds the row-lock set and fan-out concurrency for each chunk.
  • Extracts processScheduleItem and processJobItem so the loop body stays readable.
  • Claim semantics (FOR UPDATE SKIP LOCKED, lastQueuedAt as the claim signal, staleness-window reclaim) are unchanged. Successive cron invocations may overlap during a real drain — already safe because the lock model prevents double-claim.

Follow-up to #4567 (the 20 → 200 bump): #4567 was a quick raise to the static cap; this PR removes the cap as a concept.
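The drain loop described above can be sketched as follows. This is a simplified illustration, not the actual route code: the claim and process functions are injected stand-ins for the real DB-backed helpers, and the constants mirror the values stated in this PR.

```typescript
// Simplified sketch of the chunked drain loop (claim/process functions are
// hypothetical stand-ins; the real ones issue FOR UPDATE SKIP LOCKED queries).
const WORKFLOW_CHUNK_SIZE = 100;
const JOB_CHUNK_SIZE = 100;
const MAX_TICK_DURATION_MS = 3 * 60 * 1000; // 3-minute wall-clock bound

type Claim<T> = (limit: number) => Promise<T[]>;

async function drainDueSchedules<S, J>(
  claimWorkflows: Claim<S>,
  claimJobs: Claim<J>,
  processSchedule: (s: S) => Promise<void>,
  processJob: (j: J) => Promise<void>
): Promise<number> {
  const tickStart = Date.now();
  let executedCount = 0;
  while (Date.now() - tickStart < MAX_TICK_DURATION_MS) {
    // Claim one chunk from each queue concurrently.
    const [dueSchedules, dueJobs] = await Promise.all([
      claimWorkflows(WORKFLOW_CHUNK_SIZE),
      claimJobs(JOB_CHUNK_SIZE),
    ]);
    // Both queues drained for this tick: stop.
    if (dueSchedules.length === 0 && dueJobs.length === 0) break;
    // allSettled isolates per-item failures from the rest of the chunk.
    await Promise.allSettled([
      ...dueSchedules.map(processSchedule),
      ...dueJobs.map(processJob),
    ]);
    executedCount += dueSchedules.length + dueJobs.length;
  }
  return executedCount;
}
```

Note how the per-iteration chunk sizes still bound the row-lock set and fan-out concurrency, while total per-tick throughput is limited only by the 3-minute wall clock.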

Type of Change

  • Improvement

Testing

  • bunx vitest run app/api/schedules/execute/route.test.ts — 6/6 pass
  • bun run lint — clean
  • bun run check:api-validation:strict — passes
  • Manual reasoning: tests use mockResolvedValueOnce for the first iteration's claim queries; second iteration sees default empty mocks and the loop breaks, matching the expected executedCount.
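The mock behavior that reasoning relies on can be illustrated with a minimal stand-in for vitest's once-queue semantics (this is an illustration, not the vitest API): one-shot values are consumed in order, after which every call resolves to the default.

```typescript
// Minimal stand-in for mockResolvedValueOnce semantics: queued one-shot
// values are returned in order; once exhausted, the default value applies.
function makeOnceMock<T>(defaultValue: T) {
  const onceQueue: T[] = [];
  const mock = {
    fn: async (): Promise<T> =>
      onceQueue.length > 0 ? (onceQueue.shift() as T) : defaultValue,
    mockResolvedValueOnce(value: T) {
      onceQueue.push(value);
      return mock;
    },
  };
  return mock;
}
```

With the claim queries mocked this way, the first loop iteration sees the queued rows, the second sees the empty default, and the loop breaks, which is exactly what the test's expected executedCount depends on.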

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Replaces the fixed MAX_CRON_CLAIMS (200) with a chunked drain loop:
claim WORKFLOW_CHUNK_SIZE + JOB_CHUNK_SIZE per iteration, process via
Promise.allSettled, repeat until both claim queries return empty or
MAX_TICK_DURATION_MS elapses. Throughput is no longer bounded by a
static per-tick ceiling; it scales until DB or trigger.dev is the
limit. Per-iteration chunk size still bounds row-lock set and fan-out
concurrency.

Extracts processScheduleItem and processJobItem so the loop body stays
readable. Existing claim semantics (FOR UPDATE SKIP LOCKED, lastQueuedAt
as the claim signal, staleness reclaim) are unchanged.
@vercel

vercel Bot commented May 13, 2026

The latest updates on your projects.

1 Skipped Deployment

  • docs: deployment Skipped (updated May 14, 2026 1:44am UTC)


@cursor

cursor Bot commented May 13, 2026

PR Summary

Medium Risk
Changes the cron execution handler from a single capped claim to a time-bounded draining loop, which can increase per-tick DB/queue load and concurrency. While claim semantics are unchanged, the new iterative behavior could surface performance or locking edge cases under heavy backlog.

Overview
Removes the single-run MAX_CRON_CLAIMS cap and replaces it with a chunked drain loop that repeatedly claims up to WORKFLOW_CHUNK_SIZE + JOB_CHUNK_SIZE items per iteration and processes each chunk via Promise.allSettled until no more work is found or MAX_TICK_DURATION_MS elapses.

Refactors per-item handling into processScheduleItem and processJobItem, keeping existing job de-dupe/stale-claim release behavior while improving logging/metrics to report per-iteration and total processed counts.

Reviewed by Cursor Bugbot for commit 1fb247b.

@greptile-apps
Contributor

greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR replaces the previous fixed MAX_CRON_CLAIMS ceiling (200 rows per invocation) with a chunked drain loop that repeatedly claims WORKFLOW_CHUNK_SIZE (100) workflow schedules and JOB_CHUNK_SIZE (100) job schedules per iteration, continuing until both queues are empty or MAX_TICK_DURATION_MS (3 minutes) elapses. The processing helpers processScheduleItem and processJobItem are extracted to keep the loop body readable, and all claim semantics (FOR UPDATE SKIP LOCKED, lastQueuedAt as the claim signal, staleness-window reclaim) are preserved.

  • Chunked drain loop: both claim queries now run concurrently via Promise.all each iteration, and results are processed with Promise.allSettled; successive cron invocations may overlap safely because SKIP LOCKED prevents double-claim.
  • Lazy workflowUtils import: the dynamic import is hoisted to a module-level let and loaded at most once per invocation, regardless of iteration count.
  • No cap on per-tick throughput: the loop drains until empty or the 3-minute wall-clock limit, so backlogs of any size are handled without touching the constants.

Confidence Score: 4/5

Safe to merge — the drain loop correctly replaces the static cap, claim semantics are unchanged, and concurrent invocations remain safe via SKIP LOCKED.

The refactoring is logically sound and all existing tests pass. The two flagged items are optimization opportunities rather than correctness problems: the loop issues unnecessary DB calls once one queue is exhausted, and the workflowUtils non-null assertion relies on an implicit invariant that holds today but is not enforced by the type system.

apps/sim/app/api/schedules/execute/route.ts — specifically the while-loop claim strategy and the workflowUtils! assertion.

Important Files Changed

  • apps/sim/app/api/schedules/execute/route.ts: Replaces the static MAX_CRON_CLAIMS cap with a chunked drain loop (WORKFLOW_CHUNK_SIZE=100, JOB_CHUNK_SIZE=100) bounded by MAX_TICK_DURATION_MS (3 min); extracts processScheduleItem and processJobItem helpers. Core claim semantics and error handling are preserved. Two style-level concerns: unnecessary DB round-trips when one queue drains before the other, and a non-null assertion on workflowUtils that relies on an implicit invariant TypeScript cannot verify.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[GET /api/schedules/execute] --> B[verifyCronAuth]
    B -->|authError| Z1[Return 401]
    B -->|ok| C[getJobQueue]
    C --> D[tickStart = Date.now]
    D --> E{Date.now - tickStart < MAX_TICK_DURATION_MS 3 min?}
    E -->|No / timeout| LOG[Log summary & return 200]
    E -->|Yes| F[Promise.all: claimWorkflowSchedules + claimJobSchedules]
    F --> G{both empty?}
    G -->|Yes| LOG
    G -->|No| H[Lazy-load workflowUtils if needed]
    H --> I[Promise.allSettled: processScheduleItem x N + processJobItem x M]
    I --> E

    subgraph processScheduleItem
        S1[buildScheduleExecutionJobId] --> S2{existingJob pending/processing?}
        S2 -->|Yes| S_skip[return early]
        S2 -->|stale finished| S_rel[releaseScheduleLock → return]
        S2 -->|null| S3[getWorkflowById → enqueue job]
        S3 --> S4{just-enqueued job already finished?}
        S4 -->|Yes| S_rel2[releaseScheduleLock → return]
        S4 -->|No| S5{shouldExecuteInline?}
        S5 -->|Yes| S6[startJob → executeScheduleJob → completeJob]
        S5 -->|No| S_done[return]
        S6 -->|error| S7[markJobFailed + releaseScheduleLock]
    end

    subgraph processJobItem
        J1[build payload] --> J2[executeJobInline]
        J2 -->|error| J3[releaseScheduleLock]
    end
```

Reviews (1): Last reviewed commit: "improvement(scheduler): drain in chunks ..."

Comment on lines +336 to +341

```typescript
const [dueSchedules, dueJobs] = await Promise.all([
  claimWorkflowSchedules(queuedAt, WORKFLOW_CHUNK_SIZE),
  claimJobSchedules(queuedAt, JOB_CHUNK_SIZE),
])

if (dueSchedules.length === 0 && dueJobs.length === 0) break
```
Contributor


P2 Unnecessary DB round-trips when one queue is drained

Both claim queries are always issued together, even after one queue has clearly been exhausted. When dueSchedules.length < WORKFLOW_CHUNK_SIZE (a partial fill), there are provably no more workflow schedules for this tick, yet the next iteration still fires claimWorkflowSchedules against the DB. In a scenario where thousands of jobs drain slowly but only a handful of workflow schedules exist, every iteration after the first makes a wasted claimWorkflowSchedules transaction. Tracking per-type exhaustion (e.g., let schedulesDrained = false) and skipping the corresponding claim would avoid these unnecessary row-lock round-trips.

Collaborator Author


Addressed in 1fb247b: each queue tracks its own exhaustion flag (set when a claim returns fewer than its CHUNK_SIZE), and subsequent iterations skip the claim query for that queue. Trade-off noted in commit: we may miss rows that become newly-due mid-tick, but they're picked up by the next 60s cron, which is fine.
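The per-queue exhaustion tracking described in that fix can be sketched like this. It is a simplified illustration (flag names and the omission of the wall-clock bound are assumptions, not the actual 1fb247b code): a claim that returns fewer rows than its chunk size proves the queue is empty for this tick, so later iterations skip that claim query.

```typescript
// Sketch of per-queue exhaustion flags: once a claim returns a partial fill,
// that queue's claim query is skipped for the rest of the tick.
// (The real loop is also bounded by MAX_TICK_DURATION_MS, omitted here.)
async function drainWithExhaustion(
  claimWorkflows: (limit: number) => Promise<unknown[]>,
  claimJobs: (limit: number) => Promise<unknown[]>,
  chunkSize = 100
): Promise<number> {
  let schedulesDrained = false;
  let jobsDrained = false;
  let processed = 0;
  while (!schedulesDrained || !jobsDrained) {
    const [dueSchedules, dueJobs] = await Promise.all([
      schedulesDrained ? Promise.resolve<unknown[]>([]) : claimWorkflows(chunkSize),
      jobsDrained ? Promise.resolve<unknown[]>([]) : claimJobs(chunkSize),
    ]);
    // A partial fill proves no more due rows exist in that queue this tick.
    if (dueSchedules.length < chunkSize) schedulesDrained = true;
    if (dueJobs.length < chunkSize) jobsDrained = true;
    processed += dueSchedules.length + dueJobs.length;
  }
  return processed;
}
```

With 10 workflow schedules and 250 jobs, the workflow claim runs exactly once while the job claim runs three times, instead of three wasted workflow claims under the original loop.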

Comment on lines +355 to +358

```typescript
await Promise.allSettled([
  ...dueSchedules.map((schedule) =>
    processScheduleItem(schedule, queuedAt, requestId, jobQueue, workflowUtils!)
  ),
```
Contributor


P2 Non-null assertion on workflowUtils is an implicit invariant

workflowUtils is typed as WorkflowUtils | undefined but is asserted non-null with ! when passed to processScheduleItem. The invariant that workflowUtils is always set when dueSchedules.length > 0 holds at runtime (the await import(...) completes before Promise.allSettled is called), but TypeScript cannot verify it. If processScheduleItem is ever called from another call-site without this guarantee, it will silently receive undefined and throw at runtime. Consider narrowing the type inside the call or making workflowUtils a required argument to the loop block rather than a module-level let.

Collaborator Author


Addressed in 1fb247b: workflowUtils is captured into a local const inside the loop body and the schedule branch is only constructed when the import has completed. The non-null assertion is gone.
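The narrowing pattern behind that fix can be illustrated in miniature. This is a generic sketch, not the route's actual code: the `WorkflowUtils` type and inline loader stand in for the real `await import(...)` of the workflow helpers.

```typescript
// Pattern: a module-level `let` holds the lazily loaded module; copying it
// into a local const lets TypeScript narrow the union, so no `!` is needed.
type WorkflowUtils = { getWorkflowById: (id: string) => string };

let workflowUtils: WorkflowUtils | undefined;

async function processTick(dueSchedules: string[]): Promise<string[]> {
  if (dueSchedules.length > 0 && !workflowUtils) {
    // Stand-in for the one-time dynamic import.
    workflowUtils = { getWorkflowById: (id) => `workflow:${id}` };
  }
  // Local const: after this guard, `utils` is WorkflowUtils, not `... | undefined`.
  const utils = workflowUtils;
  if (!utils) return [];
  const results = await Promise.allSettled(
    dueSchedules.map(async (id) => utils.getWorkflowById(id))
  );
  return results.map((r) => (r.status === "fulfilled" ? r.value : "error"));
}
```

Because the schedule branch closes over the narrowed `utils` const, any future call site that reaches it without the import completing fails the explicit guard instead of silently passing `undefined` through a `!` assertion.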

… workflowUtils non-null assertion

Addresses Greptile review on PR #4578:
- track per-queue exhaustion when a claim returns fewer than CHUNK_SIZE
  rows; subsequent iterations skip the claim query for that queue. Saves
  one DB round-trip per iteration once one queue drains while the other
  is still working.
- narrow workflowUtils to a local const inside the loop body so the
  schedule processing branch only runs when the import has completed.
  Removes the misleading non-null assertion.