fix(webapp): auto-recover replication services after stream errors #3613
ericallam wants to merge 2 commits
Conversation
Walkthrough

This PR adds configurable error recovery for the runs and sessions replication services. When a logical replication stream fails (e.g., during a database failover), the system can reconnect with exponential backoff, exit to let an external supervisor restart the host, or remain stopped with logging. Environment variables control per-service strategy selection and tuning. The implementation integrates into both services' lifecycle (on error, stream start, and shutdown) and is validated through containerized integration tests that force replication stream failures.
When the underlying logical-replication client errored (e.g. after a Postgres failover), the runs and sessions replication services logged the error and left the stream stopped. The host process kept running, the WAL backed up, and ClickHouse silently fell behind.

Both services now run a configurable recovery strategy on stream errors, defaulting to in-process reconnect with exponential backoff so a fresh self-hosted setup heals on its own:

- "reconnect" (default) re-subscribes via the existing subscribe(lastLsn) path with exponential backoff (1s -> 60s cap, unlimited attempts; sketched below), which re-validates the publication, re-acquires the leader lock, and resumes from the last acknowledged LSN.
- "exit" calls process.exit after a short flush window so a host's supervisor (Docker restart=always, systemd, k8s, etc.) can replace the process.
- "log" preserves the historical behaviour.

Per-service strategy + exit knobs are env-driven via RUN_REPLICATION_ERROR_STRATEGY / SESSION_REPLICATION_ERROR_STRATEGY plus matching *_EXIT_DELAY_MS / *_EXIT_CODE. Reconnect tuning is shared across both services via REPLICATION_RECONNECT_INITIAL_DELAY_MS / _MAX_DELAY_MS / _MAX_ATTEMPTS (0 = unlimited).
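For illustration, a minimal sketch of the backoff shape described above; the helper names and exact curve are assumptions, not the PR's actual code:

```ts
// Illustrative sketch only (assumed names): exponential backoff with the
// documented defaults (1s initial delay, 60s cap, maxAttempts = 0 meaning
// retry forever).
function nextReconnectDelayMs(
  attempt: number, // 1-based attempt counter
  initialDelayMs = 1_000,
  maxDelayMs = 60_000
): number {
  // 1s, 2s, 4s, 8s, ... capped at 60s
  return Math.min(initialDelayMs * 2 ** (attempt - 1), maxDelayMs);
}

function shouldRetry(attempt: number, maxAttempts = 0): boolean {
  return maxAttempts === 0 || attempt <= maxAttempts; // 0 = unlimited
}
```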
Addresses PR review feedback:
- LogicalReplicationClient.subscribe() can throw before its internal
"error" listener is wired up (notably when pg client.connect() fails
mid-failover). The reconnect strategy's catch block only logged, so
recovery silently stopped. Now also calls scheduleReconnect(err) — the
pendingReconnect guard makes it idempotent if an error event was also
emitted.
- Reject negative values for the new replication-recovery env vars and
cap exit codes at 255 (a sketch follows this list).
- Convert the new ReplicationErrorRecovery{Deps,} interfaces to type
aliases to match the repo's TypeScript style.
- Tighten the reconnect dep comment to drop a stale "lastAcknowledgedLsn"
reference (the wrapper-tracked resume LSN is what callers actually pass).
- Restore process.exit after service.shutdown() in the exit-strategy
test so a delayed exit timer can't terminate the test worker.
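For the second bullet, a minimal sketch of the validation rule, assuming a hand-rolled parser rather than the repo's actual env schema:

```ts
// Illustrative sketch only: reject negative values and cap exit codes
// at 255, as described in the commit message above.
function parseNonNegativeInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`expected a non-negative integer, got "${raw}"`);
  }
  return n;
}

function parseExitCode(raw: string | undefined, fallback = 1): number {
  // POSIX exit statuses carry only 8 bits, so values above 255 are capped.
  return Math.min(parseNonNegativeInt(raw, fallback), 255);
}
```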
Force-pushed from 6f8cc24 to 5ba46ff
```ts
pendingReconnect = setTimeout(async () => {
  pendingReconnect = null;
  if (isShuttingDown()) return;

  try {
    await reconnect();
    // Success path is handled by notifyStreamStarted, which fires from
    // the replication client's "start" event after the stream is live.
  } catch (err) {
    // subscribe() can throw without first emitting an "error" event —
    // notably when the initial pg client.connect() fails because Postgres
    // is still unreachable mid-failover. Schedule the next attempt
    // ourselves so recovery doesn't silently stop. If subscribe() did
    // also emit an "error" event, handle() will call scheduleReconnect()
    // first; the guard on pendingReconnect makes this idempotent.
    logger.error("Replication reconnect attempt failed", {
      attempt,
      error: err,
    });
    scheduleReconnect(err);
  }
}, delay);
```
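For context, the pendingReconnect guard the comments above lean on might look roughly like this (a sketch with assumed names, not the actual replicationErrorRecovery.server.ts code):

```ts
// Illustrative sketch only: at most one reconnect timer is ever pending,
// so calling scheduleReconnect twice for the same failure (once from the
// catch block, once from an "error" event) schedules a single attempt.
let pendingReconnect: NodeJS.Timeout | null = null;
let attempt = 0;

function scheduleReconnect(_lastError: unknown): void {
  if (pendingReconnect) return; // already scheduled: idempotent
  attempt += 1;
  const delay = Math.min(1_000 * 2 ** (attempt - 1), 60_000);
  pendingReconnect = setTimeout(() => {
    pendingReconnect = null;
    // ...run the reconnect attempt, as in the excerpt above
  }, delay);
}
```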
🚩 Reconnect silently stalls if subscribe() fails to acquire the leader lock
When the reconnect callback calls this._replicationClient.subscribe(...) (replicationErrorRecovery.server.ts:90), the subscribe() method in client.ts:240 may fail to acquire the leader lock (lines 254-258). In that case, it emits leaderElection(false), calls this.stop(), and returns — without throwing and without emitting an error event. From the recovery module's perspective, the reconnect() promise resolved successfully, so it does not schedule another attempt. But notifyStreamStarted() will never fire either (no start event is emitted), so the attempt counter is never reset. The stream is now permanently dead with no further recovery.
In practice this would only occur in multi-replica deployments where another instance wins the leader lock during the reconnect window. Since the other instance is handling replication, this may be acceptable — but the current instance has no path to ever resume if the other instance later dies. This is a design limitation rather than a clear bug, but worth understanding.
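To make the stall concrete, a hypothetical sketch of where a mitigation could hook in; the leaderElection event name comes from the comment above, and the emitter shape is an assumption:

```ts
// Hypothetical only: not something this PR implements. If subscribe()
// loses the leader election it stops quietly: no throw, no "error" event,
// no "start" event, so the recovery module sees a resolved promise and
// never retries. A listener like this is one conceivable place to react.
// The emitter shape below is assumed for illustration.
declare const replicationClient: {
  on(event: "leaderElection", cb: (isLeader: boolean) => void): void;
};

replicationClient.on("leaderElection", (isLeader: boolean) => {
  if (!isLeader) {
    // The stream is stopped but recovery thinks it is fine. A delayed
    // re-check or lock retry here would give this instance a path back
    // if the winning instance later dies.
  }
});
```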
Summary
When the logical-replication stream errored (most commonly after a Postgres failover), the runs and sessions replication services logged the error and left the underlying client stopped. The host process kept running, the WAL backed up, and ClickHouse silently fell behind.
Fix
Both services now run a configurable recovery strategy on stream errors, defaulting to in-process reconnect with exponential backoff so a fresh self-hosted setup heals on its own.
- `reconnect` (default) — re-subscribes with exponential backoff (1s → 60s cap, unlimited attempts). `LogicalReplicationClient.subscribe(lastLsn)` re-validates the publication, re-acquires the leader lock, and resumes from the last acknowledged LSN.
- `exit` — calls `process.exit(1)` after a short flush window so a host supervisor (Docker `restart=always`, systemd, k8s) can replace the process.
- `log` — preserves the old behaviour.

Per-service strategy + exit knobs are env-driven (`RUN_REPLICATION_ERROR_STRATEGY` / `SESSION_REPLICATION_ERROR_STRATEGY` plus `*_EXIT_DELAY_MS`, `*_EXIT_CODE`). Reconnect tuning is shared across both services (`REPLICATION_RECONNECT_INITIAL_DELAY_MS`, `_MAX_DELAY_MS`, `_MAX_ATTEMPTS`; `MAX_ATTEMPTS=0` means unlimited).
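A sketch of how the dispatch over those three strategies might look; the function shape and knob names are illustrative, not the PR's actual code:

```ts
type ErrorStrategy = "reconnect" | "exit" | "log";

// Illustrative sketch only: assumed knobs mirroring the env vars above.
const exitDelayMs = 5_000; // *_EXIT_DELAY_MS
const exitCode = 1; // *_EXIT_CODE

function onStreamError(
  strategy: ErrorStrategy,
  err: unknown,
  scheduleReconnect: (err: unknown) => void
): void {
  switch (strategy) {
    case "reconnect":
      scheduleReconnect(err); // backoff, then subscribe(lastLsn)
      return;
    case "exit":
      // Give in-flight batches a short flush window, then let the
      // supervisor (Docker, systemd, k8s) restart the process.
      setTimeout(() => process.exit(exitCode), exitDelayMs);
      return;
    case "log":
      console.error("Replication stream error", err); // historical behaviour
      return;
  }
}
```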
Test plan

Integration tests cover all three strategies by simulating a failover with `pg_terminate_backend` against the WAL sender (a sketch follows the list):

- `reconnect` — kill the backend, insert a new row, assert it lands in ClickHouse
- `exit` — kill the backend, assert `process.exit(1)` is called
- `log` — kill the backend, insert a new row, assert it does not land in ClickHouse

`pnpm --filter webapp test --run runsReplicationService.errorRecovery`
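For reference, forcing the failure might look like the sketch below; the slot-targeting query and helper name are assumptions, not the tests' actual code:

```ts
import { Client } from "pg";

// Illustrative sketch only: kill the WAL sender serving a replication
// slot so the logical-replication stream errors, as the tests above do
// with pg_terminate_backend. Slot name and connection string are
// placeholders.
async function simulateFailover(connectionString: string, slotName: string) {
  const admin = new Client({ connectionString });
  await admin.connect();
  try {
    // pg_replication_slots.active_pid is the PID of the WAL sender
    // currently streaming from the slot (NULL when inactive).
    await admin.query(
      `SELECT pg_terminate_backend(active_pid)
         FROM pg_replication_slots
        WHERE slot_name = $1 AND active_pid IS NOT NULL`,
      [slotName]
    );
  } finally {
    await admin.end();
  }
}
```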