Skip to content

fix(mothership): reconcile stuck conversation_id against Redis lock to clear stuck-yellow task tiles#4556

Merged
icecrasher321 merged 5 commits into
stagingfrom
waleedlatif1/task-yellow-bug
May 13, 2026
Merged

fix(mothership): reconcile stuck conversation_id against Redis lock to clear stuck-yellow task tiles#4556
icecrasher321 merged 5 commits into
stagingfrom
waleedlatif1/task-yellow-bug

Conversation

@waleedlatif1
Copy link
Copy Markdown
Collaborator

Summary

The bug

Mothership task tiles render yellow/amber when the chat has an "active stream" (copilot_chats.conversation_id IS NOT NULL). Several tasks in production were stuck yellow with no stream actually running.

Root cause

copilot_chats.conversation_id is the persistent active-stream marker. It is written when a stream starts and cleared on the finalize/stop callbacks (finalizeAssistantTurn, /chat/stop). It has no TTL and no heartbeat, so if the process dies before the clear path runs (pod OOM, SIGKILL, uncaught throw, deploy mid-stream), the column is orphaned forever and the task renders yellow indefinitely.

The Redis lock at copilot:chat-stream-lock:<chatId> is the canonical liveness signal — it self-heals via 60s TTL + 20s heartbeat from startAbortPoller. The API routes that powered the UI were reading the DB column directly without consulting Redis.

The fix

Read-time reconciliation at the API boundary: check whether the persisted conversation_id has a live Redis lock; if not, treat the marker as null.

  • New helper getActiveChatStreamIds(chatIds[]) in lib/copilot/request/session/abort.ts — batched MGET against copilot:chat-stream-lock:<chatId> keys. Mirrors the existing getPendingChatStreamId pattern (local pending map + Redis fallback, graceful degradation on Redis-null or Redis-throws).
  • GET /api/mothership/chats filters chats with activeStreamId != null, MGETs their locks, rewrites activeStreamId = null for any whose lock has expired.
  • GET /api/mothership/chats/[chatId] does the same reconciliation per-chat before reading the stream snapshot — prevents the client from trying to reconnect to a dead stream via use-chat.ts.

No DB writes; the DB stays the persistent record, Redis remains the source of truth for liveness. Already-stuck rows self-heal on the next fetch.

Type of Change

  • Bug fix

Testing

  • lib/copilot/request/session/abort.test.ts — 5 new cases for getActiveChatStreamIds (empty input, mixed live/expired locks, all expired, Redis unavailable, Redis throws)
  • app/api/mothership/chats/route.test.ts (new) — 4 cases: stuck-yellow reconciled, no Redis lookup when no candidates, live locks pass through, 401 unauthenticated
  • app/api/mothership/chats/[chatId]/route.test.ts (new) — 5 cases: stuck conversationId nulled and no snapshot read, live lock returns snapshot, null conversationId short-circuits Redis call, 404, 401

20 tests passing. bun run check:api-validation passes.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

…o clear stuck-yellow task tiles

copilot_chats.conversation_id has no TTL/heartbeat, so when a stream
process dies before the clear path runs (pod OOM, SIGKILL, uncaught
throw, deploy mid-stream) the column is orphaned and the task tile
renders yellow forever. The Redis lock at copilot:chat-stream-lock:<chatId>
is the canonical liveness signal and self-heals via 60s TTL + 20s
heartbeat, but the mothership APIs weren't consulting it.

Adds read-time reconciliation: a batched MGET helper checks whether
each persisted conversation_id still has a live Redis lock, and both
GET /api/mothership/chats and GET /api/mothership/chats/[chatId]
rewrite the marker to null when the lock has expired. No DB writes;
stuck rows self-heal on next fetch.
@vercel
Copy link
Copy Markdown

vercel Bot commented May 11, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped May 13, 2026 1:43am

Request Review

@cursor
Copy link
Copy Markdown

cursor Bot commented May 11, 2026

PR Summary

Medium Risk
Changes runtime behavior of mothership chat listing/detail APIs to rewrite/repair persisted conversationId based on Redis lock liveness, including opportunistic DB updates; incorrect reconciliation could hide live streams or clear markers unexpectedly during Redis issues.

Overview
Fixes the stuck-yellow state by reconciling persisted chat stream markers (copilot_chats.conversation_id / activeStreamId) against the canonical Redis chat-stream lock.

GET /api/mothership/chats and GET /api/mothership/chats/[chatId] now call reconcileChatStreamMarkers to (a) clear markers when Redis verifies the lock is gone, (b) prefer the Redis owner when it disagrees with the DB, and (c) avoid clearing when Redis state is unknown; the detail route also gates stream snapshot reads on the reconciled stream id.

Introduces getChatStreamLockOwners (batched MGET with graceful degradation) and reconcileChatStreamMarkers (optionally repairs verified-stale markers in the DB), plus new Vitest coverage for the helper and both API routes. Also adjusts use-chat.ts stream recovery to distinguish between "loaded but null" vs "failed to load" so cached stream ids can be used as a fallback.

Reviewed by Cursor Bugbot for commit ee39dea. Configure here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR fixes stuck-yellow task tiles in Mothership by introducing read-time reconciliation of copilot_chats.conversation_id (the persisted active-stream marker) against the Redis stream lock, which is the canonical liveness signal with a 60 s TTL and 20 s heartbeat.

  • New reconcileChatStreamMarkers helper in lib/copilot/chat/stream-liveness.ts batches a Redis MGET against copilot:chat-stream-lock:<chatId> keys; when Redis confirms a lock is gone (verified), stale DB markers are treated as null and optionally back-filled with a safe conditional UPDATE. When Redis is unreachable (unknown), persisted markers are preserved to avoid false-negatives in multi-pod deployments.
  • Both mothership chat API routes now run reconciliation on every read before returning activeStreamId to the client, replacing the old direct column read.
  • use-chat.ts recovery path removes the stale-cache fast-path from getActiveStreamIdForChat so recovery always fetches the server-reconciled value instead of returning a potentially orphaned cached stream ID.

Confidence Score: 5/5

Safe to merge — the change is additive read-time reconciliation with no schema migrations, no new write paths beyond the conditional stale-marker repair, and graceful degradation when Redis is unavailable.

The core logic is well-guarded: the DB repair uses a double-predicate WHERE (chatId AND original streamId) so it cannot race against a concurrent stream start, and the unknown status path preserves persisted markers rather than clearing them, preventing false-negatives in multi-pod deployments. Tests cover all meaningful branches including Redis unavailability, lock mismatch, verified-empty, and the error fallback.

No files require special attention.

Important Files Changed

Filename Overview
apps/sim/lib/copilot/chat/stream-liveness.ts New reconciliation helper – correctly handles verified/unknown Redis status, uses a safe conditional WHERE clause for the optional DB repair, and short-circuits for chats with null streamIds without touching Redis.
apps/sim/lib/copilot/request/session/abort.ts New getChatStreamLockOwners batches a Redis MGET; correctly returns verified+Redis map on success, unknown+local-pending fallback when Redis is unavailable, and trusts verified Redis null over local-pending entries.
apps/sim/app/api/mothership/chats/route.ts Reconciliation added after DB query; passes all chats but reconcileChatStreamMarkers short-circuits null-streamId ones without a Redis call.
apps/sim/app/api/mothership/chats/[chatId]/route.ts Reconciliation applied per-chat before stream snapshot read; correctly uses liveStreamId for all downstream reads and the response field rename from conversationId to activeStreamId.
apps/sim/app/workspace/[workspaceId]/home/hooks/use-chat.ts Removes the stale-cache fast-path so every recovery attempt fetches the server-reconciled value; loaded discriminant allows clean fallback to locally-known refs on fetch failure.
apps/sim/hooks/queries/tasks.ts Unifies parseChatHistory to use activeStreamId for both copilot and mothership sources, consistent with the API field rename.
apps/sim/lib/api/contracts/mothership-tasks.ts Schema updated to replace conversationId with activeStreamId in getMothershipChatResponseSchema.

Reviews (4): Last reviewed commit: "cleanup code and fix types" | Re-trigger Greptile

Comment thread apps/sim/app/api/mothership/chats/route.test.ts Outdated
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit fca291b. Configure here.

@icecrasher321
Copy link
Copy Markdown
Collaborator

@greptile

@icecrasher321
Copy link
Copy Markdown
Collaborator

bugbot run

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 9406a3b. Configure here.

@icecrasher321
Copy link
Copy Markdown
Collaborator

bugbot run

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit ee39dea. Configure here.

@icecrasher321
Copy link
Copy Markdown
Collaborator

@greptile

@icecrasher321 icecrasher321 merged commit 773cd84 into staging May 13, 2026
13 checks passed
@waleedlatif1 waleedlatif1 deleted the waleedlatif1/task-yellow-bug branch May 13, 2026 02:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants