fix(mothership): reconcile stuck conversation_id against Redis lock to clear stuck-yellow task tiles#4556
Conversation
…o clear stuck-yellow task tiles copilot_chats.conversation_id has no TTL/heartbeat, so when a stream process dies before the clear path runs (pod OOM, SIGKILL, uncaught throw, deploy mid-stream) the column is orphaned and the task tile renders yellow forever. The Redis lock at copilot:chat-stream-lock:<chatId> is the canonical liveness signal and self-heals via 60s TTL + 20s heartbeat, but the mothership APIs weren't consulting it. Adds read-time reconciliation: a batched MGET helper checks whether each persisted conversation_id still has a live Redis lock, and both GET /api/mothership/chats and GET /api/mothership/chats/[chatId] rewrite the marker to null when the lock has expired. No DB writes; stuck rows self-heal on next fetch.
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
PR SummaryMedium Risk Overview
Introduces Reviewed by Cursor Bugbot for commit ee39dea. Configure here. |
Greptile SummaryThis PR fixes stuck-yellow task tiles in Mothership by introducing read-time reconciliation of
Confidence Score: 5/5Safe to merge — the change is additive read-time reconciliation with no schema migrations, no new write paths beyond the conditional stale-marker repair, and graceful degradation when Redis is unavailable. The core logic is well-guarded: the DB repair uses a double-predicate WHERE (chatId AND original streamId) so it cannot race against a concurrent stream start, and the No files require special attention. Important Files Changed
Reviews (4): Last reviewed commit: "cleanup code and fix types" | Re-trigger Greptile |
…amIds is called with empty candidateIds
|
@greptile |
|
@cursor review |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit fca291b. Configure here.
|
@greptile |
|
bugbot run |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 9406a3b. Configure here.
|
bugbot run |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit ee39dea. Configure here.
|
@greptile |
Summary
The bug
Mothership task tiles render yellow/amber when the chat has an "active stream" (
copilot_chats.conversation_id IS NOT NULL). Several tasks in production were stuck yellow with no stream actually running.Root cause
copilot_chats.conversation_idis the persistent active-stream marker. It is written when a stream starts and cleared on the finalize/stop callbacks (finalizeAssistantTurn,/chat/stop). It has no TTL and no heartbeat, so if the process dies before the clear path runs (pod OOM, SIGKILL, uncaught throw, deploy mid-stream), the column is orphaned forever and the task renders yellow indefinitely.The Redis lock at
copilot:chat-stream-lock:<chatId>is the canonical liveness signal — it self-heals via 60s TTL + 20s heartbeat fromstartAbortPoller. The API routes that powered the UI were reading the DB column directly without consulting Redis.The fix
Read-time reconciliation at the API boundary: check whether the persisted
conversation_idhas a live Redis lock; if not, treat the marker asnull.getActiveChatStreamIds(chatIds[])inlib/copilot/request/session/abort.ts— batched MGET againstcopilot:chat-stream-lock:<chatId>keys. Mirrors the existinggetPendingChatStreamIdpattern (local pending map + Redis fallback, graceful degradation on Redis-null or Redis-throws).GET /api/mothership/chatsfilters chats withactiveStreamId != null, MGETs their locks, rewritesactiveStreamId = nullfor any whose lock has expired.GET /api/mothership/chats/[chatId]does the same reconciliation per-chat before reading the stream snapshot — prevents the client from trying to reconnect to a dead stream viause-chat.ts.No DB writes; the DB stays the persistent record, Redis remains the source of truth for liveness. Already-stuck rows self-heal on the next fetch.
Type of Change
Testing
lib/copilot/request/session/abort.test.ts— 5 new cases forgetActiveChatStreamIds(empty input, mixed live/expired locks, all expired, Redis unavailable, Redis throws)app/api/mothership/chats/route.test.ts(new) — 4 cases: stuck-yellow reconciled, no Redis lookup when no candidates, live locks pass through, 401 unauthenticatedapp/api/mothership/chats/[chatId]/route.test.ts(new) — 5 cases: stuck conversationId nulled and no snapshot read, live lock returns snapshot, null conversationId short-circuits Redis call, 404, 40120 tests passing.
bun run check:api-validationpasses.Checklist