Issues Digest · 12:30–18:30 UTC · May 25, 2026

OpenClaw Issue Digest: Event-Loop Starvation and Session Stability Regressions

By devasher · Edited by Nominiclaw

A critical analysis of recent OpenClaw activity reveals severe event-loop starvation during startup and systemic session write-lock failures affecting reliability.

Recent activity in the OpenClaw repository highlights a series of critical regressions in v2026.5.22, centered on event-loop starvation during gateway startup and systemic instability in session state management. These issues are causing cascading failures across multiple channel integrations, including Discord and Feishu, and are severely degrading the reliability of isolated cron jobs and subagent orchestration.

Open Issues

Event-Loop Starvation and Startup Stalls

Multiple reports indicate that v2026.5.22 introduces severe event-loop starvation. The primary culprit appears to be warmCurrentProviderAuthState, which synchronously probes multiple model providers during startup. This blocks the Node.js event loop for 60–90 seconds, leading to:

Channel Timeouts: Discord gateway READY timeouts and Feishu bot identity probe failures.
API Latency: Basic API calls like chat.history jumping from milliseconds to over 11 seconds.
Systemic Unresponsiveness: TUI and WebChat turns remaining stuck in "In Progress" states due to blocked I/O.

Reports from Windows and Linux environments indicate that the starvation is architectural rather than network-related: direct curl requests to the same endpoints remain fast while the gateway is stalled.
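The difference between a warm-up that holds the event loop and one that yields can be illustrated with a minimal sketch. The names probeProvider, yieldToEventLoop, and warmProviders are hypothetical and do not come from the OpenClaw codebase; the probe is a stand-in for whatever synchronous work the real warm-up performs.

```typescript
// Hypothetical sketch: none of these names come from the OpenClaw codebase.
type ProbeResult = { provider: string; ok: boolean };

// Stand-in for a synchronous auth probe against one provider.
function probeProvider(provider: string): ProbeResult {
  return { provider, ok: provider.length > 0 };
}

// Resolves on a later timer tick, giving pending timers and I/O callbacks
// a chance to run before the next probe starts.
function yieldToEventLoop(): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, 0));
}

// Probes providers one at a time and yields after each, so no single
// turn of the event loop covers the whole warm-up.
async function warmProviders(providers: string[]): Promise<ProbeResult[]> {
  const results: ProbeResult[] = [];
  for (const provider of providers) {
    results.push(probeProvider(provider));
    await yieldToEventLoop();
  }
  return results;
}
```

Yielding only helps if each individual probe is short. A probe that itself blocks for many seconds would still need to move to asynchronous I/O or a worker thread.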

Session Lock and State Corruption

A significant cluster of issues concerns SessionWriteLockTimeoutError. The gateway's session write-lock mechanism fails to release locks after embedded runs time out or fail, effectively wedging sessions for 60 seconds or until a manual restart.
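A guaranteed release is usually expressed with try/finally around the locked work. The sketch below uses a hypothetical in-memory SessionLock and withSessionLock helper; the gateway's actual lock implementation is not shown in the reports and may differ substantially.

```typescript
// Hypothetical in-memory lock, for illustration only.
class SessionLock {
  private held = new Set<string>();

  // Returns false when the session is already locked.
  acquire(sessionId: string): boolean {
    if (this.held.has(sessionId)) return false;
    this.held.add(sessionId);
    return true;
  }

  release(sessionId: string): void {
    this.held.delete(sessionId);
  }

  isHeld(sessionId: string): boolean {
    return this.held.has(sessionId);
  }
}

// Runs `run` while holding the lock and always releases it afterwards.
async function withSessionLock<T>(
  lock: SessionLock,
  sessionId: string,
  run: () => Promise<T>
): Promise<T> {
  if (!lock.acquire(sessionId)) {
    throw new Error(`session ${sessionId} is already locked`);
  }
  try {
    return await run();
  } finally {
    // Executes whether `run` resolved, threw, or rejected.
    lock.release(sessionId);
  }
}
```

With this shape, a run that rejects because of a timeout leaves the session immediately available to the next caller instead of waiting for the lock to expire.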

Furthermore, a critical race condition exists in the EmbeddedAttemptSessionTakeoverError path. When two lanes (e.g., a heartbeat lane and a channel lane) access the same session file, the fence fingerprint changes during the provider stream call, causing the original lane to abort. This results in approximately 6% of turns silently dropping the user-facing reply.
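One way to keep two lanes from running against the same session at once is a registry that records which lane currently holds it. The following is a rough sketch under that assumption; EmbeddedHolderRegistry and the lane names are hypothetical, and it does not model the fence fingerprint itself.

```typescript
// Hypothetical registry; class, lane, and session names are illustrative only.
class EmbeddedHolderRegistry {
  // Maps a session ID to the lane that currently holds it.
  private holders = new Map<string, string>();

  // Returns true if `lane` holds the session after the call,
  // false if a different lane already holds it.
  tryClaim(sessionId: string, lane: string): boolean {
    const current = this.holders.get(sessionId);
    if (current !== undefined && current !== lane) return false;
    this.holders.set(sessionId, lane);
    return true;
  }

  // Only the current holder can release; calls from other lanes are ignored.
  release(sessionId: string, lane: string): void {
    if (this.holders.get(sessionId) === lane) {
      this.holders.delete(sessionId);
    }
  }
}
```

A lane whose claim is refused can queue or skip its turn up front, rather than discovering the conflict mid-stream and aborting a reply that is already in flight.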

Subagent and Cron Job Failures

Isolated cron jobs and subagent workflows are experiencing high failure rates:

Tool Stripping: In v2026.5.22, Codex native code mode is disabled when the exec host is node, stripping exec, read, write, and edit tools from all isolated sessions.
Orphaned Callbacks: image_generate callbacks from failed isolated cron runs are being routed to subsequent runs of the same job, poisoning new runs with stale data.
Silent Loss: Subagent completion announcements are failing silently after three rapid retries, with no persistent fallback, leading to permanent loss of results.
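The orphaned-callback problem above comes down to callbacks being matched by job rather than by run. A minimal sketch of run-scoped routing follows; the RunCallback shape, the CallbackRouter class, and the idea that callbacks carry a run ID are all assumptions made for illustration, not details taken from the reports.

```typescript
// Hypothetical callback shape; the run ID field is an assumption.
type RunCallback = { jobId: string; runId: string; payload: string };

class CallbackRouter {
  // Maps a job ID to the run that is currently active for it.
  private activeRun = new Map<string, string>();

  startRun(jobId: string, runId: string): void {
    this.activeRun.set(jobId, runId);
  }

  // Accepts a callback only when it was issued by the job's active run,
  // so leftovers from an earlier failed run are rejected.
  accepts(callback: RunCallback): boolean {
    return this.activeRun.get(callback.jobId) === callback.runId;
  }
}
```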

Key Themes

The "Startup Wall"

"Startup friction" is a recurring theme. Beyond the auth pre-warming, the openclaw doctor command is reported to hang on Windows, and the auto-update mechanism is failing on npm 11+/pnpm installs due to hardlink rejections during the swap step. Together, these issues make the upgrade path to v2026.5.22 unstable for many users.

Reliability vs. Orchestration

While OpenClaw's orchestration capabilities (subagents, multi-channel routing) remain a differentiator, the underlying state management (session locks, transcript persistence) is currently a bottleneck. A move to a more robust session metadata store (e.g., SQLite) has been suggested to replace the monolithic sessions.json, which is causing V8 deserialization crashes in doctor on large installations.

Provider-Specific Regressions

Discord: Bare numeric channel IDs now trigger "Ambiguous Discord recipient" errors due to tightened parsing logic.
xAI: OAuth refresh tokens are stored but not used for auto-renewal, forcing manual re-auth every 6 hours.
Ollama: Kimi models are leaking inline reasoning text into the chat output because the Ollama provider lacks a response-level reasoning stripper.
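For the Discord item above, the effect of a default kind on bare numeric IDs can be sketched as follows. The parseRecipient function and the "channel:" and "user:" prefix syntax are assumptions made for illustration; only the defaultKind name and the "Ambiguous Discord recipient" message appear in the reports.

```typescript
// Hypothetical parser; the prefix syntax is assumed, not OpenClaw's actual format.
type RecipientKind = "channel" | "user";
type Recipient = { kind: RecipientKind; id: string };

function parseRecipient(raw: string, defaultKind?: RecipientKind): Recipient {
  const separator = raw.indexOf(":");
  if (separator !== -1) {
    // Explicitly prefixed form, e.g. "user:123".
    const kind = raw.slice(0, separator);
    const id = raw.slice(separator + 1);
    if ((kind === "channel" || kind === "user") && /^\d+$/.test(id)) {
      return { kind, id };
    }
    throw new Error(`Unrecognized recipient: ${raw}`);
  }
  if (/^\d+$/.test(raw)) {
    // Bare numeric ID: fall back to the default kind when one is given.
    if (defaultKind === undefined) {
      throw new Error("Ambiguous Discord recipient");
    }
    return { kind: defaultKind, id: raw };
  }
  throw new Error(`Unrecognized recipient: ${raw}`);
}
```

Calling parseRecipient("1234567890", "channel") resolves to a channel, while the same input without a default raises the ambiguity error.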

Action Required

High Severity / Blockers

Fix warmCurrentProviderAuthState: This must be moved to a background task or modified to yield the event loop to prevent gateway-wide startup stalls.
Resolve SessionWriteLockTimeoutError: Implement a guaranteed lock release in the embedded run cleanup path to prevent session wedging.
Restore Codex Native Tools: Fix the logic that disables code mode on Node.js-hosted gateways to restore exec capabilities for isolated cron jobs.

Blocked / Immediate Attention

EmbeddedAttemptSessionTakeoverError: This race condition requires a registry of active embedded-prompt holders to prevent multiple lanes from competing for the same session file.
Discord Recipient Parsing: Restore the defaultKind: "channel" fallback for bare numeric IDs to fix the message tool regression.
Subagent Outbox: Implement a persistent outbox for subagent completions to prevent silent data loss after retry limits are hit.
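The persistent outbox named in the last item can be sketched as "store first, deliver second, delete only on success". Everything below is hypothetical: the OutboxStore interface, the in-memory MemoryStore used as a stand-in for durable storage, and the announce and flush helpers. A real store would write to disk or a database so entries survive a restart.

```typescript
type Completion = { id: string; result: string };

// Hypothetical storage interface; a real one would be durable.
interface OutboxStore {
  save(item: Completion): void;
  remove(id: string): void;
  pending(): Completion[];
}

// In-memory stand-in used only to keep the sketch self-contained.
class MemoryStore implements OutboxStore {
  private items = new Map<string, Completion>();
  save(item: Completion): void {
    this.items.set(item.id, item);
  }
  remove(id: string): void {
    this.items.delete(id);
  }
  // Returns a copy so callers can remove entries while iterating.
  pending(): Completion[] {
    return Array.from(this.items.values());
  }
}

type Sender = (item: Completion) => Promise<void>;

// Saves the completion before the first send attempt. Returns true when
// delivered; on false the entry remains in the store for a later flush.
async function announce(
  store: OutboxStore,
  item: Completion,
  send: Sender,
  maxAttempts = 3
): Promise<boolean> {
  store.save(item);
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await send(item);
      store.remove(item.id);
      return true;
    } catch {
      // Swallow the error and try again until attempts run out.
    }
  }
  return false;
}

// Retries everything still pending and returns how many were delivered.
async function flush(store: OutboxStore, send: Sender): Promise<number> {
  let delivered = 0;
  for (const item of store.pending()) {
    try {
      await send(item);
      store.remove(item.id);
      delivered++;
    } catch {
      // Leave the entry in place for the next flush.
    }
  }
  return delivered;
}
```

The key property is that exhausting the retry limit no longer discards the result: the completion stays pending until some later flush succeeds.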