Issues Digest · 18:30–00:30 UTC · May 24, 2026

OpenClaw Issue Digest: Event-Loop Starvation and Regression Risks in v2026.5.22

By devasher · Edited by Nominiclaw

A critical analysis of recent OpenClaw activity, focusing on event-loop starvation reported on Windows, Linux, and macOS and on a series of high-severity regressions introduced in version 2026.5.22.

The recent reporting window for OpenClaw has been dominated by stability issues surrounding the v2026.5.22 release. While the platform continues to expand its feature set, from voice-call enhancements to advanced memory indexing, a cluster of high-severity bugs has emerged that directly affects the gateway's liveness and session integrity.

Of particular concern are reports of event-loop starvation and critical regressions in the plugin loader, which have led to duplicate message delivery and session data corruption for several production operators. This digest synthesizes these technical failures and outlines the immediate actions required to stabilize the environment.

Open Issues

Event-Loop Starvation and Gateway Liveness

Multiple reports indicate that the OpenClaw gateway is susceptible to severe event-loop starvation under specific load conditions. On Windows, users running local Ollama embedded agents have observed the Node.js event loop being blocked for up to 21 seconds (#86242), leading to Telegram API timeouts and WebSocket disconnections.

Similarly, on Linux VPS deployments, high event-loop utilization (99.9%) during long-running resume-session turns has caused the dispatch resolver to fail. This manifests as a misleading MissingAgentHarnessError, where the system reports a harness as "not registered" simply because the lookup timed out under CPU pressure (#86239). These issues suggest a systemic vulnerability where heavy agentic workloads can starve the gateway's core communication paths.
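One way to confirm starvation of this kind is to sample the loop directly. The sketch below uses Node's built-in perf_hooks module to record event-loop delay and utilization. The helper name startLoopMonitor and the 20 ms sampling resolution are illustrative assumptions, not part of OpenClaw.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

interface LoopSnapshot {
  maxDelayMs: number;   // worst observed delay since the last snapshot
  p99DelayMs: number;   // 99th-percentile delay since the last snapshot
  utilization: number;  // fraction of time the loop was busy (0 to 1)
}

// Hypothetical helper (not an OpenClaw API): wraps Node's delay histogram
// and event-loop utilization counters behind a snapshot() call.
function startLoopMonitor(resolutionMs = 20) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  let previous = performance.eventLoopUtilization();

  return {
    snapshot(): LoopSnapshot {
      const current = performance.eventLoopUtilization();
      // Passing both readings returns the utilization between them.
      const delta = performance.eventLoopUtilization(current, previous);
      previous = current;
      const result: LoopSnapshot = {
        maxDelayMs: histogram.max / 1e6, // histogram values are nanoseconds
        p99DelayMs: histogram.percentile(99) / 1e6,
        utilization: delta.utilization,
      };
      histogram.reset();
      return result;
    },
    stop(): void {
      histogram.disable();
    },
  };
}

const monitor = startLoopMonitor();
const firstSnapshot = monitor.snapshot();
monitor.stop();
```

In a long-running process, snapshot() would typically be called on an interval and logged when maxDelayMs crosses a chosen threshold. A multi-second stall like the one reported on Windows should then appear as a large maxDelayMs in the first snapshot taken after the loop recovers.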

v2026.5.22 Regressions

Version 2026.5.22 has introduced several critical regressions:

Handler Stacking: A bug in preserveGatewayHookRunner causes initializeGlobalHookRunner to be skipped during subagent hot-reload cycles. This results in handlers stacking rather than being replaced, leading to "N-fold delivery" where users receive multiple duplicate copies of every agent message (#86241).
Non-Atomic Writes: Session .jsonl files are currently written with a truncate-first pattern. During rapid restart sequences (such as auto-updates), a SIGTERM that arrives mid-write can leave session files truncated and unrecoverable, causing permanent data loss for active sessions (#86241).
Auth Pre-warm Latency: A regression in the Provider Auth pre-warm mechanism has caused startup times to jump from milliseconds to over 300 seconds on macOS, blocking the event loop and rendering the gateway unusable during boot (#86212).
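The handler-stacking failure described in the first item above can be reproduced in miniature. The sketch below contrasts a registry that appends on every reload with one that keeps a single slot per key. Both classes and the "deliver-reply" key are hypothetical illustrations and do not reflect OpenClaw's actual hook runner.

```typescript
type Handler = (message: string) => void;

// Stacking variant: each reload appends another handler, so one emit
// fans out to every copy registered so far.
class StackingHooks {
  private handlers: Handler[] = [];
  register(handler: Handler): void {
    this.handlers.push(handler);
  }
  emit(message: string): void {
    this.handlers.forEach((handler) => handler(message));
  }
}

// Replacing variant: one slot per key, so re-registering is idempotent.
class KeyedHooks {
  private handlers = new Map<string, Handler>();
  register(key: string, handler: Handler): void {
    this.handlers.set(key, handler);
  }
  emit(message: string): void {
    this.handlers.forEach((handler) => handler(message));
  }
}

// Simulates `reloads` hot-reload cycles followed by one outgoing message,
// and counts how many times that message is delivered by each variant.
function simulateReloads(reloads: number): { stacked: number; keyed: number } {
  let stacked = 0;
  let keyed = 0;
  const stackingHooks = new StackingHooks();
  const keyedHooks = new KeyedHooks();
  for (let i = 0; i < reloads; i++) {
    stackingHooks.register(() => { stacked += 1; });
    keyedHooks.register("deliver-reply", () => { keyed += 1; });
  }
  stackingHooks.emit("hello");
  keyedHooks.emit("hello");
  return { stacked, keyed };
}

const deliveries = simulateReloads(3); // { stacked: 3, keyed: 1 }
```

After three simulated reloads the stacking registry delivers the message three times and the keyed registry delivers it once, which matches the N-fold delivery symptom.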

Channel and Tooling Failures

Beyond core stability, several channel-specific bugs have surfaced:

Telegram Threading: A regression in v2026.2.17 has blocked replies within Telegram group threads, returning a "reply target not found" error (#86235).
Discord Delivery: Sub-agent completion announcements are silently failing in Discord group chats due to a source_reply_delivery_mode_mismatch (#86232).
Codex App-Server: Reports indicate the Codex app-server client may close mid-turn when dealing with large trace databases (logs_2.sqlite > 800MB), leading to aborted turns (#86214).

Key Themes

The "Liveness" Crisis

A recurring theme is the gateway becoming unresponsive during heavy computation. Whether the trigger is an Ollama run on Windows or a large Claude-CLI turn on Linux, agent execution is crowding out the gateway's event loop. The current architecture appears to struggle with "blocking" operations that prevent the gateway from maintaining its heartbeats and API connections.

Fragile State Management

The transition from v2026.5.19 to v2026.5.22 has highlighted a lack of atomicity in state persistence. The corruption of .jsonl files during restarts suggests that the system lacks a robust "write-to-tmp and rename" pattern, making the platform vulnerable to power failures or abrupt process terminations.
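The write-to-tmp and rename pattern named above can be sketched as follows. This is a minimal illustration using Node's synchronous fs calls for brevity. The function name writeFileAtomicSync and the temporary-file naming scheme are assumptions, and a gateway that is already short on event-loop time would likely prefer the fs/promises equivalents of the same steps.

```typescript
import {
  closeSync, fsyncSync, mkdtempSync, openSync,
  readFileSync, renameSync, writeFileSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { basename, dirname, join } from "node:path";

// Hypothetical helper: replaces targetPath so that readers see either the
// old contents or the new contents, never a truncated file.
function writeFileAtomicSync(targetPath: string, contents: string): void {
  // The temp file lives in the same directory so rename stays on one filesystem.
  const tmpPath = join(
    dirname(targetPath),
    `.${basename(targetPath)}.${process.pid}.tmp`,
  );
  const fd = openSync(tmpPath, "w");
  try {
    writeFileSync(fd, contents); // write the full payload to the temp file
    fsyncSync(fd);               // flush file data to disk before the swap
  } finally {
    closeSync(fd);
  }
  renameSync(tmpPath, targetPath); // swap the new file into place
}

// Usage: rewrite a session file twice and read it back.
const sessionDir = mkdtempSync(join(tmpdir(), "session-"));
const sessionFile = join(sessionDir, "session.jsonl");
writeFileAtomicSync(sessionFile, '{"turn":1}\n');
writeFileAtomicSync(sessionFile, '{"turn":1}\n{"turn":2}\n');
const roundTrip = readFileSync(sessionFile, "utf8");
```

If the process is killed before renameSync runs, the original file is untouched and only a stray temporary file remains. The sketch does not fsync the containing directory, which some filesystems need for the rename itself to survive a power failure.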

Misleading Error Surface

Several issues (#86239, #86184) highlight a gap in diagnostic clarity. When the system fails under load, it often returns generic "Something went wrong" messages or structurally misleading errors (such as claiming a harness is missing when the lookup actually timed out). This obscures the root cause from operators and delays remediation.
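A sketch of how the two failure modes could be kept apart is shown below. MissingAgentHarnessError is the name reported in the issue, but its constructor shape here is assumed. LookupOutcome, HarnessLookupTimeoutError, toOperatorError, and the "claude-cli" harness id are hypothetical names introduced for illustration.

```typescript
// Hypothetical result type: a lookup either finds the harness, confirms it
// is absent, or gives up without learning either way.
type LookupOutcome =
  | { kind: "found" }
  | { kind: "not_registered" }
  | { kind: "timed_out"; elapsedMs: number };

class MissingAgentHarnessError extends Error {
  constructor(harnessId: string) {
    super(`Agent harness "${harnessId}" is not registered`);
    this.name = "MissingAgentHarnessError";
  }
}

class HarnessLookupTimeoutError extends Error {
  constructor(harnessId: string, elapsedMs: number) {
    super(
      `Lookup for agent harness "${harnessId}" timed out after ${elapsedMs} ms; ` +
        "the harness may still be registered",
    );
    this.name = "HarnessLookupTimeoutError";
  }
}

// Maps each outcome to a distinct error so a timeout is never reported
// as a missing registration.
function toOperatorError(harnessId: string, outcome: LookupOutcome): Error | undefined {
  switch (outcome.kind) {
    case "found":
      return undefined;
    case "not_registered":
      return new MissingAgentHarnessError(harnessId);
    case "timed_out":
      return new HarnessLookupTimeoutError(harnessId, outcome.elapsedMs);
  }
}

const underLoad = toOperatorError("claude-cli", { kind: "timed_out", elapsedMs: 5000 });
const trulyMissing = toOperatorError("claude-cli", { kind: "not_registered" });
```

With the timeout carried as its own outcome, an operator reading the log can tell CPU pressure apart from a configuration mistake.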

Action Required

Immediate Critical Fixes

Fix preserveGatewayHookRunner (#86241): This is a beta-release blocker. The logic must be updated so that initializeGlobalHookRunner runs on every registration cycle after a subagent completes, preventing message duplication.
Implement Atomic Session Writes (#86241): Write session .jsonl data to a temporary file, fsync it, and then rename() it over the target file to prevent session corruption during restarts.
Resolve Auth Pre-warm Blocking (#86212): The provider auth pre-warm mechanism must be made asynchronous to prevent it from blocking the Node.js event loop for several minutes during startup.
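For the pre-warm item, one possible shape is to start authentication in the background and let boot continue without awaiting it. The sketch below assumes the pre-warm is I/O-bound work that can be expressed as a promise, which the issue does not confirm. The names startPrewarm and fakeProviderAuth are hypothetical, and the 10 ms timer stands in for a real provider call.

```typescript
type PrewarmStatus = "pending" | "ready" | "failed";

// Hypothetical stand-in for a provider auth round trip.
function fakeProviderAuth(): Promise<string> {
  return new Promise((resolve) => setTimeout(() => resolve("token"), 10));
}

// Starts authentication without blocking the caller. Boot code can proceed
// immediately and check status() (or await done) only when a token is needed.
function startPrewarm(authenticate: () => Promise<string>) {
  let status: PrewarmStatus = "pending";
  const done = authenticate().then(
    (token) => {
      status = "ready";
      return token as string | undefined;
    },
    () => {
      // Failures are recorded rather than thrown so boot is never aborted.
      status = "failed";
      return undefined as string | undefined;
    },
  );
  return { status: () => status, done };
}

const prewarm = startPrewarm(fakeProviderAuth);
// Control returns here right away; the gateway could start serving now.
const statusAtBoot = prewarm.status(); // "pending"
```

If the underlying work turns out to be synchronous CPU work rather than awaited I/O, wrapping it in a promise would not help, and it would need to be chunked or moved off the main thread instead.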

High Priority Attention

Event-Loop Starvation (#86242, #86239): Investigation is needed into why embedded runs and large CLI turns are starving the loop. Potential solutions include moving heavy dispatch logic to worker threads or implementing tighter timeouts for the dispatch resolver.
Telegram Threading Fix (#86235): Restore the ability to post replies in Telegram threads, as this is a high-severity blocker for group-based workflows.
Codex DB Maintenance (#86214): Implement log rotation or pruning for logs_2.sqlite to prevent app-server crashes caused by oversized trace databases.