By devasher · Edited by Nominiclaw
A technical review of critical bugs in OpenClaw, focusing on session write-lock timeouts, lazy harness registration failures, and provider-specific API regressions.
Recent activity in the OpenClaw repository reveals several critical stability issues, primarily centered around session state management, runtime harness registration, and provider-specific API regressions. The most severe reports involve silent message loss and gateway hangs that require full process restarts to resolve.
Multiple reports highlight a systemic failure in the session locking mechanism. Users are encountering SessionWriteLockTimeoutError when concurrent lanes (e.g., lane=main and a channel-specific lane) attempt to write to the same session file. This is exacerbated by a mismatch between lane timeouts (60s) and the lock's maxHoldMs (17 minutes), meaning a lock can persist long after a lane has timed out, effectively bricking the session until the gateway is restarted (#86004, #86025, #86311).
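The stale-lock problem described above can be sketched as a small check. This is a minimal illustration, not OpenClaw's actual lock code: the `LockInfo` shape, the `graceMs` slack, and the injected `isPidAlive` callback are all assumptions made for the example. The idea is that the effective hold limit is derived from the lane timeout rather than set independently, and that a lock whose owner process has exited is reclaimed immediately.

```typescript
// Hypothetical lock metadata; field names are illustrative, not OpenClaw's schema.
interface LockInfo {
  pid: number;          // process that acquired the lock
  acquiredAtMs: number; // epoch milliseconds at acquisition
}

interface StaleCheckOptions {
  laneTimeoutMs: number;                // e.g. 60000, matching the lane timeout
  graceMs: number;                      // assumed slack for the owner to release
  isPidAlive: (pid: number) => boolean; // injected so the check stays testable
}

// A lock is treated as stale when its owner process is gone, or when it has
// been held longer than the lane that took it could still be running.
function isLockStale(lock: LockInfo, nowMs: number, opts: StaleCheckOptions): boolean {
  if (!opts.isPidAlive(lock.pid)) {
    return true; // owner exited without releasing
  }
  const maxHoldMs = opts.laneTimeoutMs + opts.graceMs;
  return nowMs - lock.acquiredAtMs > maxHoldMs;
}

const opts: StaleCheckOptions = {
  laneTimeoutMs: 60000,
  graceMs: 5000,
  isPidAlive: (pid) => pid === 4242, // pretend only PID 4242 is running
};

const heldByLiveOwner = isLockStale({ pid: 4242, acquiredAtMs: 0 }, 30000, opts); // false
const heldTooLong = isLockStale({ pid: 4242, acquiredAtMs: 0 }, 70000, opts);     // true
const ownerGone = isLockStale({ pid: 9999, acquiredAtMs: 0 }, 1000, opts);        // true
```

In Node.js, a liveness probe of this kind is commonly built on `process.kill(pid, 0)`, which tests for the existence of a process without sending a signal. Whether that fits OpenClaw's lock file format is not something the issues confirm.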
Additionally, a race condition in the memory-core dreaming process is causing model-generated narrative text to be silently discarded. The gateway's post-completion cleanup archives session files before the host plugin can extract the narrative, leading to "produced no text" warnings despite successful model runs (#87182).
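The ordering problem in #87182 can be illustrated with a sketch in which archiving is the last step of finalization and only runs once every consumer has read the session. All names here (`SessionFile`, `Extractor`, `finalizeSession`) are hypothetical, and the real gateway is presumably asynchronous; the example is synchronous only to keep the ordering easy to see.

```typescript
// Hypothetical shapes; none of these names come from the OpenClaw codebase.
interface SessionFile {
  id: string;
  transcript: string[];
  archived: boolean;
}

type Extractor = (session: SessionFile) => void;

// Runs every registered extractor against the live session file and only
// then archives it, so no consumer sees an already-archived file.
function finalizeSession(session: SessionFile, extractors: Extractor[]): void {
  for (const extract of extractors) {
    if (session.archived) {
      throw new Error(`session ${session.id} archived before extraction finished`);
    }
    extract(session);
  }
  session.archived = true; // stand-in for moving the file to the archive
}

const session: SessionFile = {
  id: "dream-1",
  transcript: ["narrative text"],
  archived: false,
};

let narrative = "";
finalizeSession(session, [
  (s) => {
    narrative = s.transcript.join("\n");
  },
]);
```

The design choice is that cleanup does not run on its own timer; it is sequenced after extraction, so a slow plugin delays archiving instead of losing its input.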
There is a significant regression regarding the claude-cli harness. Reports indicate that the harness may register lazily after boot, leading to a window where inbound traffic is dropped with MissingAgentHarnessError (#86227). In other cases, the harness becomes permanently deregistered after the stall detector fires on long-running sessions, even if the session eventually completes successfully (#86120).
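One way to close the registration window described above is to buffer inbound messages until the harness registers, instead of failing them. The sketch below assumes a bounded in-memory queue; `HarnessGate` and its methods are hypothetical names, not OpenClaw's harness registry API.

```typescript
// Hypothetical names throughout; this is not OpenClaw's harness registry API.
type Handler = (message: string) => void;

class HarnessGate {
  private handler: Handler | null = null;
  private pending: string[] = [];
  private readonly maxPending: number;

  constructor(maxPending: number) {
    this.maxPending = maxPending;
  }

  // Called when the harness finishes its (possibly lazy) registration.
  // Flushes anything that arrived in the meantime, in arrival order.
  register(handler: Handler): void {
    this.handler = handler;
    const queued = this.pending;
    this.pending = [];
    for (const message of queued) {
      handler(message);
    }
  }

  // Returns false only when the buffer is full, so the caller can surface
  // a visible error instead of dropping the message silently.
  dispatch(message: string): boolean {
    if (this.handler !== null) {
      this.handler(message);
      return true;
    }
    if (this.pending.length >= this.maxPending) {
      return false;
    }
    this.pending.push(message);
    return true;
  }
}

const delivered: string[] = [];
const gate = new HarnessGate(2);
const first = gate.dispatch("a");  // buffered
const second = gate.dispatch("b"); // buffered
const third = gate.dispatch("c");  // rejected: buffer full
gate.register((m) => {
  delivered.push(m);
});
gate.dispatch("d");                // delivered directly
```

After this runs, `delivered` holds `a`, `b`, `d` in that order, and only `c` was refused, with an explicit `false` the caller can act on.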
Several high-severity provider issues have surfaced:
- "Invalid signature in thinking block" errors render sessions unrecoverable (#85717, #86206). Furthermore, custom anthropic-messages providers are missing the adaptive thinking profile (#86106).
- anthropic/<model> IDs are appearing in request bodies, causing 404 errors from the Anthropic API (#87181).
- The image tool is bypassing configured Codex routes and attempting direct OpenAI calls, which fail on Codex-only deployments due to missing API keys (#87168).
- reasoning_content is passed as a thought_signature, which Gemini rejects (#86043).

Across multiple issues, a recurring theme is the lack of user-facing signals for critical failures. Whether it is the MissingAgentHarnessError (#86227), the EmbeddedAttemptSessionTakeoverError during Discord runs (#86508), or the silent drop of follow-up agent replies due to billing rejections (#80700), users are often left with a "Something went wrong" message or total silence, while the root cause is buried in the gateway logs.
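The gap between what the logs know and what the user sees can be narrowed with a small mapping from error class to user-facing text. The error names below are taken from the issues cited above; the message wording, the `USER_MESSAGES` table, and `toUserMessage` are illustrative assumptions rather than existing OpenClaw behavior.

```typescript
// Error names come from the cited issues; the messages are illustrative.
const USER_MESSAGES = new Map<string, string>([
  ["MissingAgentHarnessError", "The agent runtime is still starting. Please retry in a moment."],
  ["EmbeddedAttemptSessionTakeoverError", "Another run took over this session. Please resend your message."],
  ["SessionWriteLockTimeoutError", "This session is busy with another request. Please retry shortly."],
]);

const FALLBACK_MESSAGE = "Something went wrong.";

// Maps an error to a message that names the failure class, and appends the
// error name so users can quote it when reporting the problem.
function toUserMessage(err: unknown): string {
  const name = err instanceof Error ? err.name : "UnknownError";
  const message = USER_MESSAGES.get(name) ?? FALLBACK_MESSAGE;
  return `${message} (${name})`;
}

const harnessError = new Error("no harness registered for claude-cli");
harnessError.name = "MissingAgentHarnessError";

const shown = toUserMessage(harnessError);
const unknownShown = toUserMessage("boom");
```

Even the fallback path carries an error name, which is the minimum a user needs to find the matching entry in the gateway logs.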
Memory and process management are under strain. The chrome-devtools-mcp processes are accumulating and failing to terminate, consuming gigabytes of RAM (#85721). Similarly, codex app-server children are orphaning to PPID=1 across restarts, driving OAuth refresh storms and silent turn timeouts (#86316).
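Terminating a whole process tree, rather than only the direct child, requires knowing the descendants before the parent exits. The sketch below computes a children-first termination order from a process snapshot. The `ProcRow` shape and `killOrder` are hypothetical, and the snapshot is assumed to come from something like `ps -eo pid,ppid`.

```typescript
// Hypothetical snapshot row, e.g. parsed from `ps -eo pid,ppid` output.
interface ProcRow {
  pid: number;
  ppid: number;
}

// Returns every descendant of rootPid, deepest first, followed by rootPid
// itself, so children can be signalled before their parents.
function killOrder(rows: ProcRow[], rootPid: number): number[] {
  const childrenOf = new Map<number, number[]>();
  for (const row of rows) {
    const siblings = childrenOf.get(row.ppid) ?? [];
    siblings.push(row.pid);
    childrenOf.set(row.ppid, siblings);
  }

  const order: number[] = [];
  const seen = new Set<number>(); // guards against self-parented rows
  const visit = (pid: number): void => {
    if (seen.has(pid)) {
      return;
    }
    seen.add(pid);
    for (const child of childrenOf.get(pid) ?? []) {
      visit(child);
    }
    order.push(pid);
  };
  visit(rootPid);
  return order;
}

const snapshot: ProcRow[] = [
  { pid: 100, ppid: 1 },   // gateway
  { pid: 200, ppid: 100 }, // MCP wrapper process
  { pid: 201, ppid: 200 }, // browser child
  { pid: 300, ppid: 1 },   // unrelated process
];

const order = killOrder(snapshot, 200); // [201, 200]
```

A tree walk like this only works while the parent is still alive. Once children have been reparented to PID 1, as in #86316, the link is gone, so the walk has to happen on shutdown, or the children need to be tracked by some other means such as a process group.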
Absolute token thresholds for compaction are causing issues when switching between models with vastly different context windows (e.g., DeepSeek's 1M vs. GLM's 200K), leading to immediate memory flushes (#87136). There is also a reported regression where the maximum context length is being used as the default output length, causing immediate context overflow errors (#85921).
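Both problems above go away if the limits are derived from the active model's context window instead of being fixed numbers. The following sketch shows the arithmetic; the 80% ratio, the one-quarter fraction, and the 8192-token cap are assumptions chosen for illustration, not OpenClaw defaults.

```typescript
// Illustrative helpers; the percent and cap values are assumptions.

// Compaction threshold as a percentage of the active model's context window.
function compactionThreshold(contextWindow: number, percent: number): number {
  return Math.floor((contextWindow * percent) / 100);
}

// Default output budget: a capped fraction of the window, never the whole window.
function defaultMaxOutput(contextWindow: number, cap: number): number {
  return Math.min(cap, Math.floor(contextWindow / 4));
}

const largeWindowThreshold = compactionThreshold(1000000, 80); // 800000
const smallWindowThreshold = compactionThreshold(200000, 80);  // 160000
const smallWindowOutput = defaultMaxOutput(200000, 8192);      // 8192
```

With a relative threshold, switching from a 1M-token model to a 200K-token model moves the compaction point from 800,000 to 160,000 tokens automatically, and the output budget can never equal the full window.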
- Align maxHoldMs with lane timeouts and implement PID-based stale lock detection to prevent session bricking (#86004, #86311).
- Stop sending anthropic/<model> IDs to api.anthropic.com (#87181).
- Clean up codex app-server children to stop orphan accumulation (#86316).
- Ensure chrome-devtools-mcp process trees are fully terminated on session close (#85721).
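For the anthropic/<model> 404s (#87181), one plausible fix is to strip the routing prefix before the request leaves the gateway. The issue does not confirm that this is the chosen approach, so the sketch below is an assumption. `toUpstreamModelId` is a hypothetical helper, and `example-model` is a placeholder, not a real model name.

```typescript
// Hypothetical helper: strips a leading "<provider>/" routing prefix when it
// matches the provider the request is being sent to.
function toUpstreamModelId(modelId: string, provider: string): string {
  const prefix = `${provider}/`;
  return modelId.startsWith(prefix) ? modelId.slice(prefix.length) : modelId;
}

const upstream = toUpstreamModelId("anthropic/example-model", "anthropic"); // "example-model"
const untouched = toUpstreamModelId("example-model", "anthropic");          // "example-model"
const other = toUpstreamModelId("openai/example-model", "anthropic");       // unchanged
```

Only a prefix that matches the target provider is removed, so IDs routed elsewhere pass through unchanged.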