Recovery in vibecoding tools: a four-level escalation ladder
RapidNative is text-to-app. A user types a prompt, our agent generates the code, a browser-based runtime executes it live, the user iterates, and eventually they ship. That entire flow is one long async chain, and every link can fail.
Before we built our recovery orchestrator, we'd watch the same story play out a dozen times a day:
- A user's wifi drops mid-build. The `npm run build` we drive through opencode hangs. The user assumes RapidNative is broken.
- The model emits a JSX file with a bad import. The browser runner catches the error, but the agent never sees it. The user clicks "fix with AI", which patches around already-broken state.
- The fix-with-AI itself goes async wrong. Two patches collide. The file ends up worse than where we started.
- The user keeps clicking. Five retries deep, the foundation is unrecognizable. Each new fix is patching scaffolding that should never have shipped.
Three failure modes interacting: flaky internet, wrong AI input, runtime errors that don't propagate back to the agent. Plus the meta-failure of recovery itself going async wrong.
The instinct is to handle each in isolation. Retry the network call. Validate AI output. Stream runtime errors back to the planner. We did all of that. It wasn't enough, because they cascade. A network blip during a fix-with-AI leaves a half-written file. A half-written file produces a runtime error the model doesn't expect. Now the next "fix" is patching damage from the previous fix.
So we built a timeline.
The timeline
The timeline is a single durable log of every meaningful event in a session: prompt sent, file written, build started, build finished, build failed, runtime error caught, fix-with-AI requested, fix-with-AI completed, page reloaded. Each event has a category, a payload, a timestamp. The timeline persists in session storage so a refresh does not lose state.
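A minimal sketch of the shape, assuming a browser context. The category names mirror the event list above; the `TIMELINE_KEY` name and exact schema are illustrative, not RapidNative's actual code:

```ts
// One entry in the durable session log. The payload carries
// event-specific details; the category drives the orchestrator.
type TimelineEvent = {
  category:
    | "prompt_sent" | "file_written"
    | "build_started" | "build_finished" | "build_failed"
    | "runtime_error" | "fix_requested" | "fix_completed"
    | "page_reloaded";
  payload: Record<string, unknown>; // e.g. { file: "App.tsx", error: "..." }
  timestamp: number;
};

const TIMELINE_KEY = "timeline"; // illustrative storage key

// Append one event and persist. sessionStorage survives a refresh
// in the same tab, so the log outlives any single page load.
function appendEvent(event: TimelineEvent): TimelineEvent[] {
  const log: TimelineEvent[] = JSON.parse(
    sessionStorage.getItem(TIMELINE_KEY) ?? "[]",
  );
  log.push(event);
  sessionStorage.setItem(TIMELINE_KEY, JSON.stringify(log));
  return log;
}
```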
The orchestrator does not read the latest error in isolation. It reads the shape of the last few minutes. "Have I tried fix-with-AI on this file twice in the last 90 seconds and am I still seeing the same runtime error?" That is a different question from "is the latest build broken?", and it leads to a different action.
The deterministic part of this is important. The orchestrator is a pure function of the timeline: given the same sequence of events, it always picks the same recovery level. The async chaos lives in the surfaces. The decision about what to do with that chaos is boring, repeatable, and easy to test.
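As an illustration of such a timeline-shape question, here is the kind of pure predicate the orchestrator can be assembled from. It reuses the `TimelineEvent` type from the sketch above; the function name and the exact failure check are ours, invented for illustration:

```ts
// Pure function of the timeline: same events in, same answer out.
// `now` is a parameter rather than Date.now() so tests can replay
// a recorded timeline and get identical decisions.
function sameErrorAfterTwoFixes(
  log: TimelineEvent[],
  file: string,
  now: number,
  windowMs = 90_000,
): boolean {
  const recent = log.filter((e) => now - e.timestamp <= windowMs);
  const fixes = recent.filter(
    (e) => e.category === "fix_completed" && e.payload.file === file,
  );
  if (fixes.length < 2) return false;
  const lastFix = fixes[fixes.length - 1];
  // Still failing: the same file errored again after the latest fix landed.
  return recent.some(
    (e) =>
      e.category === "runtime_error" &&
      e.payload.file === file &&
      e.timestamp > lastFix.timestamp,
  );
}
```

Because nothing in a function like this reads a clock or touches the network, replaying a recorded session in a unit test reproduces the exact decision the orchestrator made in production.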
The four-level escalation ladder
The orchestrator escalates through four levels, each cheaper and less disruptive than the one after it.
Level 1: file edit
Default path. Apply the fix-with-AI patch directly to the failing file. Cheap, fast, scoped. If it works the user barely notices.
Level 2: renderer reload
If the file edit succeeded but the runtime is still showing the old error, the issue is not the file. It is the iframe. Reload just the renderer. Same project, same files, fresh runtime.
Level 3: page reload
If the renderer reload does not clear it, something stickier is wrong. Stale service worker, lingering global state, the editor and runtime out of sync. Full page reload. Session storage carries the timeline through.
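Mechanically, levels 2 and 3 are small. A sketch, assuming the runtime lives in a same-origin iframe:

```ts
// Level 2: restart only the runtime. Same project, same files, fresh iframe.
// Assumes the renderer iframe is same-origin; a cross-origin frame would
// need a postMessage handshake instead.
function reloadRenderer(renderer: HTMLIFrameElement): void {
  renderer.contentWindow?.location.reload();
}

// Level 3: restart everything. sessionStorage survives a same-tab reload,
// which is what carries the timeline through.
function reloadPage(): void {
  window.location.reload();
}
```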
Level 4: start the project from scratch
If we hit two consecutive fix-with-AI failures on the first message of a project, we don't keep trying. We delete the project entirely and recreate it from our base template using the original prompt, which is in session storage. Fresh start, same input. Most of the time the second create just works.
That last level is the interesting one. The instinct in any retry system is to dig deeper. Smarter prompts. More context. Harder hints to the model. But the bug is often not in the patch, it is in the foundation. The cheapest fix at that point is a fresh project, not a fifth retry.
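Put together, the whole ladder can be expressed as one pure decision over the log. This sketch reuses `TimelineEvent` from above, assumes an extra `renderer_reloaded` event category for level 2, and treats a fix as failed when a runtime error follows its completion; the trigger conditions are simplified stand-ins for the real rules:

```ts
// The four levels, ordered cheapest to most disruptive.
type RecoveryLevel =
  | "file_edit"         // level 1
  | "renderer_reload"   // level 2
  | "page_reload"       // level 3
  | "recreate_project"; // level 4

// Index of the most recent event with this category, or -1 if absent.
// Typed as string so the assumed "renderer_reloaded" category fits.
function lastIndex(log: TimelineEvent[], category: string): number {
  for (let i = log.length - 1; i >= 0; i--) {
    if (log[i].category === category) return i;
  }
  return -1;
}

// A fix counts as failed if any runtime error follows its completion;
// a clean fix resets the streak.
function consecutiveFailedFixes(log: TimelineEvent[]): number {
  let streak = 0;
  log.forEach((e, i) => {
    if (e.category !== "fix_completed") return;
    const failed = log.slice(i + 1).some((x) => x.category === "runtime_error");
    streak = failed ? streak + 1 : 0;
  });
  return streak;
}

function pickRecoveryLevel(log: TimelineEvent[]): RecoveryLevel {
  const error = lastIndex(log, "runtime_error");
  if (error === -1) return "file_edit"; // nothing to recover from

  const fix = lastIndex(log, "fix_completed");
  const rendererReload = lastIndex(log, "renderer_reloaded");
  const prompts = log.filter((e) => e.category === "prompt_sent").length;

  // Level 4: two consecutive fix-with-AI failures on the first message.
  if (prompts <= 1 && consecutiveFailedFixes(log) >= 2) {
    return "recreate_project";
  }
  // Level 3: we already reloaded the renderer and the error came back.
  if (rendererReload > fix && error > rendererReload) return "page_reload";

  // Level 2: the fix landed, but the runtime still shows an error.
  if (fix > -1 && error > fix) return "renderer_reload";

  // Level 1: default path, patch the failing file.
  return "file_edit";
}
```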
Why this generalizes
We built it for RapidNative. The shape applies anywhere you are running untrusted user code in an async loop:
- AI coding tools, the obvious one
- Live notebooks running user code in a sandbox
- CI systems re-running flaky test suites
- Headless browsers driven by an agent
- Any system where one user input touches multiple async surfaces: model + build + runtime + preview
The pieces:
- A timeline that captures every event durably across reloads.
- An escalation ladder, ordered cheapest to most disruptive.
- A clear rule for when each level fires, based on the timeline shape, not on a single error.
If your retry logic looks at one error in isolation, you will pile fixes on broken foundations. If it looks at the timeline, you can tell when to stop patching and start over.
What we got wrong before this
For the first three months of RapidNative, we treated each failure type as its own bug. Network errors had their own retry. Build errors had their own retry. Runtime errors had their own retry. The recovery code was scattered across the codebase, and none of the layers knew about each other.
A single user session could rack up fifteen retries across four layers, each one patching state the others had silently corrupted. The user's experience was a slow march toward a broken app they couldn't escape, with a "fix with AI" button that kept making things worse.
The timeline collapsed all of that into one place. The escalation ladder gave us one decision tree. Recovery went from scattered retry logic to one orchestrator reading one log.
It is not a glamorous engineering problem. There is no model to train, no benchmark to beat. But it is the difference between a vibecoding tool that feels magical and one that feels haunted.