Why most AI agents fall apart in real work (and how to fix it)
The agent that nailed your demo is not getting dumber. It's running out of context. The fix is not a smarter model; it's setting the agent up to succeed.
You watched it in the demo. The AI agent read the file, ran the task, came back with something sharp. You handed it a real multi-day job and it lost the plot somewhere around step nine. Forgot a decision you made on Monday. Invented a number. Confidently finished the wrong thing.
The reflex is to wait for the next model. That's the wrong diagnosis.
The honest answer
It's the context, not the model.
Every agent works inside a context window: the slice of text it can actually see at one moment, measured in tokens. As a long task runs, that window fills with tool output, half-finished steps, and earlier reasoning. Old decisions scroll out. The agent stops seeing what it agreed to on step two by the time it reaches step nine, so it guesses. It never had a written brief about your business, so it guesses there too. Small guesses compound into a broken result.
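A minimal sketch of how an early decision can scroll out of view, assuming the simplest possible policy: keep only the newest messages that fit a fixed budget. The function name `visible_context`, the word-count "tokenizer", and the 120-token budget are all illustrative assumptions. Real agents use real tokenizers and a range of truncation or summarization strategies.

```python
def visible_context(messages, budget_tokens):
    """Return the newest messages that fit in the budget, in original order."""
    kept = []
    used = 0
    for text in reversed(messages):
        cost = len(text.split())  # crude stand-in for a tokenizer (assumption)
        if used + cost > budget_tokens:
            break  # everything older than this point is no longer visible
        kept.append(text)
        used += cost
    return list(reversed(kept))


# Hypothetical history: one early decision, then seven bulky tool outputs.
history = ["DECISION step 2: quote prices in EUR only"]
history += [f"step {n} tool output: " + "data " * 30 for n in range(3, 10)]

visible = visible_context(history, budget_tokens=120)
decision_visible = any(m.startswith("DECISION") for m in visible)
print(len(visible), decision_visible)  # 3 False
```

By step nine only the last three tool outputs fit in this toy window. The step-two decision is still in the history, but the agent can no longer see it.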
A bigger model does not fix that. The model was already smart enough in the demo. What broke was the information the agent could see when it mattered.
Why it's harder than it looks
The research lands in the same place from three directions.
METR found the length of task an agent can finish has been doubling roughly every seven months. Impressive, until you read the fine print: that horizon is measured at a 50% success rate. A coin flip. Models hit near-100% on tasks a human would do in under four minutes, and under 10% on tasks over four hours. The doubling is real. So is the coin flip.
Toby Ord's "half-life" framing explains the feel of it: an agent has a roughly constant chance of failing for each minute of work the task would take a human, so success drops exponentially as the job gets longer. Great in a five-minute demo. Underwater on a multi-hour job. Same model, same per-minute risk, different odds at the end.
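The arithmetic behind that framing can be sketched in a few lines. Under a constant per-minute failure rate, success halves every time the task grows by one "half-life". The 60-minute half-life below is a made-up value chosen for illustration, not a measured figure for any model.

```python
def success_probability(task_minutes: float, half_life_minutes: float) -> float:
    """Chance of finishing a task under a constant per-minute failure rate."""
    return 0.5 ** (task_minutes / half_life_minutes)


HALF_LIFE = 60.0  # hypothetical half-life in human-minutes (assumption)

for minutes in (5, 60, 240):
    p = success_probability(minutes, HALF_LIFE)
    print(f"{minutes:>3}-minute task: {p:.0%}")
# Prints 94%, 50%, and 6% respectively.
```

The model and the per-minute risk never change between the three rows. Only the length of the job does.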
A 2026 reliability study across 23,392 task episodes showed the same shape: reliability decays faster than the task lengthens. A short job that succeeds around 76% of the time slides toward 52% at the longest horizons. Failure rises faster than the work grows.
And bigger windows are not the escape hatch. Independent testing of around 18 frontier models found they all degrade as the input grows, well before they hit their maximum window, and none of them use long context evenly. As one builder put it on X: "The model is not the bottleneck anymore, context is." A million-token window is not a filing cabinet you can dump everything into and trust. Quality can drop as you fill it. This is "context rot," and it means bigger is not automatically better.
What to do this week
You don't need a better model. You need to set the agent up so the context it needs is in front of it and the context it doesn't is out of the way. Five moves, in order of payoff:
- Give it persistent memory. A fresh chat forgets everything every time. Memory carries your decisions and preferences across sessions, so the agent stops re-deriving what it already settled.
- Write it a brief about your business. A short CLAUDE.md file that says who you are, your terms, your preferences, your guardrails. With that in front of it, the agent stops guessing the things only you know.
- Narrow the job. One bounded task with a clear "done" beats a sprawling multi-day mandate. A smaller job fills less of the window and fails less.
- Keep a human checkpoint. Put yourself between the agent and anything irreversible: sending, paying, deleting, publishing. The checkpoint catches the compounded error before it ships.
- Break long tasks into smaller verifiable steps. Each step starts with a cleaner window and a result you can check, instead of one long run that quietly drifts.
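The last two moves (a human checkpoint and small verifiable steps) can be sketched as one loop. This is an illustration under stated assumptions, not any framework's API: `Step`, `run_plan`, and the `approve` callback are hypothetical names, and the lambdas stand in for real agent calls and real checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable[[], str]        # does the work and returns a result
    check: Callable[[str], bool]  # verifies the result before moving on
    irreversible: bool = False    # sending, paying, deleting, publishing


def run_plan(steps, approve):
    """Run steps in order; stop at a refused approval or a failed check."""
    done = []
    for step in steps:
        # Human checkpoint: ask before anything that cannot be undone.
        if step.irreversible and not approve(step.name):
            return done, f"stopped: {step.name} not approved"
        result = step.run()
        if not step.check(result):
            return done, f"stopped: {step.name} failed its check"
        done.append((step.name, result))
    return done, "finished"


# Hypothetical two-step plan; the reviewer declines the irreversible step.
plan = [
    Step("draft invoice", lambda: "total=120.00", lambda r: r.startswith("total=")),
    Step("send invoice", lambda: "sent", lambda r: r == "sent", irreversible=True),
]
done, status = run_plan(plan, approve=lambda name: False)
print(status)  # stopped: send invoice not approved
```

The draft is finished and checked, and nothing irreversible happens until a person says yes. A failed check stops the run at that step rather than letting the error compound.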
As another builder put it on X: "Your AI agent is not getting worse because the model is dumb. It is getting worse because the context is polluted."
Before you blame the model, do one thing. Write the agent a one-page brief about your business, give it memory, and re-run the task that fell apart. Most of the time, the same model gets it right.