Glauser Creative
Closing the loop: how AI software starts improving itself

Here’s what my workflow looks like today when I start building something new.

I open the project, drop into planning mode, and have a long back-and-forth with the AI about what we are actually trying to do and how to approach it. Once the plan is solid, I let the AI write the code. Then I run it locally, click around, and check that things behave the way I expected. I ask the AI to do a code review on its own work. There is a CI pipeline that runs tests on every push. GitHub Copilot drops a comment on the pull request with its take. Eventually I merge and ship.

This is dramatically better than how things worked even a year ago. Most of the busywork has dissolved.

But it isn’t a closed loop. There is still a human stitching it together at every step.

The bottleneck isn’t the mistakes

I keep hearing people say the reason we can’t trust AI to ship software is that it makes mistakes. I don’t think that’s the right framing.

Of course it makes mistakes. So do humans. Every developer I’ve ever worked with has shipped a bug. The difference is that humans have a decent feedback loop. We deploy something, watch the dashboards, get a Slack message from support, look at a user complaint, and adjust. The mistake gets corrected because the loop closes.

AI doesn’t have that loop. Not in any complete sense. It writes code, runs the tests it wrote, then waits for a human to come back and tell it whether the thing actually solved the problem. The verification step is missing the world.

“The only way to win is to learn faster than anyone else.”

Eric Ries, author of The Lean Startup

That quote comes from a different era of building software, but it applies more now than it ever did. The team that learns fastest wins. And right now, the slowest part of every AI-assisted workflow is the human at the end of the chain, manually reading dashboards and translating them back into prompts.

What today actually looks like

Here is roughly how my current loop looks if I draw it out.

[Figure: today's development workflow. Plan with AI → AI writes code → Test & reviews (manual + CI + AI) → Deploy → Real users. Feedback comes back manually: slow, lossy, and the loop never closes.]

How my workflow looks today. Linear flow, broken loop.

The dotted arrow looping back to the top is dotted because that path isn’t really there. It exists only because I am there. I read the analytics. I read the support tickets. I notice the pattern. I open the project and start a new prompt. The AI has no idea any of that happened until I tell it.

Some of this is fine. There are decisions that should stay in a human’s hands, especially the ones with real consequences. But the routine stuff, the “users are dropping off at this step, fix it” kind of feedback, has no business going through me. That part should just happen.

What a closed loop actually looks like

Strip the workflow down and there are really only four moves.

1. Signals in. Everything the world is telling you about whether the software is doing its job. App store reviews. Server logs. Error monitoring. Conversion funnels. NPS responses. Churn cohorts. Live chat conversations from Intercom or Crisp where users describe their frustration in their own words. And increasingly, signals from the AI testing the product itself, walking through the UX, taking screenshots, and flagging anything that looks confusing or broken. These are the eyes of the system.

2. Prioritisation toward a goal. This is where most teams already break down, even without AI. You have to pick what you are optimising for. Revenue. Retention. Time to value. Health outcomes. Carbon saved per user. Whatever. The goal is the function the system is going to maximise, and getting it right is the most important decision you will make in this whole setup.

A vague goal produces a confused product. A narrow goal optimises one number while everything else slowly dies. “Increase engagement” sounds harmless until you end up with a slot machine. “Maximise time on app” is how you build something people resent. “Drive conversion” is how you bury the user under dark patterns. The goal needs to be specific enough that the system can act on it, but rich enough that it captures what you actually care about. Sometimes that means combining several signals: a primary outcome, a couple of guardrails it must not break, and a sense of the kind of experience you want users to have.

In the closed loop, defining the goal is the human’s main job. Not writing tickets. Not reviewing PRs. Not checking dashboards. Deciding what good looks like, and then refining that definition as you learn more. The system will deliver exactly what you ask for, so you’d better be sure you’re asking for the right thing.

3. Build and self-test until it works. Not “AI writes a draft, human checks it.” The AI plans the change, writes the code, runs the test suite, then actually opens the app using computer use, clicks through the relevant flows, takes screenshots of every state, and looks at them with its own eyes to verify the change behaves correctly. It also replays real production traffic against the new version to catch the weird corner cases that synthetic tests always miss. The ones where a user did something nobody on the team thought to test. If something is broken or confusing, it fixes it and runs the loop again. It only exits when its own success criteria are actually met. This is the part everyone keeps tripping over. Today the AI almost always hands back the work too early, because the only judge is the human.

4. Ship and observe. Deploy. Then go back to step one with fresh signals.

That’s it. Four moves. Here’s what that looks like as a picture.

[Figure: optimal closed loop. Signals in (users · logs · reviews) → Prioritise (toward the goal) → Build & self-test (until it works) → Deploy → back to Signals in. The loop closes, and each cycle makes the system better.]

A complete loop. No human bottleneck in the verification step.
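
For readers who think in code, here is a minimal Python sketch of the four moves. Everything in it is hypothetical: `Signal`, `prioritise`, `build_and_self_test`, and `run_cycle` are illustrative names, the `impact` score stands in for whatever ranking the goal implies, and the `collect`, `attempt_fix`, `verify`, and `deploy` callables are placeholders for real integrations (analytics APIs, a coding agent, a test harness, CI/CD). It shows the shape of the loop, not a working system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Signal:
    source: str    # illustrative label, e.g. "analytics" or "app_store_review"
    summary: str   # what the world is telling us
    impact: float  # hypothetical score: how much this hurts the goal


def prioritise(signals: List[Signal]) -> Signal:
    """Move 2: pick the signal that matters most for the goal."""
    return max(signals, key=lambda s: s.impact)


def build_and_self_test(
    task: Signal,
    attempt_fix: Callable[[Signal, int], str],
    verify: Callable[[str], bool],
    max_attempts: int = 5,
) -> Optional[str]:
    """Move 3: keep iterating until the change passes its own checks."""
    for attempt in range(1, max_attempts + 1):
        candidate = attempt_fix(task, attempt)
        if verify(candidate):
            return candidate
    return None  # success criteria never met: escalate to a human


def run_cycle(
    collect: Callable[[], List[Signal]],
    attempt_fix: Callable[[Signal, int], str],
    verify: Callable[[str], bool],
    deploy: Callable[[str], None],
) -> bool:
    """One pass around the loop. Returns True if something shipped."""
    signals = collect()                                       # move 1: signals in
    if not signals:
        return False
    task = prioritise(signals)                                # move 2: prioritise
    change = build_and_self_test(task, attempt_fix, verify)   # move 3: build and self-test
    if change is None:
        return False
    deploy(change)                                            # move 4: ship and observe
    return True


# Usage with stand-in callables. The first attempt fails verification on purpose.
deployed: List[str] = []
shipped = run_cycle(
    collect=lambda: [
        Signal("analytics", "drop-off at checkout step 2", 0.8),
        Signal("app_store_review", "typo on settings screen", 0.1),
    ],
    attempt_fix=lambda task, attempt: f"fix for '{task.summary}' (attempt {attempt})",
    verify=lambda change: change.endswith("(attempt 2)"),
    deploy=deployed.append,
)
```

The important detail is the exit condition in `build_and_self_test`: the loop returns only when `verify` passes or the attempt budget runs out, rather than handing back the first draft.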

This isn’t a new idea. Eric Ries called it Build, Measure, Learn fifteen years ago. Toyota called it kaizen sixty years before that. The shape of the loop is ancient. What’s new is that for the first time, every single one of those four moves can be done by a machine that doesn’t get tired and doesn’t take weekends off.

A note on paperclips

The obvious worry is that you build a system that optimises perfectly for the wrong goal. The famous version of this is Nick Bostrom’s paperclip maximiser, the AI that turns the universe into paperclips because that was the objective you set.

The risk is real, but it puts the responsibility exactly where it belongs. The thing that matters most about a self-improving system is the goal you give it. “Make money” produces one kind of product. “Be the best app for managing chronic illness” produces a very different one. “Reduce a household’s energy bill by 30% without making them miserable” produces a third. Pick carefully. The system will be honest about delivering exactly what you asked for, and that is what makes it both scary and useful.

What this unlocks

I am not interested in paperclips. I am interested in what happens when you point a closed loop at a problem that actually matters.

The best app for managing your health, the kind that learns from millions of users and gets smarter every week. A marketing tool that actually moves the needle for small businesses instead of pretending to. The energy app that helps a family halve their bill without nagging them. The childcare scheduling tool that actually accounts for how parents and kids really live. The legal helper that levels the playing field for people who can’t afford a lawyer.

None of these are research problems. They are system design problems. The reason they don’t exist yet isn’t that the model isn’t smart enough. It’s that nobody has wired up the loop.

So what would it actually take?

Here is the practical answer, as of April 2026.

  • A capable model with tool use and a long enough context to hold the project state. We have those. Claude Code is one example I use daily.
  • Programmatic access to your signal sources. Analytics API. Error monitoring API. App store reviews API. Intercom or Crisp conversation export. Support ticket data. Most modern tools already expose these.
  • A test environment the AI can run against. Unit tests, integration tests, end-to-end tests that exercise real user flows, and computer use so the AI can actually open the app, click through it, and screenshot what it sees.
  • Real production data, properly anonymised, that the AI can replay against new versions. Synthetic test cases miss the weird corner cases that only show up when actual users do unexpected things. The hardest bugs are the ones nobody thought to write a test for.
  • A deployment path the AI can actually trigger. CI/CD that doesn’t require a human button press for the routine stuff.
  • A clearly defined optimisation target. The goal. Not a fuzzy mission statement, an actual measurable thing the system can act on. Ideally a primary metric paired with a few guardrails so it can’t game its way to a number that looks good but feels awful. Something like “increase the number of users who finish checkout, without raising the refund rate and without dropping the app store rating” is workable. “Make users love the product” is not. This is the human’s main job in the new world, and it deserves more thought than most teams give it.
  • Approval gates on the irreversible actions, not the routine ones. Don’t let the AI delete the production database. Do let it ship a copy fix.
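
The optimisation target from the list above can be written down as data rather than prose. The sketch below encodes the checkout example as a primary metric plus two guardrails. `Goal`, `Guardrail`, the metric names, and every number in the sample snapshots are invented for illustration. A real version would also need statistical significance and time windows, which this leaves out.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Guardrail:
    metric: str
    direction: str          # "no_increase" or "no_decrease"
    tolerance: float = 0.0  # how much drift is still acceptable

    def holds(self, before: float, after: float) -> bool:
        if self.direction == "no_increase":
            return after <= before + self.tolerance
        if self.direction == "no_decrease":
            return after >= before - self.tolerance
        raise ValueError(f"unknown direction: {self.direction}")


@dataclass(frozen=True)
class Goal:
    primary_metric: str
    guardrails: Tuple[Guardrail, ...]

    def evaluate(self, before: Dict[str, float], after: Dict[str, float]) -> str:
        # Guardrails are checked first: a win on the primary metric
        # does not count if it broke something we said must not break.
        broken = [
            g.metric
            for g in self.guardrails
            if not g.holds(before[g.metric], after[g.metric])
        ]
        if broken:
            return "reject: guardrail broken (" + ", ".join(broken) + ")"
        if after[self.primary_metric] > before[self.primary_metric]:
            return "accept"
        return "reject: no improvement"


# "Increase the number of users who finish checkout, without raising the
# refund rate and without dropping the app store rating."
goal = Goal(
    primary_metric="checkout_completions",
    guardrails=(
        Guardrail("refund_rate", "no_increase"),
        Guardrail("app_store_rating", "no_decrease"),
    ),
)

# Made-up metric snapshots, purely for illustration.
before = {"checkout_completions": 1000, "refund_rate": 0.020, "app_store_rating": 4.6}
honest = {"checkout_completions": 1080, "refund_rate": 0.020, "app_store_rating": 4.6}
gamed = {"checkout_completions": 1200, "refund_rate": 0.035, "app_store_rating": 4.6}

honest_verdict = goal.evaluate(before, honest)
gamed_verdict = goal.evaluate(before, gamed)
```

With these numbers the `gamed` snapshot shows the larger gain on the primary metric and is still rejected, because the refund rate went up. That is the point of pairing the metric with guardrails.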

That’s the entire stack. None of it is exotic. None of it is research. The hard part isn’t the model. The hard part is gluing the pieces together and trusting the loop to run.
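
Trusting the loop is easier when the gate on irreversible actions is explicit. A minimal sketch of that last item in the list, assuming each action arrives already labelled with a risk level by a human-maintained policy (the labelling itself is the hard part and is not shown). `Risk`, `Action`, `execute`, and the `ask_human` callback are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Risk(Enum):
    ROUTINE = "routine"            # e.g. a copy fix
    IRREVERSIBLE = "irreversible"  # e.g. deleting production data


@dataclass
class Action:
    description: str
    risk: Risk
    run: Callable[[], None]


def execute(action: Action, ask_human: Callable[[str], bool]) -> str:
    """Run routine actions directly; irreversible ones need explicit approval."""
    if action.risk is Risk.IRREVERSIBLE and not ask_human(action.description):
        return "blocked"
    action.run()
    return "done"


# Usage: the human in this example declines everything they are asked about.
log: List[str] = []
asked: List[str] = []


def decline(description: str) -> bool:
    asked.append(description)
    return False


copy_fix = Action("fix typo on checkout button", Risk.ROUTINE, lambda: log.append("copy fix shipped"))
drop_db = Action("delete the production database", Risk.IRREVERSIBLE, lambda: log.append("database deleted"))

copy_result = execute(copy_fix, decline)
drop_result = execute(drop_db, decline)
```

The routine action never reaches the human at all, so the gate costs nothing on the common path.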

Where I’m headed

I’m building toward this in my own work, slowly. Today I have planning mode and code review and tests, which is the left half of the picture.

I recently built a small test program that lets the AI walk through an entire app on its own, take notes about what it sees, and capture screenshots of every state. It works surprisingly well. The AI clicks around like a real user, writes down what worked and what didn’t, and leaves me a stack of screenshots to review. It’s still a standalone tool, not wired into anything else. But it’s the part of the loop where the AI starts seeing the world with its own eyes instead of waiting for me to describe it. Once that gets connected to the build step on one side and the signals layer on the other, the loop starts to close on its own.

Other people are circling the same idea. paperclipai/paperclip is one early project worth a look. It isn’t exactly the closed loop I’m describing here, but it’s clearly thinking about the same set of problems. The name alone tells you that.

The next thing I want is a real signals layer feeding back in. App store reviews and analytics events going straight into the AI’s context, not into a Notion doc I will read on Tuesday. I want the AI to wake up in the morning and know what shipped, what broke, what users said about it, and what to do next. I want to be the person setting the goal and approving the big calls, not the person manually translating dashboards into prompts.

I’ve written before about practising AI by doing rather than studying it, and about the power of quick feedback loops in design. This post is the third corner of that triangle. Practise the thing, build a fast loop, then automate the loop itself. That is the whole game from here.

The intelligence explosion isn’t waiting on a research breakthrough. It’s waiting on someone to wire up the loop. Whoever does that first is going to ship software that gets better week after week, faster than any quarterly release cycle could ever manage.

That’s what I’m working toward. Figuring it out one piece at a time.
