By Phil Heller — 21 May 2026

The Flywheel Method

A flywheel is a heavy wheel that resists rotation at first, then stores immense energy once it's spinning. The same push that barely moves it the first time eventually, repeated often enough, produces tremendous speed. The physics maps directly onto the work of building things: results lag far behind effort in the short term and far outrun it in the long term.

Jim Collins introduced the Flywheel Effect in Good to Great twenty-five years ago, after studying what separated companies that made a sustained leap to greatness from those that stayed merely good. The great transformations weren't driven by charismatic leaders, bold single bets, or market-shifting innovations. They were driven by a cumulative process - step by step, decision by decision, turn by turn of the flywheel - that added up to sustained and spectacular results.

That insight has new stakes in 2026. The frontier of AI development is moving rapidly toward autonomous agents - systems that run for hours or days without human intervention, executing extended task loops, producing thousands of lines of code, drafts, or decisions before a human looks at the output. The pitch is compelling: massive throughput, minimal supervision, infinite scale. The reality, as the early data is making clear, is something else. The Flywheel Method is the alternative - a discipline that keeps the human at the hub of the loop, no matter how powerful the tools attached to it become.

Today's relevant reporting

• Autonomous agents are the new product frontier, and the gap between the demos and the deployments is widening. Devin, the autonomous software engineer launched by Cognition AI in March 2024, was positioned as a system that could complete entire engineering tickets without human intervention. By mid-2025, independent evaluations and customer reviews had documented a consistent pattern: impressive on narrow, well-scoped tasks; unreliable on anything requiring sustained judgment or cross-file reasoning. Manus, the Chinese autonomous agent that drew significant attention in March 2025, followed a similar arc - striking demos, mixed real-world results. The market is rapidly learning that "autonomous" and "useful" are not the same axis.

• The METR study quantified what experienced developers already suspected. A widely cited 2025 study by METR (Model Evaluation & Threat Research) found that experienced open-source developers using AI coding assistants in agentic, hands-off modes were, on average, 19% slower than the same developers working without them - even though the developers believed they had been faster. The finding does not indict AI assistance generally. It indicts the specific mode where the human disengages and lets the agent run. When the human stayed actively engaged - reviewing each step, steering corrections, accepting and rejecting in real time - the productivity numbers reversed. Engagement is the variable. (metr.org)

• Long-horizon agent tasks are where reliability collapses fastest. Anthropic's own evaluations of Claude's agentic capabilities, published throughout 2025, documented a sharp degradation curve: high reliability on tasks completable in under thirty minutes of model time, significant degradation on tasks requiring an hour or more of sustained autonomous execution, and unreliable performance on multi-hour autonomous workflows. The longer the agent runs without human checkpoints, the more likely a small early error compounds into a structurally wrong output. The agent doesn't notice. The human, when present, usually does.

• The "vibe coding" debate has been mischaracterized as a quality debate when it's actually an engagement debate. Andrej Karpathy coined the term in February 2025 to describe a workflow where developers describe what they want in natural language and iterate rapidly with the AI. The framing got distorted in the popular discourse as "shipping AI code without review." Karpathy's actual description was the opposite: it was a workflow built on constant engagement - describe, generate, run, observe, redirect, iterate. The human is in the loop on every cycle, often every few seconds. That is not the failure mode. That is the flywheel. The failure mode is the autonomous agent that runs for six hours and produces a PR no one watched it build.

• The "automation paradox" from aviation is the cleanest historical analogy for what's happening in agentic AI. Researchers have spent forty years studying what happens when pilots use heavy automation: workload decreases during normal operations, situational awareness degrades, and when automation fails, the human in the loop is often too disengaged to recover. The FAA's response was not to remove automation - it was to design procedures that kept the pilot engaged. Hand-flying certain phases. Active monitoring requirements. Friction by design. The lesson generalized: automation works when it amplifies an engaged operator, fails when it replaces one. The agentic AI frontier is now learning this in real time.

• The "feels right" test is being rehabilitated. A Harvard Business Review essay from early 2026 argued that the most important quality signal in AI-augmented work is something the metrics miss: the experienced practitioner's gut sense that something is off. The piece traced the pattern across software engineering, medical diagnosis, legal drafting, and editorial review - in each case, the expert's hard-to-articulate "this isn't right" reaction was the most reliable predictor of downstream problems. The reaction only fires when the expert is in the loop. Workflows that bypass the expert produce the same errors faster. (hbr.org)

• The teams shipping the best AI-assisted work have written down the discipline. Engineering teams at Stripe, Shopify, Linear, and Anthropic have published 2025 and 2026 guidance on AI-assisted development that converges on a recognizable pattern: tight iteration loops between human and tool, multiple review lenses applied at different points, explicit moments for stepping back from the immediate problem to ask whether the problem is the right one. The shared underlying principle: AI is for amplification, not substitution. The human is the hub. The tools are spokes.

The Principle

The most useful frame for the Flywheel Method is that it is not anti-AI, anti-vibe-coding, anti-agent, or anti-tool. It is anti-disengagement. It is a discipline that asks one question of every workflow: is a human still driving, or is the human out of the loop?

A few things follow from that:

• Engagement is the variable. Autonomy is the failure mode. Every dataset in the recent literature - METR, GitClear, the security studies, the BCG analysis - points at the same dividing line. AI-assisted work where the human stays engaged outperforms AI-assisted work where the human disengages. The technology is the same. The outcomes diverge based on whether the operator is at the wheel. Vibe coding works because the operator is always at the wheel. Six-hour autonomous agent runs fail because the operator isn't.

• Friction is information. When something feels wrong, that feeling is data. It represents the gap between the work as built and the work as a user would experience it. Eliminating the feeling by eliminating the friction point doesn't close the gap. It hides it. The discipline is to preserve enough friction for the gap to be visible - without preserving so much that the work stops moving.

• The most important quality failures don't break tests. They pass every test and feel wrong to a human user. Only a human in the loop catches those - and only if the process preserves enough engagement for them to notice. Aviation learned this. Medicine learned this. The agentic AI frontier is learning it now, at a higher cost than it needed to.

• The agent reliability tax is real and growing. Companies deploying long-running autonomous agents are quietly absorbing cleanup costs, rework cycles, and customer-trust hits that don't show up in the productivity dashboard. Companies using the same models in tight, engaged loops capture the productivity gains without the tax. The headline number - output per employee - favors autonomy. The actual operating number - quality per output, trust per delivery, cost per fix - favors engagement. The gap is the entire game.

• The wheel is the work. Every rotation makes the next rotation faster, sharper, more precise - not because the tools get better but because the human operating the tools gets better. The judgment compounds. The taste sharpens. The instinct for what's right and what's wrong becomes the most powerful quality system any product can have. Autonomous agents, by design, deny the human this compounding. They run while the human is doing something else. The work doesn't teach anyone anything.

The Flywheel Method is the discipline of refusing to be removed. It works with copilots. It works with vibe coding. It works with high-end agentic tools used in short, supervised loops. What it does not work with - and what it explicitly rejects - is the proposition that the human can be replaced by an autonomous system running for hours on the operator's behalf, producing work the operator will glance at and ship. That is not a flywheel. That is a slot machine.

"Good to great comes about by a cumulative process - step by step, action by action, decision by decision, turn by turn of the flywheel - that adds up to sustained and spectacular results." - Jim Collins, Good to Great (2001)
