The Law Firm Workflow for AI

In the Law Firm Workflow, the client speaks with the Partner and then the Associate executes the final work. The selection of AI models should use a similar framework.

The intuition is obvious: save the expensive, powerful model for the hard stuff. And “hard stuff” means long, complex outputs like contracts, strategic memos, comprehensive reports. Simple chat? Quick Q&A? Use the cheaper model. It’s just conversation.

This is exactly backwards.

Shorter chats are where model quality and interpretative ability matters most. Long documents are where you need more execution and efficiency. The optimal model strategy inverts the obvious one, and the analogy that makes it click is one every law firm, consulting practice, and professional services organization has already worked out over a century of trial and error.

Every good firm knows that the Partner takes the initial meeting, because understanding the client is hard. The partner reads between the lines. Catches what the client didn’t say. Asks the question that reframes the entire engagement. Builds trust in fifteen minutes that will carry the relationship for years. Then, and only then, does the partner hand off to the Senior Associate for execution. Draft the agreement. Build the model. Write the memo. The associate is talented, capable, and bills at a fraction of the partner’s rate, working from the partner’s notes, framing, and judgment about what actually matters.

Apply the same logic to AI, and the conventional model-deployment strategy collapses.

Today’s relevant reporting

The price gap between frontier and standard models is now wide enough to make routing decisions strategic. As of mid-2026, the per-token cost differential between top-tier reasoning models (Claude Opus, GPT-5, Gemini Ultra) and their standard counterparts (Claude Sonnet, GPT-5-mini, Gemini Flash) typically runs 5x to 15x. The frontier labs have effectively built tiered offerings on the assumption that buyers will route by task, but most product teams are routing by output length rather than by cognitive demand. The mismatch is producing both worse outputs and worse economics. (anthropic.com)

The agentic AI frontier is exposing a recognizable pattern: judgment matters most at the front of the loop. Anthropic’s 2025 guidance on agent design has repeatedly emphasized that the quality of the initial reasoning (task decomposition, scope definition, plan formation) determines the quality of everything downstream. Once the plan is set, execution can be handed to less capable models without significant quality loss. The structure mirrors the senior-partner / junior-associate workflow precisely: invest premium cognition at the start, harvest standard execution after.

The “context engineering” conversation has converged on the same insight from a different angle. A widely circulated 2025 piece by Lance Martin and follow-on essays from the LangChain team argued that the highest-leverage activity in any LLM workflow is curating the context window: what gets in, what gets out, what gets summarized. The conversation history is the brief. A conversation that produced clear scope, sharp constraints, and explicit judgment calls makes downstream generation trivial. A conversation that produced confusion, noise, and missed signals makes even the best model struggle. (rlancemartin.github.io)

Long-form generation is closer to execution than to reasoning. Independent evaluations from Stanford CRFM, HELM, and academic benchmarking groups have shown that performance gaps between frontier and standard models narrow significantly on structured long-form generation tasks like drafting from clear specifications, filling templates, and producing reports with defined sections. The gaps widen on tasks requiring compressed judgment: clarification, scope-setting, recognizing what’s actually being asked, choosing what to ignore. Where the model has clear instructions, the standard tier is often indistinguishable from the premium tier. Where the model has to figure out what the instructions should be, the gap is decisive. (crfm.stanford.edu)

The “chatbot-to-specialist” anti-pattern is now a documented failure mode in enterprise AI. A 2025 BCG analysis of customer-facing AI deployments identified a consistent failure: routing customers through cheap intake bots that captured information without understanding significance, then escalating to expensive specialists who had to re-derive context the intake should have produced. The cheap model lacked the compressed knowledge to ask the right questions. It asked twenty generic ones instead. By the time the user reached someone, or something, capable of helping, the relationship was already damaged. The economics of “save senior time for senior work” had destroyed value rather than preserved it. (bcg.com)

The “compressed judgment” pattern is what frontier models actually do well. Recent interpretability research from Anthropic and others has documented that the capability gap between frontier and standard models concentrates in tasks requiring fast pattern recognition over diffuse signals, which is exactly the cognitive work a senior partner does in an intake meeting. Reading between the lines. Inferring unstated context. Recognizing which of twelve common situations this resembles. Asking the question that reframes the engagement. The premium model’s edge is not “it can write more.” It is “it can see the whole chessboard from a few moves.”

The output-length heuristic is the most common routing mistake in production AI. A 2026 informal survey of AI product teams by Sequoia Capital found that the dominant model-routing strategy was to allocate compute by output length: long outputs get the premium model, short outputs get the standard model. The strategy is intuitive, easy to justify, and almost exactly inverted from optimal. The teams achieving the best quality-cost ratios were routing by cognitive demand rather than output length: premium models on the turns that required compressed judgment, standard models on the turns that required execution against clear specifications. (sequoiacap.com)

The deeper principle scales beyond model selection. The same logic applies to human-AI collaboration: invest your highest-quality human attention at the relationship and scope-setting phase, then let the AI execute against the brief you’ve built. Teams using AI well are not using it to replace the senior judgment. They are using it to amplify the senior judgment by making execution trivial once the senior thinking is done. The framework is older than AI. AI just makes it cheaper to operationalize.

The Principle

The most useful frame for the Law Firm Workflow is that it is not really about model selection. It is about where in any workflow judgment matters most, and the recognition (old in professional services, new in AI) that judgment lives at the front of the engagement, not at the back.

A few things follow from that:

  1. Understanding is the hard part. Execution is the easy part. Once you know what you’re building, building it is relatively straightforward. The work that separates good outputs from bad ones happens in the conversation: the clarifying question, the scope refinement, the judgment call about what matters. A senior partner who understands the client makes the associate’s job trivial. A premium model that handles the conversation makes document generation trivial. The compression is upstream. Everything downstream is execution against the compression.
  2. Output length is the wrong proxy for difficulty. A 3,000-word document looks harder than a 3-turn conversation, which is why most teams throw the expensive model at the long output and the cheap model at the short one. The intuition is exactly backwards. The 3-turn conversation requires compressed judgment under ambiguity. The 3,000-word document is execution against a clear plan. The premium model’s edge shows up in the first case and barely shows up in the second.
  3. The conversation history is a brief, not a transcript. Every multi-turn AI interaction produces an artifact: the chat history that any subsequent model call will read. When a premium model handles the early turns, it produces a high-quality brief (explicit judgment calls, refined scope, clear constraints) that any competent standard-tier model can execute against. When a cheap model handles the early turns, it produces a noisy transcript that even the best downstream model will struggle with. The early investment compounds.
  4. The “chatbot-to-specialist” anti-pattern is the failure mode. Most consumer AI products and many enterprise deployments have implemented the worst possible staffing model: cheap automaton for intake, expensive specialist for output. The user talks to a system that doesn’t really understand them, gets funneled into approximate categories, and finally reaches something capable enough to help. By that point the relationship is damaged and the context is degraded. Trust is lost. Context is lost. Efficiency is lost. The final output reflects the weakest link in the chain, not the strongest.
  5. The framework is older than AI and survives every new model release. Law firms figured this out a century ago. Management consulting firms figured it out fifty years ago. The professional services industries that survived disruption did so because they understood that the senior person belongs in the room when the relationship is being formed, not waiting at the end to clean up what the intake team missed. The AI tier structure has reproduced the same dynamic at a different scale. The firms that route their best talent to the front of the engagement win. The firms that protect their best talent from the front of the engagement lose.

The deeper point is that AI has not changed the structure of good professional work. It has just reduced the cost of the associate, while keeping the value of the partner high. The strategic question for any team building with AI in 2026 is the same question every law firm managing partner has been asking for a hundred years: am I putting my best people on the right part of the work?

The Law Firm Workflow, stripped to its irreducible core: premium cognition at the front of the engagement, standard execution at the back, and the conversation history as the brief that connects them.

Subscribe to Orthogonal

Don’t miss out on the latest issues. Sign up now to get access to the library of members-only issues.
jamie@example.com
Subscribe