Scaffold: Teaching a Model to Disagree With Itself
The hard part of building a tool that turns a vague engineering problem into real options wasn't the UI or the data model. It was getting a language model to produce three genuinely different answers instead of the same answer three times.
I built Scaffold to fix a specific failure I kept watching happen on engineering teams: the meeting where someone says "we should just build X," everyone nods, and three weeks later it's clear X was never the question. The problem was never pinned down, the alternatives were never on the table, and the decision — such as it was — lives in a Slack thread nobody can find.
Scaffold is a thinking layer that sits in front of the code. You hand it a messy problem and it walks you through four stages — Frame, Clarify, Diverge, Decide — and out the other end you get 3–4 genuinely different ways to solve it, the tradeoffs of each, and a clean Markdown decision brief you can paste into a doc and point a stranger at six months later.
This post isn't a feature tour. It's a note on the one problem that turned out to be the whole project, the design decision that made the rest fall into place, and the thing I'm still not sure about.
The problem behind the product
Here's the observation that started it. Give two engineers on the same team the same problem and the same model, and they'll ship wildly different answers. The gap isn't the tool — they're both using the same one. The gap is the structure of the input. One of them framed the problem tightly and got something useful on the first try; the other typed two sentences into a chat box and got plausible-sounding mush.
So Scaffold's bet is that the leverage is in forcing structure up front, not in the model. The model is a commodity. The intake is the product.
That sounds clean. It made the first stage easy to design and the third stage extremely hard to build.
Four stages, and where it actually got difficult
Frame is five fields: problem statement, definition of done, constraints, engineering capacity, non-negotiables. That's the whole forcing function. If you can't fill in "definition of done," you don't have a problem yet, you have a feeling — and that's useful signal, not a bug. The five fields are deliberately the smallest set that makes the downstream stages honest.
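As a rough sketch of that forcing function, assuming TypeScript: the five fields come straight from the list above, but the identifier names and the `missingFields` helper are hypothetical, not Scaffold's actual code.

```typescript
// Hypothetical shape for the Frame stage. Field names are illustrative.
interface ProblemFrame {
  problemStatement: string;
  definitionOfDone: string;
  constraints: string;
  engineeringCapacity: string;
  nonNegotiables: string;
}

const FRAME_FIELDS: (keyof ProblemFrame)[] = [
  "problemStatement",
  "definitionOfDone",
  "constraints",
  "engineeringCapacity",
  "nonNegotiables",
];

// Returns the fields that are absent or whitespace-only. An empty result
// means the frame is complete enough to move on to Clarify.
function missingFields(frame: Partial<ProblemFrame>): (keyof ProblemFrame)[] {
  return FRAME_FIELDS.filter((field) => (frame[field] ?? "").trim() === "");
}
```

A draft with only a problem statement filled in would come back with four missing fields, which is the "you have a feeling, not a problem" signal made explicit.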
Clarify asks 2–4 questions back. The rule I gave the model is the entire value of the stage: every question must be decision-relevant, meaning the answer has to change which approach gets recommended. No "tell me more about your goals." A good clarifying question for "our search is slow" is "can you accept eventual consistency, or must newly-written records be immediately searchable?" because the answer decides whether async indexing is on the table at all. Vague intake produces sharp questions, which is its own kind of feedback.
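The decision-relevance rule lives in the prompt, but it can also be expressed as a mechanical check. The sketch below is one way to do that, under the assumption (mine, not the tool's) that each possible answer is tagged with the approach it favors. The `ClarifyingQuestion` shape and `isDecisionRelevant` are hypothetical names.

```typescript
// Hypothetical structure: each possible answer names the approach it favors.
interface ClarifyingQuestion {
  question: string;
  answers: { answer: string; favors: string }[];
}

// A question is decision-relevant only if its answers point to at least
// two different approaches. Otherwise no answer can change the outcome.
function isDecisionRelevant(q: ClarifyingQuestion): boolean {
  const favored = new Set(q.answers.map((a) => a.favors.trim().toLowerCase()));
  return favored.size >= 2;
}

const consistencyQuestion: ClarifyingQuestion = {
  question:
    "Can you accept eventual consistency, or must newly-written records be immediately searchable?",
  answers: [
    { answer: "Eventual consistency is fine", favors: "async indexing" },
    { answer: "Must be immediately searchable", favors: "synchronous indexing" },
  ],
};

const goalsQuestion: ClarifyingQuestion = {
  question: "Tell me more about your goals.",
  answers: [], // no answer maps to a different approach
};
```

Here `isDecisionRelevant(consistencyQuestion)` is true and `isDecisionRelevant(goalsQuestion)` is false.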
Decide takes your pick and your one-paragraph rationale and generates the brief: problem, options considered, decision, tradeoffs accepted, open questions. The artifact, not the conversation.
And then there's Diverge, stage three, which is the one that nearly didn't work.
The hard part: making a model give you genuinely different options
Ask any language model for "a few options" and it will give you the same option at three resolutions. For "our search is slow" you get: use Redis. Use Redis with a connection pool. Use Redis Cluster. Three answers, one idea. That's worse than useless in a decision tool, because it looks like a choice while quietly removing the actual choice.
The default behavior of these models is to converge. They're trained to predict the most probable continuation, and once the first option is on the page, the most probable second option is a neighbor of it. Getting real divergence means fighting the grain of the thing.
What worked was refusing to leave "different" as an abstraction and naming the axes a solution is allowed to differ on. The generation prompt forces every option to diverge on at least one of four concrete dimensions:
- Architectural pattern — sync vs. async, event-driven vs. request-response, CQRS vs. plain CRUD.
- Build vs. buy — custom code vs. managed service vs. off-the-shelf product.
- Scope of change — surgical fix vs. incremental migration vs. full replacement vs. strangler fig.
- Sequencing — big bang vs. phased vs. parallel run vs. feature-flag rollout.
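To make "diverge on at least one axis" concrete, here is a small sketch of the rule as a pairwise check. This is an illustration only: in Scaffold the rule is enforced through the prompt, and tagging each option with a position per axis (the `Positions` type below) is an assumption introduced for the example.

```typescript
type Axis = "architecturalPattern" | "buildVsBuy" | "scopeOfChange" | "sequencing";

// Hypothetical: each option is tagged with its position on every axis.
type Positions = Record<Axis, string>;

const ALL_AXES: Axis[] = ["architecturalPattern", "buildVsBuy", "scopeOfChange", "sequencing"];

// Axes on which two options take different positions.
function divergingAxes(a: Positions, b: Positions): Axis[] {
  return ALL_AXES.filter((axis) => a[axis] !== b[axis]);
}

// True when every pair of options differs on at least one axis.
function isDivergentSet(options: Positions[]): boolean {
  return options.every((a, i) =>
    options.slice(i + 1).every((b) => divergingAxes(a, b).length > 0)
  );
}

// "Redis three times": identical on every axis, so the set fails the check.
const redis: Positions = {
  architecturalPattern: "request-response",
  buildVsBuy: "custom code",
  scopeOfChange: "incremental migration",
  sequencing: "phased",
};
const redisTrio: Positions[] = [redis, { ...redis }, { ...redis }];

// Two options that differ on build-vs-buy and on scope of change.
const targetedIndexes: Positions = {
  architecturalPattern: "request-response",
  buildVsBuy: "custom code",
  scopeOfChange: "surgical fix",
  sequencing: "phased",
};
const managedSearch: Positions = {
  architecturalPattern: "request-response",
  buildVsBuy: "managed service",
  scopeOfChange: "full replacement",
  sequencing: "phased",
};
```

`isDivergentSet(redisTrio)` returns false, while `isDivergentSet([targetedIndexes, managedSearch])` returns true. The axis positions assigned to each option here are illustrative guesses.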
Then — and this mattered more than I expected — I gave it the bad answer explicitly. The prompt contains an anti-example: "Use Redis / Redis with a pool / Redis Cluster — these are the same approach at different scales." Telling the model what failure looks like did more to prevent it than any amount of telling it to "be creative." Naming the trap is more effective than describing the goal.
The good set, by contrast, is laid out for the same problem: add Elasticsearch alongside Postgres (new layer, build-vs-buy), rewrite queries and add targeted indexes (surgical, no new infra), move to managed search like Algolia (full buy), materialize read models via event sourcing (pattern change). Four answers, four genuinely different bets, each landing in a different place on cost, risk, and reversibility.
Two few-shot examples — a slow deploy pipeline and adding multi-tenancy to a single-tenant SaaS — anchor the format. The examples aren't there to teach the model what a pipeline is. They're there to teach it the spread: how far apart two acceptable options should sit.
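Put together, the generation prompt has three ingredients: the named axes, the anti-example, and the few-shot examples. The sketch below shows one plausible way to assemble them. The wording of the instructions, the paraphrased anti-example, and the `FewShot` shape are all assumptions. The real prompt text is not reproduced here.

```typescript
// The four axes, as listed above.
const AXIS_DESCRIPTIONS: string[] = [
  "Architectural pattern: sync vs. async, event-driven vs. request-response, CQRS vs. plain CRUD",
  "Build vs. buy: custom code vs. managed service vs. off-the-shelf product",
  "Scope of change: surgical fix vs. incremental migration vs. full replacement vs. strangler fig",
  "Sequencing: big bang vs. phased vs. parallel run vs. feature-flag rollout",
];

// Paraphrase of the anti-example; the actual prompt wording may differ.
const ANTI_EXAMPLE =
  "Use Redis / Redis with a pool / Redis Cluster. These are the same approach at different scales.";

// Hypothetical few-shot shape: a problem plus the titles of a well-spread option set.
interface FewShot {
  problem: string;
  optionTitles: string[];
}

function buildDivergePrompt(problem: string, fewShots: FewShot[]): string {
  const lines: string[] = [
    "Propose 3-4 options for the problem below.",
    "Every option must differ from the others on at least one of these axes:",
  ];
  for (const axis of AXIS_DESCRIPTIONS) lines.push(`- ${axis}`);
  // Name the failure mode explicitly instead of only describing the goal.
  lines.push("", `Bad answer, do not produce this: ${ANTI_EXAMPLE}`, "");
  // Few-shot examples teach the spread between acceptable options.
  for (const shot of fewShots) {
    lines.push(`Example problem: ${shot.problem}`);
    for (const title of shot.optionTitles) lines.push(`- ${title}`);
    lines.push("");
  }
  lines.push(`Problem: ${problem}`);
  return lines.join("\n");
}
```

The few-shot contents are left as a parameter on purpose: this post names the two example problems but not their option sets, so the sketch does not invent them.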
Structured output is doing quiet load-bearing work
Every model call in Scaffold uses strict JSON-schema structured output, not free-text-then-parse. Each solution comes back as a typed object: title, core approach, a tradeoffs array, complexity signals (systems touched, migration risk as a low | medium | high enum, auth/data implications), and a "best fit when" clause.
This isn't only about not writing a brittle parser, though it is also that. The schema is a second forcing function aimed at the model instead of the user. By requiring migrationRisk as an enum and systemsTouched as a concrete list, the structure makes it expensive for the model to be vague. It can't hand-wave "this introduces some complexity" when the schema demands the actual systems it would touch. The shape of the output constrains the quality of the thinking, the same way the five intake fields do for the human. Same trick, both ends of the pipe.
The instructions lean on this hard. Tradeoffs must be specific — "adds a new Kafka cluster to operate and monitor," not "introduces operational complexity." bestFitWhen must be "when the team has fewer than two engineers and needs to ship in two weeks," not "when you want simplicity." Vague is the enemy at every stage, and the schema is where I get to enforce it mechanically.
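For a sense of what that typed object might look like, here is a sketch of the solution schema as a plain JSON Schema object in TypeScript. Only `migrationRisk`, `systemsTouched`, and `bestFitWhen` are names taken from the text above. The other property names (`coreApproach`, `complexitySignals`, `authDataImplications`) are assumptions, and the sketch is deliberately provider-agnostic.

```typescript
// Sketch of a strict schema for one solution. Property names other than
// migrationRisk, systemsTouched and bestFitWhen are hypothetical.
const solutionSchema = {
  type: "object",
  additionalProperties: false,
  required: ["title", "coreApproach", "tradeoffs", "complexitySignals", "bestFitWhen"],
  properties: {
    title: { type: "string" },
    coreApproach: { type: "string" },
    tradeoffs: { type: "array", items: { type: "string" } },
    complexitySignals: {
      type: "object",
      additionalProperties: false,
      required: ["systemsTouched", "migrationRisk", "authDataImplications"],
      properties: {
        // A concrete list, so "some complexity" is not an acceptable answer.
        systemsTouched: { type: "array", items: { type: "string" } },
        migrationRisk: { type: "string", enum: ["low", "medium", "high"] },
        authDataImplications: { type: "string" },
      },
    },
    bestFitWhen: { type: "string" },
  },
} as const;

// Derive the enum type from the schema so the two cannot drift apart.
const RISK_LEVELS = solutionSchema.properties.complexitySignals.properties.migrationRisk.enum;
type MigrationRisk = (typeof RISK_LEVELS)[number];

// Runtime guard for values coming back from a model call.
function isMigrationRisk(value: unknown): value is MigrationRisk {
  return typeof value === "string" && (RISK_LEVELS as readonly string[]).indexOf(value) !== -1;
}
```

A schema can require that `systemsTouched` is a list and `migrationRisk` is one of three values, but it cannot judge whether a tradeoff string is specific. That part still rests on the instructions.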
The data model is the spec
The Postgres schema (Drizzle over Supabase) mirrors the stages exactly: projects own problem_inputs, clarifications, solutions, and decisions. Solutions carry a version column so a regenerate is a new version, not a destructive overwrite — you can ask for a fresh set of options without losing the set you didn't like, and the regeneration prompt is fed the previous titles explicitly so the new batch is told, in writing, not to repeat them. Divergence enforced across calls, not just within one.
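The versioning logic can be sketched without a database. The example below uses an in-memory array in place of the solutions table, so the row shape, function names, and prompt wording are all illustrative assumptions rather than the actual Drizzle schema.

```typescript
// Hypothetical, simplified stand-in for a row in the solutions table.
interface StoredSolution {
  projectId: string;
  version: number;
  title: string;
}

// A regenerate writes a new version: highest existing version + 1, or 1 if none.
function nextVersion(rows: StoredSolution[], projectId: string): number {
  const versions = rows.filter((r) => r.projectId === projectId).map((r) => r.version);
  return versions.length === 0 ? 1 : Math.max(...versions) + 1;
}

// Titles from every earlier version of this project, deduplicated.
function previousTitles(rows: StoredSolution[], projectId: string): string[] {
  const titles = rows.filter((r) => r.projectId === projectId).map((r) => r.title);
  return Array.from(new Set(titles));
}

// Text appended to the regeneration prompt so the new batch is told,
// in writing, not to repeat earlier options.
function regenerationInstruction(titles: string[]): string {
  return (
    "Do not repeat or lightly vary any of these earlier options:\n" +
    titles.map((t) => `- ${t}`).join("\n")
  );
}
```

Because old rows are never overwritten, the set you rejected stays queryable by its version number.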
The decision brief gets stored as rendered Markdown on the decisions row. That was a deliberate call: the brief is the durable artifact, the thing that outlives the session, so it's persisted as the exact text you'll read rather than re-rendered from parts each time. The decision is a record, and records shouldn't shift under you.
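A minimal sketch of "render once, store the text", assuming the five brief sections described earlier (problem, options considered, decision, tradeoffs accepted, open questions). The Markdown headings, type names, and the `briefMarkdown` column name are hypothetical.

```typescript
interface DecisionBrief {
  problem: string;
  optionsConsidered: string[];
  decision: string;
  tradeoffsAccepted: string[];
  openQuestions: string[];
}

function bulletList(items: string[]): string {
  return items.map((item) => `- ${item}`).join("\n");
}

// Renders the brief to Markdown, one section per heading.
function renderBrief(brief: DecisionBrief): string {
  return [
    "# Decision brief",
    "## Problem",
    brief.problem,
    "## Options considered",
    bulletList(brief.optionsConsidered),
    "## Decision",
    brief.decision,
    "## Tradeoffs accepted",
    bulletList(brief.tradeoffsAccepted),
    "## Open questions",
    bulletList(brief.openQuestions),
  ].join("\n\n");
}

// Hypothetical row shape: the rendered text is what gets persisted, so
// later changes to renderBrief cannot alter briefs that already exist.
interface DecisionRow {
  projectId: string;
  briefMarkdown: string;
}

function toDecisionRow(projectId: string, brief: DecisionBrief): DecisionRow {
  return { projectId, briefMarkdown: renderBrief(brief) };
}
```

The tradeoff is the usual one for stored renderings: a template improvement does not reach old briefs, which is the desired behavior for a record.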
Ownership is enforced in the application layer rather than via a foreign key into Supabase's auth.users, because Drizzle's introspection only covers the public schema, auth.users sits in Supabase's separate auth schema, and I didn't want to fight that boundary. It's the kind of small, unglamorous decision that you only document so the next person doesn't "fix" it.
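An application-layer ownership check can be as small as the sketch below. It uses an in-memory array in place of a query, and the `ProjectRow` shape and `findOwnedProject` name are hypothetical.

```typescript
// Hypothetical row shape: ownerId holds the auth user's id as a plain
// column, with no foreign key into the auth schema.
interface ProjectRow {
  id: string;
  ownerId: string;
}

// Returns the project only if it exists AND belongs to the caller.
// Not-found and not-owned both yield undefined, so a caller cannot
// probe for the existence of other users' projects.
function findOwnedProject(
  projects: ProjectRow[],
  projectId: string,
  userId: string
): ProjectRow | undefined {
  return projects.filter((p) => p.id === projectId && p.ownerId === userId)[0];
}
```

The cost of this choice is that nothing in the database stops an orphaned `ownerId`, so every read and write path has to go through a check like this one.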
What I cut
- A model abstraction layer. Scaffold talks to one provider through a thin client. It's tempting to build a swappable multi-provider abstraction on day one; it's also building for a requirement I don't have yet. The client is a few lines behind a lazy proxy, easy to replace when there's an actual reason to.
- Streaming the options in. Watching the four options type themselves out is a nice demo. It's also latency theater for a tool people use a handful of times, where the brief matters more than the animation. The options arrive as a set, which is also more honest — you're meant to compare them, not read them in order.
- Open-ended chat. Scaffold is on rails by design. The whole premise is that unstructured input is the disease; adding a "just chat with it" escape hatch would reintroduce exactly the thing the four stages exist to prevent.
- Heavyweight rate limiting. Open beta needed a spend ceiling, not an infrastructure project. Two constants (five projects per user, three regenerations per project) bound worst-case cost. It's a beta limit, not an architecture, and the code labels it as a beta limit so nobody mistakes it for a design.
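The spend ceiling in the last item is simple enough to sketch in full. The two limits are the ones stated above. The constant and function names are hypothetical, and the worst-case figure counts only option-generation calls, not the Clarify or Decide calls.

```typescript
// BETA LIMIT: a spend ceiling, not a rate-limiting architecture.
const MAX_PROJECTS_PER_USER = 5;
const MAX_REGENERATIONS_PER_PROJECT = 3;

function canCreateProject(existingProjectCount: number): boolean {
  return existingProjectCount < MAX_PROJECTS_PER_USER;
}

function canRegenerate(regenerationsUsed: number): boolean {
  return regenerationsUsed < MAX_REGENERATIONS_PER_PROJECT;
}

// Upper bound on option-generation calls for one user: each project gets
// one initial generation plus the allowed regenerations.
function maxGenerationCallsPerUser(): number {
  return MAX_PROJECTS_PER_USER * (1 + MAX_REGENERATIONS_PER_PROJECT);
}
```

With these values, one user can trigger at most 5 × (1 + 3) = 20 option-generation calls, which is the kind of bound that makes a beta budget easy to reason about.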
What I learned
Naming the failure mode beats describing the goal. The single biggest quality jump in the diverge stage came from showing the model the bad answer, not from asking harder for a good one. "Don't give me Redis three times" did more than any adjective.
Structure is the product, in both directions. I started this thinking the forcing function was for the user — five fields, no skipping. It turned out the schema does the same job to the model. The intake constrains the human's vagueness; the JSON schema constrains the model's. Same idea pointed at both ends of the pipe, and that symmetry is most of why the output is usable.
The artifact is the point, not the conversation. Everything upstream — the framing, the questions, the options — exists to produce one thing: a brief that's useful to someone who wasn't in the room. Designing backward from that artifact is what kept the tool from drifting into being another chat box with extra steps.
The question I'm still sitting with
Does forced divergence ever cost you the obvious right answer?
Sometimes the honest output is "there's one sensible approach here and the other three are strawmen." Scaffold is built to always produce a spread, and a spread implies the options are comparably viable. For a genuinely one-answer problem, manufacturing three alternatives to sit next to it could make a clear call look like a close one — divergence as a feature actively working against clarity.
My current bet is that the failure is rare and the cure is worse: teams reach for premature certainty far more often than they suffer from too many options, so erring toward divergence is the right default. But I don't have enough real usage yet to know where that line actually sits. That's the thing I'll be watching for as people put their own problems through it — the case where the most useful brief Scaffold could write is "stop overthinking this, the first option is correct," and whether the tool is honest enough to say so.