JP Ferreira · 13 May 2026 · 9 min read

AI assists, humans decide: building review workflows that don't rot

Notes from rolling out AI-assisted PR review - what worked, what didn't, and why the only sustainable principle is that the human stays accountable.

AI
Engineering practice
Code review

// 01

The problem nobody named out loud

Most engineering teams I've watched adopt AI tools have done it the same way: not as a decision, but as a drift. One engineer starts pasting diffs into Claude or ChatGPT. Then two. Then five. Then a tech lead notices the team has informally adopted three different AI workflows, none of them documented, and the code review channel has started filling with suggestions that look plausible but don't actually understand the codebase.

This is the part nobody names out loud: the problem isn't that engineers are using AI. The problem is that they're using it inconsistently, without shared context, and without anyone having decided what good looks like. The output starts to feel uniform - the AI accent - and reviewers stop being able to tell whether a comment came from thought or from a model.

The instinct in some teams is to crack down. Block tools, write a policy, ban paste-ins. That doesn't work. Engineers will use AI whether you sanction it or not, because it makes parts of their job genuinely faster. The job isn't to stop adoption. It's to shape it into something the team can trust.

// 02

The principle that has to come first

Before any tooling decision, the team needs to agree on one sentence: AI assists, humans decide.

This sounds obvious. It is obvious. But almost nothing about how teams actually use AI reflects it. When an engineer ships a PR with comments they didn't write, generated by a model they didn't tune, against context the model didn't really see - they have outsourced judgement, not assistance. The accountability has quietly slipped.

The principle has to be load-bearing, not decorative. Every workflow we built had to pass one test: does the human still own the decision at the end of it? If the answer was no, we cut the workflow. If the answer was yes, we kept building.

This is what most AI governance discussions miss. They talk about prompts, model choice, data residency, cost. Those are real concerns. But none of them matter if the team has stopped thinking. The principle isn't a slogan. It's the criterion you measure every workflow against.

// 03

What actually worked

A few specific patterns held up over months of real use.

First: shared engineering context files. Every repository got a structured Claude-readable file describing its architecture, conventions, and the non-obvious decisions that aren't visible from the code alone. When a reviewer or contributor invoked AI assistance, the model had something to work with beyond the diff. The quality of suggestions improved sharply - not because the model got better, but because the context did.
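
To make the context-file idea concrete, here is a minimal sketch of how a review prompt could be assembled from such a file plus the diff. The file name ENGINEERING_CONTEXT.md, the function build_review_prompt, and the prompt wording are all assumptions made for illustration, not the actual tooling described here. The one design choice worth noting is that the sketch refuses to build a prompt when the context file is missing, rather than silently falling back to a bare diff.

```python
from pathlib import Path

# Hypothetical file name; any agreed-upon, per-repository name would do.
CONTEXT_FILENAME = "ENGINEERING_CONTEXT.md"


def build_review_prompt(repo_root: Path, diff: str) -> str:
    """Combine the repository's context file with a diff into one prompt.

    Raises FileNotFoundError instead of reviewing without context.
    """
    context_path = repo_root / CONTEXT_FILENAME
    if not context_path.is_file():
        raise FileNotFoundError(
            f"{context_path} is missing; refusing to review without repository context"
        )
    context = context_path.read_text(encoding="utf-8").strip()
    return (
        "Repository context (architecture, conventions, deliberate decisions):\n"
        f"{context}\n\n"
        "Diff under review:\n"
        f"{diff}\n\n"
        "Do not suggest changes that contradict the context above."
    )
```

The returned string would then be handed to whichever model client the team uses; that call is deliberately left out, since it depends on the provider.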

Second: AI-assisted PR review that posted suggested comments locally first, where the reviewer could accept, edit, or discard them before any of it became public. The reviewer was always the one shipping the comment. The AI never spoke directly to the author. This sounds like a small distinction. It is the entire distinction.
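
The accept, edit, or discard step can be sketched as a small data model. This is an illustration under assumptions, not the real implementation: DraftComment, Decision, and publishable are hypothetical names. What it shows is the invariant that matters: a draft the reviewer has not touched never becomes publishable, and everything that does get published is attributed to the reviewer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Decision(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    EDITED = "edited"
    DISCARDED = "discarded"


@dataclass
class DraftComment:
    """A model-proposed comment that exists only on the reviewer's machine."""
    path: str
    line: int
    body: str  # text proposed by the model
    decision: Decision = Decision.PENDING
    final_body: Optional[str] = None  # text the reviewer chose to ship

    def accept(self) -> None:
        self.decision = Decision.ACCEPTED
        self.final_body = self.body

    def edit(self, new_body: str) -> None:
        self.decision = Decision.EDITED
        self.final_body = new_body

    def discard(self) -> None:
        self.decision = Decision.DISCARDED
        self.final_body = None


def publishable(drafts: List[DraftComment], reviewer: str) -> List[dict]:
    """Return only comments a human explicitly accepted or edited.

    Pending and discarded drafts are dropped, so silence never publishes.
    """
    return [
        {"author": reviewer, "path": d.path, "line": d.line, "body": d.final_body}
        for d in drafts
        if d.decision in (Decision.ACCEPTED, Decision.EDITED)
    ]
```

Defaulting every draft to PENDING is the point of the sketch: the model's output needs a human action to go anywhere.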

Third: explicit guardrails against what we called AI slop - the long, plausible-sounding, low-content suggestions that pad reviews without adding signal. The model was instructed to be specific or silent. Vague observations were filtered out before the reviewer ever saw them.
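
A "specific or silent" filter could look something like the sketch below. The heuristics are assumptions chosen for illustration (a word limit, a short list of vague phrases, and a requirement that the comment point at a backticked identifier or a file:line reference); they are not the rules the team actually used, and a real filter would need tuning against real review comments.

```python
import re
from typing import List

# Hypothetical phrase list; a stand-in for whatever a team considers filler.
VAGUE_PHRASES = (
    "consider refactoring",
    "could be improved",
    "might want to",
    "best practices",
    "looks good overall",
)

# A backticked identifier or a file:line reference counts as a concrete anchor.
ANCHOR = re.compile(r"`[^`]+`|\b[\w./-]+\.\w+:\d+\b")


def is_specific(comment: str, max_words: int = 80) -> bool:
    """Heuristic: short, anchored to something concrete, free of filler phrases."""
    text = comment.lower()
    if len(text.split()) > max_words:
        return False
    if not ANCHOR.search(comment):
        return False
    return not any(phrase in text for phrase in VAGUE_PHRASES)


def filter_slop(comments: List[str]) -> List[str]:
    """Drop comments that fail the specificity heuristic before a human sees them."""
    return [c for c in comments if is_specific(c)]
```

A filter like this errs toward silence: it will sometimes drop a useful comment that happens to be phrased loosely, which is the trade-off the "specific or silent" rule accepts.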

Fourth: cross-repository context awareness, so engineers working in one service could surface conventions or patterns from related services without manually copy-pasting. This sounds like a small productivity win. It was actually the thing that made the workflow feel like a team practice rather than a private trick.

// 04

What didn't work, and why

Plenty of things failed before the workflow settled.

Fully automated review comments - where the AI posted directly into the PR - produced exactly the dynamic the principle was designed to prevent. Authors started replying to the AI as if it were a teammate. Reviewers stopped reading the diff because the AI had already covered it. Within two weeks, the quality of human review measurably dropped. We pulled it.

Generic model invocations - "review this PR" with no engineering context - produced suggestions that were technically reasonable and contextually useless. The model didn't know which patterns the codebase had explicitly chosen against. It would suggest extracting a helper that the team had deliberately inlined three months earlier for performance reasons. The suggestion looked clean. It was wrong. Reviewers had to do the work of catching it, which is exactly the work we were trying to assist with.

Letting every engineer build their own AI workflow - well-intentioned, autonomy-respecting - produced fragmentation that took longer to undo than it would have taken to standardise from the start. Autonomy is good. Five different unreviewed AI workflows running against the same codebase is not autonomy. It's drift.

// 05

The hardest part isn't technical

If I had to name the single hardest part of this work, it wasn't choosing models, designing prompts, or building tooling. It was cultural.

Engineers want AI to do more than the principle allows. Reviewers want to ship faster. Authors want fewer review cycles. There is a constant, low-grade pressure to let the model decide one more thing - to skip the human read, to accept the suggested comment as written, to trust the output because it sounds confident. Holding the line on "AI assists, humans decide" means saying no, sometimes, to workflows that would be locally faster but globally worse.

That pressure doesn't go away. The job of whoever owns the workflow is to keep holding the line, to keep noticing where the principle is quietly being relaxed, and to keep restoring it. Governance isn't a document. It's a practice.

// 06

What I took away

Three things, beyond the specific workflow.

One: the value of AI tooling in engineering is not in the model. It's in the surrounding system - the context files, the guardrails, the workflows, the cultural agreements about who decides. Teams that focus on the model alone end up with worse outcomes than teams that focus on the system around it.

Two: the principle has to be load-bearing. "AI assists, humans decide" can be a sentence you say in a meeting and forget, or it can be the criterion you measure every workflow against. The difference between those two outcomes is whether anyone is willing to delete a workflow that violates it.

Three: this work is more interesting than it looks. Building review workflows that don't rot is half engineering, half team design, and half cultural maintenance - yes, that's three halves, and that's the point. It's not a clean technical problem. It's the kind of problem senior engineers should be spending more time on, and writing about more honestly, than they currently do.

I later built the same principle into ClauseQ on purpose, as a working demonstration outside Domain - the AI-integration PRs linked below show AI proposing, and me deciding what actually merges.