The PR that taught me to build this assistant was a 400-line Riverpod refactor. The two human reviewers, including me, missed a ref.watch inside a callback that should have been ref.read. The bug shipped, caused unexpected rebuilds on a busy screen, and we found it three days later from a frame-time regression in our analytics. A reviewer with infinite patience and a checklist for every Dart and Flutter footgun would have caught it. We do not have one of those. We hired one.
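For readers who have not hit this particular footgun, the snippet below is a minimal reconstruction of that class of bug, not the code from that PR; the provider and widget names are invented. Inside a callback, watch subscribes the widget to the provider, so every cart change rebuilds the button, when a one-shot read was the intent.

```dart
// Minimal reconstruction of the bug class; names are invented for illustration.
import 'package:flutter/material.dart';
import 'package:flutter_riverpod/flutter_riverpod.dart';

// Stand-in provider: the cart total changes often while the user shops.
final cartTotalProvider = StateProvider<double>((ref) => 0);

class CheckoutButton extends ConsumerWidget {
  const CheckoutButton({super.key});

  @override
  Widget build(BuildContext context, WidgetRef ref) {
    return ElevatedButton(
      onPressed: () {
        // Bug: ref.watch inside a callback subscribes this widget to the
        // provider, so every cart change rebuilds it. ref.read was intended.
        final total = ref.watch(cartTotalProvider);
        debugPrint('placing order for \$$total');
      },
      child: const Text('Place order'),
    );
  }
}
```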
This is the build log for that hire — an AI-powered code review assistant that runs on every PR, focuses specifically on Dart and Flutter mistakes, and gets useful enough that engineers stop ignoring it. The post is about the dev workflow, not the app. I will share the architecture, the actual prompts that earned their keep, and the parts that did not work.
Context: what AI code review is good and bad at
AI code review in early 2026 is good at:
- Style and idiom consistency within a known framework.
- Spotting deprecated API usage when you tell it about the deprecations.
- Catching common mistakes the team has agreed are mistakes.
- Suggesting refactors a junior engineer would not propose.
- Generating a first-pass summary of what changed and why.
It is bad at:
- Architectural judgment that depends on context outside the diff.
- Performance reasoning that needs profile data.
- Anything that requires running the code.
- Anything where the right answer is "it depends, ask the original author."
The mistake teams make is asking it to do everything. It is a checklist runner that is also good at suggesting better code. It is not a senior engineer. Treat it as a tireless junior with a really good memory and you will get value.
Architecture, two ways
There are two reasonable shapes:
- CLI / pre-commit. The developer runs the assistant locally before pushing. Fast feedback, no PR noise.
- CI / PR comment. A bot runs the assistant when a PR opens or updates. Visible to the team, slower feedback, more politics.
Most teams need both. The pre-commit path catches issues fast; the CI path holds the line on what merges.
Caption: pre-commit catches issues locally, before they become PR noise; CI enforces the same checks for the team. Both paths share the same edge proxy, so cost controls live in one place.
The CLI
A small Dart program that takes a diff, slices it into hunks, and asks the LLM about each hunk with a focused prompt. Writing the tool in Dart keeps it Flutter-friendly: every Flutter dev already has the Dart SDK installed, so there is nothing extra to set up.
The model is queried via your edge proxy (same one your app uses, see Integrating an LLM into a Flutter app). One endpoint, one set of cost controls.
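A minimal sketch of that loop follows. The proxy URL, the AI_REVIEW_TOKEN variable, the riverpod_state_mistakes prompt id, and the findings response shape are all assumptions standing in for your own proxy's contract.

```dart
// Sketch of the pre-commit CLI: read a diff, split it into hunks, ask the
// edge proxy about each one. File headers are dropped for brevity, so a real
// tool would also track which file each hunk belongs to.
import 'dart:convert';
import 'dart:io';

import 'package:http/http.dart' as http;

Future<void> main() async {
  // Usage: git diff --cached | dart run ai_review
  final diff = await stdin.transform(utf8.decoder).join();

  // Slice the unified diff into hunks at each @@ header.
  final hunks = <StringBuffer>[];
  for (final line in const LineSplitter().convert(diff)) {
    if (line.startsWith('@@')) hunks.add(StringBuffer());
    if (hunks.isNotEmpty) hunks.last.writeln(line);
  }

  final token = Platform.environment['AI_REVIEW_TOKEN']; // edge-proxy token
  for (final hunk in hunks) {
    final response = await http.post(
      Uri.parse('https://llm-proxy.example.com/v1/review'), // your edge proxy
      headers: {
        'Authorization': 'Bearer $token',
        'Content-Type': 'application/json',
      },
      body: jsonEncode({
        'prompt': 'riverpod_state_mistakes', // which checklist prompt to run
        'hunk': hunk.toString(),
      }),
    );
    for (final finding in jsonDecode(response.body)['findings'] as List) {
      stdout.writeln('${finding['severity']}: ${finding['message']}');
    }
  }
}
```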
The prompts that worked
Three prompts earn their keep. They are in priority order: I run them sequentially and stop at the first that finds something the developer should look at.
Prompt 1: Riverpod and Bloc state-mistake prompt
This prompt alone catches the bug that prompted the project. The discipline is to keep the rules narrow and explicit. A vague "review this code for problems" produces vague results.
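Our production wording is tuned to our codebase, so treat the version below as an illustrative sketch: the rule list is abbreviated and the {{hunk}} placeholder is an assumption about how the CLI templates its prompts.

```dart
// Illustrative shape only; keep the rules narrow and explicit for your team.
const riverpodStatePrompt = '''
You are reviewing one diff hunk from a Flutter app that uses Riverpod and Bloc.
Report ONLY violations of the rules below, or reply with the single word NONE.

Rules:
1. ref.watch must not be called inside callbacks (onPressed, onTap, listeners);
   use ref.read there.
2. ref.read must not be used in build() for values the widget displays;
   use ref.watch there.
3. Bloc/Cubit state must not be mutated in place; emit a new state object.
4. Do not call emit() after an await without checking isClosed first.

For each finding give: the rule number, the offending line, a one-sentence fix.
Do not comment on style, naming, or anything outside these rules.

Hunk:
{{hunk}}
''';
```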
Prompt 2: deprecated and footgun API prompt
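Same shape, different checklist. The migrations below are real Flutter deprecations, but the list itself lives in the repo and is maintained by hand whenever a Flutter upgrade lands; this is a sketch, not the full list.

```dart
// Illustrative shape only; the real deprecation list is maintained by hand.
const deprecatedApiPrompt = '''
You are reviewing one diff hunk from a Flutter app on a current stable Flutter.
Flag ONLY added lines that use the APIs below, or reply with the single word NONE.

- WillPopScope                      -> PopScope
- ThemeData.accentColor             -> ColorScheme.secondary
- RaisedButton / FlatButton         -> ElevatedButton / TextButton
- Scaffold.of(context).showSnackBar -> ScaffoldMessenger.of(context).showSnackBar

For each finding give: the offending line and the replacement API.

Hunk:
{{hunk}}
''';
```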
Prompt 3: refactor suggestion prompt (lower priority)
The third prompt is the one most teams over-rely on and it is the one that produces the most noise. Keep it lower priority and label its output clearly as suggestions.
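Again a sketch; the parts that matter are the hard cap on output and the explicit Optional label.

```dart
// Illustrative shape only; the output cap is what keeps this prompt tolerable.
const refactorPrompt = '''
You are reviewing one diff hunk from a Flutter app.
Suggest AT MOST ONE refactor, and only if it is clearly better. Otherwise reply NONE.

Prefer: extracting const constructors, replacing local setState with an existing
provider, or splitting a build() method longer than roughly 80 lines into
smaller widgets.

Start the suggestion with "Optional:" and keep it under three sentences.

Hunk:
{{hunk}}
''';
```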
The CI integration
A GitHub Actions job that runs dart run ai_review and posts comments via the GitHub API. The job posts inline review comments rather than a single bottom-of-PR comment, because inline comments are 10x more likely to be acted on.
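The posting half looks roughly like this in Dart. It is a sketch under assumptions: GITHUB_TOKEN and GITHUB_REPOSITORY are the standard Actions variables, while AI_REVIEW_TOKEN (the edge-proxy token the review step uses), PR_NUMBER, and HEAD_SHA are values the workflow passes in; the finding fields are whatever your CLI emits.

```dart
// Sketch of posting one inline review comment; not the full CI entrypoint.
import 'dart:convert';
import 'dart:io';

import 'package:http/http.dart' as http;

final env = Platform.environment;

Future<void> postInlineComment(String path, int line, String message) async {
  final uri = Uri.parse(
      'https://api.github.com/repos/${env['GITHUB_REPOSITORY']}'
      '/pulls/${env['PR_NUMBER']}/comments');
  final response = await http.post(
    uri,
    headers: {
      'Authorization': 'Bearer ${env['GITHUB_TOKEN']}',
      'Accept': 'application/vnd.github+json',
    },
    body: jsonEncode({
      'body': message,
      'commit_id': env['HEAD_SHA'], // head commit of the PR
      'path': path,                 // file the finding refers to
      'line': line,                 // line number in the new version of the file
      'side': 'RIGHT',
    }),
  );
  if (response.statusCode != 201) {
    stderr.writeln('Failed to post inline comment: ${response.body}');
  }
}
```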
The token is for your edge proxy, not the LLM provider directly. Same pattern as the mobile app.
The economics
A typical PR in our codebase has 200-400 added lines. Three prompts at 8000 tokens of input each plus 1000 tokens of output each cost less than a penny on a small model. We capped each PR at a dollar of LLM spend at the edge and never came close. The cost is not the issue at this scale.
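The back-of-envelope version, with per-million-token prices that are assumptions rather than any provider's actual sheet:

```dart
// Back-of-envelope cost per PR. The per-million-token prices are assumptions;
// substitute your provider's current pricing.
void main() {
  const prompts = 3;
  const inputTokens = 8000, outputTokens = 1000;       // per prompt
  const inputPricePerM = 0.15, outputPricePerM = 0.60; // USD, assumed

  final cost = prompts *
      (inputTokens * inputPricePerM + outputTokens * outputPricePerM) /
      1e6;
  print('~\$${cost.toStringAsFixed(4)} per PR'); // ~$0.0054
}
```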
The cost that is real: the time engineers spend dismissing low-quality suggestions. If 30% of comments are noise, engineers will start ignoring all of them, including the good ones. Tuning prompts to reduce false positives matters more than reducing dollar cost.
The signal that the assistant is helping
After three months we measured:
- Pre-commit prompt 1 caught about one real Riverpod bug per week.
- Pre-commit prompt 2 caught a deprecated-API usage every other week.
- Pre-commit prompt 3 (refactor suggestions) was acted on roughly 1 in 8 times.
- CI comments were resolved before merge in 70% of PRs.
The first two prompts paid for themselves. The third would have been net-negative if we had not labelled it as low-priority and kept it short.
Where it failed
- Cross-file reasoning. The assistant could not catch "this widget rebuilds because three files away the provider scope changed." That is what humans are for.
- Performance suggestions. Asking the model to comment on perf produced confident, wrong advice. We removed that prompt entirely.
- Architecture comments. "Should this be in core/ or features/foo/?" depended on team conventions the model did not know. We solved this by including a one-paragraph project conventions blurb in the prompt.
- New patterns. The model would happily approve a pattern we had recently decided to ban, because it had no memory of the team decision. We added a CONVENTIONS.md that the prompt now includes for context; a sketch of that splice follows this list.
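The splice itself is small. A sketch, assuming the conventions file sits at the repo root and gets prepended to whichever prompt is about to run:

```dart
// Sketch of the CONVENTIONS.md splice; the file path and prefix wording are
// assumptions, not the exact production code.
import 'dart:io';

String withConventions(String prompt) {
  final file = File('CONVENTIONS.md');
  if (!file.existsSync()) return prompt;
  return 'Project conventions (these override general best practice):\n'
      '${file.readAsStringSync()}\n\n$prompt';
}
```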
What I would do differently
- I would have shipped pre-commit before CI. The fast local feedback loop is what changed habits. The CI bot just enforced.
- I would have started with one prompt, not three. Adding prompts once the first had proved its value was easier than dialing back a noisy initial version.
- I would have labelled refactor suggestions as "optional" in the comment text from the first day. Engineers treated unlabelled suggestions as blockers and got annoyed.
- I would have logged every suggestion's accept/dismiss outcome to a metrics dashboard. Without it I could not show leadership the value.
- I would not have asked the assistant to do performance review. It is genuinely bad at that and the comments wasted reviewers' time.
Closing opinion
Wire an LLM into your code review pipeline as a focused checklist runner with three or fewer prompts, run it pre-commit and on CI, and measure what it catches. Do not let it review architecture or performance. The assistant pays for itself if you scope it well, and becomes a cost if you let it free-form. For the in-app version of LLM integration, see Integrating an LLM into a Flutter app. For the wider GenAI shipping experience, see GenAI features in a Flutter app.