
AI Tax Preparation: Why 94% Accuracy Still Needs Professional Review

Leroy Kerry | CEO at Filed
Filed's line-by-line accuracy: 94%
Best generic AI (complete returns): 42%
Federal returns tested: 51

The scariest errors on a tax return aren't the ones you find during review. They're the ones that go out the door.

Every tax professional knows the feeling. It's the second week of March, you've reviewed your 30th return this week, and you're scanning lines faster than you were on return number five. You trust your team. You trust the process. But attention isn't infinite. The returns reviewed at 10pm on a Thursday don't get the same focus as the ones reviewed on a Tuesday morning in October.

There's no metric for this. No dashboard that tells you "this return got 85% of the review attention it deserved." You just move to the next one, and trust that your process caught what it needed to catch.

The data that puts this in perspective

Earlier this year, Column Tax released TaxCalcBench, the first open-source benchmark designed to test whether AI models can accurately calculate federal tax returns. They tested 51 returns with a strict standard: a return was "correct" only if every single line matched the expected value exactly.

At the time of our original report, the leading general-purpose AI models (ChatGPT, Claude, Gemini) scored between 23% and 42% on complete returns. Filed scored 72.5% on complete returns, with 94% line-by-line accuracy.

Since then, the frontier models have continued to improve, with the latest generation now approaching that same 94% threshold on line-by-line accuracy. That's worth acknowledging, and it's good news for the industry. But Filed uses multi-agent orchestration built on top of these same frontier models. When they improve, our accuracy improves too. The 94% we published months ago represents a floor, not a ceiling.

Why line-by-line accuracy is the metric that matters

What mattered most to us about the benchmark wasn't any single score. It was what the results revealed about where AI is reliable, where it still needs help, and what to build because of it.

Complete return accuracy measures how often every single line is perfect. One line off by a dollar on a 40-line return and the whole thing scores as wrong. That's a useful engineering metric, but it doesn't reflect how you actually work.

You review returns line by line. You check each number against source documents, catch errors, fix them, move on. So line-by-line accuracy answers the question you'd actually ask: "Out of every 100 lines, how many will need my attention?"
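To make the difference concrete, here's a minimal sketch of the two metrics. It's illustrative only, not TaxCalcBench's actual scoring code; each return is just a list of computed-versus-expected values, one pair per line.

```python
# Illustrative sketch of the two accuracy metrics (not TaxCalcBench's scoring code).
# Each return is a list of (computed, expected) pairs, one per evaluated line.

def complete_return_accuracy(returns):
    """Share of returns where every single line matches exactly."""
    perfect = sum(all(got == want for got, want in lines) for lines in returns)
    return perfect / len(returns)

def line_by_line_accuracy(returns):
    """Share of individual lines that match, pooled across all returns."""
    pairs = [pair for lines in returns for pair in lines]
    correct = sum(got == want for got, want in pairs)
    return correct / len(pairs)

# Two hypothetical five-line returns: one perfect, one off by a dollar on a single line.
returns = [
    [(1200, 1200), (880, 880), (0, 0), (320, 320), (2400, 2400)],
    [(1500, 1500), (975, 976), (0, 0), (410, 410), (2885, 2885)],
]
print(complete_return_accuracy(returns))  # 0.5 -- one bad line sinks the whole return
print(line_by_line_accuracy(returns))     # 0.9 -- 9 of 10 lines are correct
```

Same per-line results, very different-looking scores. That's why a model can look weak on complete returns while still handling the overwhelming majority of lines correctly.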

TaxCalcBench Results (at time of original report)

Line-by-line accuracy across all evaluated lines

Claude: 78.3%
ChatGPT: 81.5%
Gemini: 81.2%
Filed: 94%

Source: Column Tax TaxCalcBench, Tax Year 2024. Line-by-line metric. Scores reflect model performance at time of Filed's original report. The latest frontier models have since approached similar accuracy levels, and Filed's multi-agent system has continued to improve alongside them.

Generic AI (~80% line-by-line accuracy): ~15 lines need attention on a typical return

Filed (94% line-by-line accuracy): ~4 lines need attention on a typical return

For firms where review capacity is the bottleneck, this is the number that changes the math on whether AI saves your team time or creates more work.

Why even purpose-built AI needs a review layer

The benchmark proved something that should sound familiar: even a system specifically designed for tax preparation doesn't get everything right on the first pass.

That's not a failure. That's exactly why review exists.

Think about how your firm works today. A preparer builds the return. A reviewer checks it. That two-step process exists because no one expects the first draft to be perfect. The value of review isn't catching catastrophic errors. It's catching the small ones a preparer wouldn't notice in their own work.

AI is no different. Filed's prep engine achieved 94% line-by-line accuracy on the benchmark. That's meaningfully better than any general-purpose model. But that remaining 6% is exactly why we built a review engine on top of it.

Because Filed's review engine has full context on every source document that went into the return, it catches errors that the prep step introduced before they ever reach your desk. What lands in front of your team isn't a raw first draft with errors hidden across 40 lines. It's a reviewed return with specific flagged items that need professional judgment.

And if your team makes adjustments based on those flags, they can send the return back through Filed's review again before final sign-off. That cycle can repeat as many times as needed until the return is clean.

What the 6% actually looks like

The items that make it through to your level two review aren't calculation errors or data-entry mistakes. Those get caught by the system. What reaches your senior reviewers are genuine judgment calls.

A client sends a 1099-NEC with some scattered expense receipts. Is this a business or a hobby? That depends on a conversation, not a calculation.

A divorced couple both claim the same dependent. The resolution lives in a custody agreement, not in the tax data.

A client's prior-year return shows rental income, but this year they only sent W-2s. Did they sell the property? Stop renting? Forget to send the documents? The right answer requires a phone call, not a formula.

94% handled by the system: calculations, data entry, compliance checks

6% for your expertise: judgment calls, ambiguous situations, strategy

The 6% is what your level two review exists for. It always has been. The question is how much time your reviewers currently spend sifting through the other 94% just to find it.

Where generic AI breaks

The benchmark exposed two failure patterns that explain why a review layer is essential, not optional.

Every generic model tested used tax bracket calculations instead of the IRS-mandated Tax Table for taxable income under $100,000. Close results, but not compliant. A first-year staff mistake that any reviewer would catch instantly.
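To see why the results come out close but not identical: for taxable income under $100,000, the Tax Table reports tax for fixed income ranges (generally $50 wide) figured at the midpoint of the range, while the bracket formulas apply marginal rates to the exact income. Here's a rough sketch of that gap, with placeholder rates and thresholds rather than the actual IRS schedule:

```python
# Why Tax Table lookups and bracket formulas disagree slightly (illustrative only).
# The rates and thresholds below are placeholders, not the actual IRS schedule.

BRACKETS = [(0, 0.10), (11_600, 0.12), (47_150, 0.22)]  # (lower bound, marginal rate)

def tax_from_brackets(income):
    """Apply marginal rates to the exact taxable income."""
    tax = 0.0
    bounds = BRACKETS + [(float("inf"), None)]
    for (low, rate), (nxt, _) in zip(BRACKETS, bounds[1:]):
        if income <= low:
            break
        tax += (min(income, nxt) - low) * rate
    return tax

def tax_from_table(income):
    """Mimic the Tax Table: figure tax at the midpoint of a $50-wide income range."""
    midpoint = (income // 50) * 50 + 25
    return round(tax_from_brackets(midpoint))

income = 48_230
print(tax_from_brackets(income))  # exact-income bracket math
print(tax_from_table(income))     # $50-range midpoint, rounded: close, but not the same
```

The differences are only a few dollars, which is why the results look plausible even though the method isn't compliant.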

On complex forms like Form 8962 and Schedule 2, the models made structural errors: wrong line references, incorrect Federal Poverty Level figures, misapplied eligibility rules. One early mistake cascaded through the entire return.

Common failure pattern: how one error cascades through an entire return

These compounding errors are the hardest to catch during manual review, because the downstream numbers can look plausible even when they're wrong. A review engine that validates against source documents at every step doesn't have this problem. It doesn't get tired, it doesn't lose focus, and it doesn't assume downstream numbers are right just because they look reasonable.
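Here's a minimal sketch of that idea, not Filed's actual review engine (the line identifiers and amounts are invented): each computed line is compared against a value derived independently from the source documents before downstream lines build on it, so a cascade gets flagged at its origin.

```python
# Illustrative per-step validation against source-derived values (not Filed's engine).
# Line identifiers and amounts are invented for the example.

def validate_line(line_id, prepared, expected, tolerance=0):
    """Flag a line whose prepared value disagrees with the source-derived value."""
    if abs(prepared - expected) > tolerance:
        return {"line": line_id, "prepared": prepared, "expected": expected}
    return None

def review(prepared_return, source_derived):
    """Walk the return in dependency order, flagging every divergence as it appears."""
    flags = []
    for line_id, prepared in prepared_return:
        flag = validate_line(line_id, prepared, source_derived[line_id])
        if flag:
            flags.append(flag)  # anything computed downstream of this line is suspect
    return flags

# An upstream error (income understated by $1,000) drags the downstream total with it.
prepared = [("total_income", 61_500), ("deduction", 14_600), ("taxable_income", 46_900)]
source = {"total_income": 62_500, "deduction": 14_600, "taxable_income": 47_900}
print(review(prepared, source))  # both the origin and the downstream line get flagged
```

The point isn't the arithmetic. It's that validation happens at every step with the source documents in hand, rather than once at the end against numbers that already look plausible.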

The models will keep getting better. That's the point.

As we noted above, the latest frontier models are already approaching the accuracy levels Filed achieved months ago. That convergence is inevitable, and it validates what we've been building toward: accuracy is table stakes. The question is what you do with it.

Filed isn't a wrapper around a single model. Our system uses multi-agent orchestration, where specialized agents handle different parts of the tax preparation process with layered validation and deterministic checks at each step. When the underlying models improve at reasoning and calculation, that improvement compounds across every agent in our pipeline. We're always leveraging the best available models, which means our accuracy stays ahead of any individual model on the leaderboard.
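As a generic sketch of that pattern, not Filed's actual architecture (the stage names, checks, and data shapes here are assumptions): specialized steps hand off a shared state, and a deterministic check gates each handoff.

```python
# Generic multi-agent pipeline with deterministic checks between stages (illustrative;
# stage names, checks, and data shapes are assumptions, not Filed's design).
from dataclasses import dataclass, field

@dataclass
class ReturnState:
    documents: list                                 # raw source documents
    extracted: dict = field(default_factory=dict)   # values pulled from documents
    computed: dict = field(default_factory=dict)    # calculated return lines
    flags: list = field(default_factory=list)       # items queued for human review

def extract_agent(state):
    """Model-backed extraction of values from source documents (stubbed here)."""
    state.extracted = {"wages": 62_500, "withholding": 7_100}
    return state

def check_extraction(state):
    """Deterministic gate: required fields must be present and non-negative."""
    for key in ("wages", "withholding"):
        if state.extracted.get(key, -1) < 0:
            state.flags.append(f"extraction: missing or invalid {key}")
    return state

def compute_agent(state):
    """Model-backed calculation of return lines (stubbed here)."""
    state.computed = {"total_income": state.extracted["wages"]}
    return state

def check_computation(state):
    """Deterministic gate: computed lines must reconcile with extracted totals."""
    if state.computed.get("total_income") != state.extracted.get("wages"):
        state.flags.append("computation: total income does not reconcile with documents")
    return state

PIPELINE = [extract_agent, check_extraction, compute_agent, check_computation]

state = ReturnState(documents=["w2.pdf"])
for stage in PIPELINE:
    state = stage(state)
print(state.flags or "clean: ready for professional review")
```

Swapping in a stronger underlying model improves the model-backed stages without touching the deterministic gates, which is why gains in the base models carry through the whole pipeline.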

A second set of eyes that never loses focus

When your senior reviewers spend their time verifying data entry and checking routine calculations, they're doing necessary work at the wrong rate. These are your highest-value people, and every hour they spend on work that could be automated is an hour they're not spending on the judgment calls, client advisory, and strategic decisions that justify their billing rate.

But calculation has never been the full picture. TaxCalcBench tests one phase of tax preparation: given clean, structured data, can the AI compute the return correctly? No model on the benchmark leaderboard performs data entry. None of them navigate the actual interface of Drake, ProConnect, UltraTax, or CCH Axcess. None of them deal with the complexity of entering data into the correct rolled-over screen from a taxpayer's prior-year return, or adding new depreciation schedules on top of existing ones.

That's the real cost of the current review workflow. Not just the time. The margin.

Running TaxCalcBench with Filed requires converting the benchmark's structured JSON data into real tax documents, because our system is built to process actual PDFs, not clean inputs. That conversion process itself illustrates the gap between what benchmarks measure and what tax preparation actually requires.
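A minimal version of that conversion looks something like the sketch below, with invented field names rather than the actual TaxCalcBench schema, using reportlab to render a structured record as a document the prep engine has to read like any other PDF.

```python
# Sketch of turning structured benchmark input into a document-style PDF (illustrative;
# the field names and layout are invented, not the TaxCalcBench schema).
import json
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

record = json.loads('{"employer": "Acme Corp", "wages": 62500, "federal_withholding": 7100}')

c = canvas.Canvas("synthetic_w2.pdf", pagesize=letter)
c.setFont("Helvetica", 11)
c.drawString(72, 744, "Form W-2 (synthetic, generated for benchmark conversion)")
y = 720
for label, value in record.items():
    c.drawString(72, y, f"{label.replace('_', ' ').title()}: {value}")
    y -= 18
c.save()  # the prep engine receives this PDF, not the clean JSON it was built from
```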

In practice, tax preparation starts with messy, incomplete documents. PDFs with coffee stains. Handwritten notes in margins. Clients who send ten K-1s in five different formats. The real work involves extracting information from those documents, entering it accurately into professional tax software, validating it against source materials, and surfacing items for human review. That end-to-end workflow is where the gap between "an AI that can calculate" and "a system that helps a tax professional do their job" becomes clear.

Ready to see what AI-powered tax preparation can do?

Join tax professionals who are using Filed to handle document collection, data entry, and review so they can focus on what matters most.

Get started for free