Beyond benchmarks

Measuring AI Tax Accuracy: Comparing Filed to ChatGPT, Claude, and Gemini on an Open Benchmark

Alex Griffin | Product Marketing Lead at Filed

The question of whether AI can prepare taxes has moved from speculation to rigorous testing. Earlier this year, Column Tax released TaxCalcBench, an open-source benchmark designed to evaluate how well frontier AI models can calculate personal income tax returns. Their findings were sobering: even the best-performing large language model achieved only 32.35% accuracy on a simplified set of federal-only tax returns.

At Filed, we participated in the TaxCalcBench evaluation to understand where our system stands relative to these baseline measurements. Our results tell an important story, not just about performance metrics, but about what it actually takes to build AI systems that work in the complex reality of tax preparation.

Understanding TaxCalcBench's contribution

Column Tax deserves credit for creating the first rigorous, open-source benchmark for AI tax calculation. TaxCalcBench tests models on 51 Tax Year 2024 federal tax returns, covering various filing statuses, income sources, and common credits and deductions. The benchmark focuses specifically on the calculation phase of tax preparation: given complete, correctly formatted information, can an AI model compute the tax return accurately?

The evaluation criteria are unforgiving, as they should be for tax preparation. A tax return is considered "correct" only if every evaluated field matches the expected value exactly. This strict metric reflects the reality of tax filing: the IRS doesn't accept returns that are "close enough." If Line 16 shows $2,789 in tax owed but the correct amount is $2,792, the return is wrong. Period. There's no partial credit in tax compliance. The benchmark also reports a "lenient" metric allowing for plus-or-minus $5 differences per line, which provides useful diagnostic information about error patterns but has no practical relevance for actual tax filing.

  - Correct returns (strict): every evaluated line matches the expected value exactly.
  - Correct returns (lenient): every evaluated line is within +/- $5 of the expected value.
  - Correct by line (strict): the percentage of evaluated lines that match the expected value exactly.
  - Correct by line (lenient): the percentage of evaluated lines that are within +/- $5 of the expected value.
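
To make these metrics concrete, here is a minimal sketch of how they might be computed, assuming each return is represented as a dict mapping line identifiers to dollar amounts. The representation and field names are our own simplification, not TaxCalcBench's actual harness:

```python
def evaluate(expected: dict, actual: dict, tolerance: float = 0.0) -> tuple[bool, float]:
    """Return (whole-return correct?, fraction of lines correct).

    tolerance=0 gives the strict metrics; tolerance=5 gives the lenient ones.
    """
    matches = [
        abs(actual.get(line, float("inf")) - value) <= tolerance
        for line, value in expected.items()
    ]
    return all(matches), sum(matches) / len(matches)

# One wrong line ($2,789 vs. $2,792) fails the whole return on the strict metric
expected = {"1040_line_15": 48_500, "1040_line_16": 2_792}
actual = {"1040_line_15": 48_500, "1040_line_16": 2_789}

print(evaluate(expected, actual))               # (False, 0.5)
print(evaluate(expected, actual, tolerance=5))  # (True, 1.0)
```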

The benchmark revealed two primary failure modes across all tested models. First, models consistently used tax bracket percentage-based calculations instead of the IRS-mandated lookup tables. For taxable income under $100,000, the IRS instructions explicitly require using the Tax Table, not bracket calculations. This seemingly small deviation produces results that are often close but not exact, reflecting a fundamental misunderstanding of tax compliance requirements.
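The difference is easy to see in code. Below, a percentage-based bracket calculation is contrasted with a table-style lookup that mirrors how the Tax Table is constructed: $50-wide income rows, with tax computed at each row's midpoint and rounded to the dollar. The bracket figures approximate the 2024 single-filer schedule but are included purely for illustration:

```python
# Brackets as (rate, upper bound); roughly the 2024 single-filer schedule,
# shown here only for illustration
BRACKETS = [(0.10, 11_600), (0.12, 47_150), (0.22, 100_000)]

def bracket_tax(taxable: float) -> float:
    """Percentage-based bracket math: what the tested models tended to do."""
    tax, lower = 0.0, 0
    for rate, upper in BRACKETS:
        tax += rate * (min(taxable, upper) - lower)
        if taxable <= upper:
            break
        lower = upper
    return tax

def table_tax(taxable: float) -> int:
    """Tax Table style lookup: find the $50 row containing the income,
    compute tax at the row's midpoint, and round to a whole dollar."""
    row_start = int(taxable) // 50 * 50
    midpoint = row_start + 25
    return round(bracket_tax(midpoint))

income = 48_530
print(bracket_tax(income))  # 5729.6 with percentage math
print(table_tax(income))    # 5728 via the table-style lookup
```

The two methods disagree by a few dollars, which is exactly the "close but not exact" signature Column Tax observed.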

Second, models made frequent calculation errors, particularly on complex forms like Form 8962 (Premium Tax Credit) and Schedule 2. These weren't simple arithmetic mistakes but rather structural misunderstandings: hallucinating incorrect line numbers, using wrong Federal Poverty Level figures, or misapplying eligibility rules. The interconnected nature of tax calculations means that a single error early in the process cascades through the entire return.
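A toy dependency chain shows why a single early error is so costly; the relationships below are simplified stand-ins for the real forms:

```python
def compute_lines(agi: float) -> dict:
    """Toy model of how return lines feed one another."""
    taxable = max(agi - 14_600, 0)  # 2024 single standard deduction
    tax = 0.12 * taxable            # placeholder flat rate, not a real schedule
    return {"AGI": agi, "taxable_income": taxable, "tax": round(tax)}

correct = compute_lines(agi=60_000)
wrong = compute_lines(agi=59_000)  # one misread W-2 box, $1,000 off

# Every downstream line now disagrees, so the return fails the strict metric
for line in correct:
    print(line, correct[line], wrong[line])
```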

Among the models tested in the original Column Tax study, Gemini 2.5 Pro performed best at 32.35% for correct returns (strict metric), followed by Claude Opus 4 at 27.45%. We expanded our analysis to include GPT-5 and GPT-5 with web search, which were not included in the original report but had been added to the TaxCalcBench GitHub repository at the time of our analysis. GPT-5 with web search achieved 41.67% accuracy, representing an improvement but still falling far short of reliability requirements for actual tax preparation.

Filed's results and what they mean

Filed achieved 72.5% accuracy on the strict metric, 74.5% on the lenient metric (allowing for plus-or-minus $5 per line), and 94% accuracy on a line-by-line basis. On the metric that matters most (correctly computed complete returns), Filed's 72.5% represents a +30.8 percentage point improvement over GPT-5 with web search (41.7%), the best-performing standalone large language model, and a +42.4 percentage point improvement over the average across all LLMs tested in TaxCalcBench.

These results require some methodological context. TaxCalcBench provides test cases as JSON files with structured data inputs. Filed, however, is built to process actual documents: PDFs of W-2s, 1099s, scanned documents, and even handwritten notes. To participate in the benchmark, we converted the JSON files into real document formats that Filed could ingest, mimicking how our system encounters information in real-world practice.
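As a flavor of that conversion step, the sketch below renders a structured test-case fragment into a simple PDF with reportlab. The field names and layout are our own invention, not the benchmark's schema or Filed's internal tooling:

```python
import json
from reportlab.pdfgen import canvas  # pip install reportlab

# Hypothetical test-case fragment in the spirit of the benchmark's JSON inputs
w2 = json.loads('{"employer": "Acme Corp", "box1_wages": 62500, "box2_withheld": 5400}')

def render_w2(data: dict, path: str = "w2.pdf") -> None:
    """Lay out W-2-style fields on a page so a document pipeline can ingest it."""
    c = canvas.Canvas(path)
    y = 760
    for field, value in data.items():
        c.drawString(72, y, f"{field}: {value}")
        y -= 24
    c.save()

render_w2(w2)
```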

The performance gap between Filed and standalone LLMs points to a deeper architectural reality.

As Stelios Gisdakis, Head of AI at Filed, explains: 

"Prompts don't scale. Architecture does. When you build with multi-agent orchestration, you can add new capabilities without breaking what's already working. That's not just an engineering advantage, it's the only way to build systems that get better over time instead of more fragile."

Filed doesn't rely on a single large language model. Instead, we employ multi-agent orchestration, where specialized agents handle different aspects of tax preparation. This architecture provides three key advantages:

  1. It allows us to optimize costs by using the most appropriate model for each task rather than applying an expensive frontier model to every operation.
  2. It enables us to add new capabilities (supporting additional tax forms, handling new document types, integrating with new software platforms) without refactoring the entire system.
  3. Most importantly for benchmark performance, it allows us to apply specialized validation and error-checking at each stage of processing (see the sketch below).
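
In outline, the orchestration pattern looks something like the following. The agents, stages, and validation rules here are hypothetical placeholders rather than Filed's actual components:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    """One specialized stage: a task runner plus a validator for its output."""
    name: str
    run: Callable[[dict], dict]
    validate: Callable[[dict], bool]

def orchestrate(agents: list[Agent], state: dict, max_retries: int = 2) -> dict:
    """Run each stage in order, re-running a stage if its validator rejects it."""
    for agent in agents:
        for attempt in range(max_retries + 1):
            result = agent.run(state)
            if agent.validate(result):
                state = result
                break
        else:
            raise ValueError(f"{agent.name} failed validation after retries")
    return state

# Hypothetical stages: extraction, then calculation, each with its own check
pipeline = [
    Agent("extract", lambda s: {**s, "wages": 62_500}, lambda r: "wages" in r),
    Agent("calculate", lambda s: {**s, "tax": 5_728}, lambda r: r.get("tax", -1) >= 0),
]
print(orchestrate(pipeline, state={}))
```

Because each stage is checked before its output flows onward, an error is caught where it occurs rather than cascading through the return.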

The architectural approach also addresses the consistency problem that Column Tax identified in their research. Single-model systems showed decreasing reliability across multiple runs (the pass^k metric), meaning you couldn't trust them to produce the same result twice. Multi-agent systems with explicit validation steps and deterministic components can maintain consistency while still leveraging the flexibility of AI where it's beneficial.
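For readers unfamiliar with the metric: pass^k estimates the probability that all k independent runs of the same task succeed. A common estimator, given n runs of which c succeed, is C(c, k) / C(n, k); the benchmark's exact implementation may differ, but a minimal version looks like this:

```python
from math import comb

def pass_k(n: int, c: int, k: int) -> float:
    """Estimated probability that k runs drawn from n (with c successes) all pass."""
    return comb(c, k) / comb(n, k)

# A task that passes 8 of 10 runs looks fine at k=1 but unreliable at k=5
print(pass_k(10, 8, 1))  # 0.8
print(pass_k(10, 8, 5))  # ~0.22
```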

Beyond calculation: The full tax preparation workflow

TaxCalcBench deliberately focuses on one piece of the tax preparation puzzle: calculation. This focus makes sense for establishing a baseline, but it's important to understand that calculation represents only one phase of a much larger workflow.

Real tax preparation involves at least four distinct phases: document collection, data entry, calculation, and review. Each phase presents its own challenges, and each requires different technical approaches.

Document collection in the real world is messy. Tax professionals receive incomplete documents, multiple revisions, handwritten notes in margins, documents with coffee stains obscuring critical numbers, and clients who send 10 or more K-1s in various formats. Filed processes these real-world documents, extracting information even when it's presented in non-standard ways. The benchmark's clean JSON inputs don't capture this complexity.

Data entry represents another significant challenge that the benchmark doesn't address. Tax professionals use specialized software platforms like Drake, ProConnect, UltraTax, CCH Axcess, and Lacerte. These platforms have their own interfaces, field validation rules, and workflow requirements. Filed uses robotic process automation (RPA) to fill in tax returns directly within these professional platforms, navigating their interfaces and ensuring data is entered in the correct fields with proper formatting. This is fundamentally different from generating text output or producing a simplified markdown representation of a tax return.
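
To make the contrast concrete, desktop RPA drives the target application's own interface rather than emitting text. The sketch below uses pyautogui as a generic stand-in; the coordinates and flow are invented for illustration and bear no relation to Filed's actual integrations:

```python
import pyautogui  # pip install pyautogui; a generic GUI-automation stand-in

# Hypothetical screen positions for fields in a desktop tax application
FIELD_POSITIONS = {"wages": (420, 310), "federal_withholding": (420, 350)}

def enter_field(field: str, value: str) -> None:
    """Click into a field, select any existing contents, and type a value."""
    x, y = FIELD_POSITIONS[field]
    pyautogui.click(x, y)
    pyautogui.hotkey("ctrl", "a")
    pyautogui.typewrite(value, interval=0.02)

enter_field("wages", "62500.00")
enter_field("federal_withholding", "5400.00")
```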

Review and quality control become even more critical when AI is involved. Tax professionals need to understand not just what the AI calculated, but how it arrived at that result. They need to verify that the right forms were used, that eligibility was determined correctly, and that the return makes sense in the context of the client's overall financial situation. Filed is designed to fit into the tax professional's workflow, providing transparency and enabling efficient review rather than trying to replace professional judgment.

As Atul Ramachandran, CTO and Co-Founder at Filed, puts it:

"We've never been interested in competing with foundational models. Our strategy has always been about building the best possible experience on top of these models, using them in ways that genuinely improve the daily work of tax professionals. We're not trying to replace expertise. We're trying to amplify it."

The real-world datasets that Filed processes are far more challenging than any synthetic benchmark can capture. When a client sends a box of receipts, a partially filled spreadsheet, and a question about whether their side business qualifies for certain deductions, you're not just calculating taxes. You're interpreting intent, reconciling incomplete information, and applying professional judgment about ambiguous situations. These scenarios represent the actual value that Filed's AI can provide to tax professionals: handling the tedious, time-consuming work so they can focus on complex judgment calls and client relationships.

What this means for the industry

The discussion around AI and tax preparation often focuses narrowly on model accuracy, but building production systems requires addressing numerous other challenges.

Data protection stands out as particularly critical in tax preparation. Tax returns contain some of the most sensitive personal and financial information that exists. Filed implements data anonymization and protection measures throughout our system. When we train or fine-tune models, we ensure that client data is properly anonymized. When we process returns, we maintain strict security protocols. These aren't afterthoughts; they're foundational requirements that shape architectural decisions from the start.
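
To give a sense of what such protections involve at the simplest level, here is a redaction pass that strips SSN- and EIN-shaped identifiers from text before it reaches a model. Production systems need much more than regexes (entity detection, tokenized account numbers, audit logging), so treat this strictly as a sketch:

```python
import re

# US SSNs (123-45-6789) and EINs (12-3456789); word boundaries avoid partial hits
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EIN": re.compile(r"\b\d{2}-\d{7}\b"),
}

def redact(text: str) -> str:
    """Replace identifier-shaped substrings with typed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Employee SSN 123-45-6789, employer EIN 12-3456789."))
# Employee SSN [SSN], employer EIN [EIN].
```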

The business case for AI in tax preparation extends well beyond model performance metrics.

As Leroy Kerry, CEO and Co-Founder at Filed, puts it: 

"People often ask about our competitive moat, as if there's just one. But that's not how enduring businesses work. You need a product that users genuinely love. You need the ability to ship new capabilities quickly. You need distribution and clients who actively recommend you. You need brand and trust, especially in a field like tax preparation. You need culture and team. We've been fortunate to build strength in all of these areas, not just one."

This perspective is important because it acknowledges that accuracy, while necessary, is not sufficient. Tax professionals choose tools based on reliability, yes, but also on whether the tool fits naturally into their workflow, whether it saves them time on the tasks they find most tedious, whether it helps them serve more clients or provide better service, and whether they trust the company behind the tool to continue supporting and improving it.

The integration challenge deserves particular emphasis. Tax professionals have already invested in learning specific software platforms. They have established workflows, templates, and quality control processes built around these platforms. Any AI solution that requires abandoning these investments or completely restructuring established workflows faces an uphill adoption battle. This is why Filed's approach of integrating directly with existing professional tax software through RPA makes practical sense. We meet tax professionals where they are rather than asking them to completely change how they work.

The industry also needs to grapple with the distinction between tools for tax professionals and tools that attempt to replace them. The TaxCalcBench results suggest we're far from having AI that can independently prepare taxes with adequate reliability. Even Filed's 72.5% accuracy, while substantially better than standalone LLMs, means that roughly a quarter of returns would require human intervention and correction. This argues strongly for a human-in-the-loop approach like Filed's, where AI augments professional expertise rather than trying to supplant it.

A forward-looking perspective

Column Tax's TaxCalcBench provides a valuable service to the industry by establishing clear baseline measurements and identifying specific failure modes. The benchmark will continue to evolve, with plans for state returns, more complex tax situations, and proper XML output formatting. These additions will make the benchmark increasingly representative of real-world complexity.

Filed's results on the benchmark demonstrate that substantially better performance is achievable today through thoughtful architecture and multi-agent orchestration. But more importantly, our participation in the benchmark reinforces a broader point: benchmarks measure important things, but they don't measure everything that matters.

Building AI systems that work in production requires addressing the full scope of real-world requirements. It requires processing messy, incomplete documents. It requires integrating with existing professional tools and workflows. It requires maintaining data security and client privacy. It requires providing transparency so that professionals can efficiently review and validate results. It requires building trust over time through consistent performance and responsive support.

The path forward for AI in tax preparation isn't about waiting for models to achieve 100% accuracy on calculation benchmarks. It's about building complete systems that make tax professionals more productive, more capable of handling complex situations, and better able to serve their clients. It's about recognizing that accuracy is necessary but not sufficient, and that the real challenge lies in all the infrastructure, integration, and workflow considerations that surround the core calculation task.

We're encouraged by the progress that both Filed and the broader industry have made. We're also realistic about how much work remains. The tax code is vast, complex, and constantly changing. Edge cases abound. Professional judgment will always be required. But with the right architecture, the right approach to integrating AI into professional workflows, and continued collaboration across the industry, we can build tools that genuinely improve how tax preparation works.

We thank Column Tax for their research contribution and look forward to continued industry collaboration as we collectively work to understand and improve AI capabilities in this domain. The conversation should remain grounded in rigorous evaluation, honest about current limitations, and focused on building systems that serve the needs of tax professionals and their clients.

View Filed's full performance results across TaxCalcBench

Ready to see what AI-powered tax preparation can do?

Join tax professionals who are using Filed to handle document collection, data entry, and review so they can focus on what matters most.

Get started for free