Research | 15 min read | April 10, 2026

Benchmark Report: Ur-AI Parser API vs. Azure, LlamaParse, and IBM Docling on Japanese Enterprise and Financial Documents

The Ur-AI Parser API achieves parity with Azure Document Intelligence and LlamaParse on text and table reasoning — and outperforms Azure by 28 percentage points on chart understanding. Validated on 25 Japanese enterprise and financial documents across 363 tasks, with results that traditional OCR metrics (CER, TEDS) fail to capture.

Sandeep Yella

Founder, CEO & CTO

Evaluation metric design conducted in collaboration with Hajime Hotta of the Hajime Institute.

Abstract

This report evaluates next-generation document transformation (Doc2Md) models designed for enterprise AI systems and LLM applications. Logical reasoning over unstructured financial data — such as business plans and presentation slides — remains a core challenge in modern AI architectures.

Static evaluation methods prioritizing visual or superficial formatting accuracy (CER, AST/TEDS) fail to reliably measure practical performance in downstream LLM reasoning tasks [3][4].

We propose an evaluation framework adopting the LLM-as-a-Judge paradigm [5] to directly test whether downstream LLMs can solve tasks using the extracted text. Validation was conducted on complex Japanese enterprise and financial documents (25 documents, 165 QA tasks, 198 reasoning tasks over 100 charts). The Ur-AI Parser API performs comparably to leading enterprise foundational models in semantic restoration capability (reasoning robustness), a prerequisite for real-world deployment. This report proposes "AI-readiness" as a standard for document processing and examines which transformation approach best serves AI applications.

1. Why Traditional OCR Metrics Fail for AI Applications

Historically, OCR systems and parsers have been evaluated with character error rate (CER) for string matching and Tree-Edit-Distance-based Similarity (TEDS) for table structure alignment. Judging system reliability on these metrics alone introduces significant risk in production environments.

1.1 The Gap Between Visual Accuracy and Task-Solving Capability

The primary point of failure in document-fed AI is a data representation the LLM cannot interpret. Loss of semantic relevance, or a VLM misreading a graph (visual hallucination in chart understanding [6]), leads directly to incorrect outputs. Visually accurate Markdown and high TEDS scores have limited utility if the downstream LLM cannot logically parse the underlying dataset structure.

Distance-based metrics penalize minor formatting disparities in semantically identical Markdown [3], and recent studies [4][5] indicate minimal correlation between superficial string matching and actual LLM answering capability. Consequently, legacy form-based metrics can give a misleading forecast of performance for document-centric AI applications.
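
The formatting penalty can be seen in a minimal sketch. The CER definition below (edit distance divided by reference length) is the standard one, but the table row is made up for illustration and is not taken from the benchmark documents.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Two renderings of the same (made-up) table row; only the padding differs.
reference = "| Revenue | 1,200 |"
hypothesis = "|Revenue|1,200|"

print(f"CER = {cer(reference, hypothesis):.3f}")  # prints "CER = 0.211"
```

Both strings carry the same label and the same number, yet the four missing spaces alone account for a CER of about 0.21.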

2. What Ur-AI Optimizes For: AI-Readiness

The enterprise data bottleneck is no longer a lack of AI models, but the quality of unstructured data fed into them. We are not building standard OCR, basic parsing tools, or isolated RAG infrastructure.

The Ur-AI Parser is an AI-ready document transformation system converting business documents into structured representations (Markdown, structured text, semantic representations) optimized for downstream reasoning. Operationally, it functions as the transformation layer between documents and AI systems.

2.1 Definition and Components of AI-Readiness

AI-Readiness defines how effectively a document's data representation can be interpreted and queried by LLM-based systems. It requires:

Text clarity — absence of OCR noise or broken tokens
Structural integrity — preservation of tables, hierarchy, and sections
Numerical fidelity — mandatory for financial use cases
Semantic continuity — unfragmented information
Delimiter and formatting hygiene
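
As a rough illustration of what a check for structural integrity and delimiter hygiene might look like, the sketch below tests whether every row of a pipe-delimited Markdown table has the same number of cells. The function names and the sample table are hypothetical, and this is not a description of how the Ur-AI Parser validates its output.

```python
def table_column_counts(markdown: str) -> list[int]:
    """Count the cells in each pipe-delimited row of a Markdown table.

    Assumption: cells never contain escaped pipes (\\|).
    """
    counts = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # not a table row
        inner = line[1:]
        if inner.endswith("|"):
            inner = inner[:-1]
        counts.append(len(inner.split("|")))
    return counts


def has_consistent_columns(markdown: str) -> bool:
    """True when every table row has the same number of cells."""
    return len(set(table_column_counts(markdown))) <= 1


# Made-up tables: in the second one, two values have bled into one cell.
clean = "| Item | FY2024 | FY2025 |\n|---|---|---|\n| Revenue | 1,200 | 1,350 |"
bled = "| Item | FY2024 | FY2025 |\n|---|---|---|\n| Revenue | 1,200 1,350 |"

print(has_consistent_columns(clean))  # True
print(has_consistent_columns(bled))   # False
```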

2.2 Controlled Normalization

While generic OCR systems optimize strictly for visual accuracy, the Ur-AI Parser is engineered specifically for LLM readability.

To achieve this, we apply controlled normalization to improve machine readability without altering semantic meaning. In practice, this means inserting strategic delimiters to prevent table columns from bleeding into one another and standardizing excessive whitespace. By doing so, normalization substantially improves the utility of the data for AI environments while strictly preserving the underlying business data and numerical structures.
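
The two operations named above, delimiter insertion and whitespace standardization, can be sketched as follows. This is an illustrative assumption about what such a step could look like, not the Ur-AI Parser's actual implementation. The `normalize_row` helper and its `min_gap` parameter are hypothetical, and the second value in the sample row is made up.

```python
import re


def normalize_row(line: str, min_gap: int = 2) -> str:
    """Turn a whitespace-aligned row into a pipe-delimited one.

    Any tab, or a run of `min_gap` or more spaces, is treated as a column
    boundary. Whitespace inside a cell is collapsed to a single space.
    Assumption: the row has no intentionally empty cells (they are dropped).
    """
    cells = re.split(rf"\t+| {{{min_gap},}}", line.strip())
    cells = [re.sub(r"\s+", " ", c).strip() for c in cells if c.strip()]
    return "| " + " | ".join(cells) + " |"


raw = "Operating income     △161,921      12,450"
print(normalize_row(raw))
# prints "| Operating income | △161,921 | 12,450 |"
```

Only whitespace and delimiters change. Every digit, sign marker, and thousands separator in the input appears unchanged in the output, which is the property the text describes as preserving numerical structure.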

3. Benchmark Design

To rigorously assess this capability, our benchmark design deliberately shifts away from surface-level metrics toward task-based evaluations. The central guiding question is: Can downstream LLM applications reliably read the parsed text and execute complex logical reasoning?

3.1 Four Levels of Semantic Reasoning Tasks (L1–L4)

Reasoning robustness is categorized into four difficulty tiers:

L1 — Extraction of isolated numerical values or facts
L2 — Spreadsheet calculation and structural reasoning (e.g., calculating profit margin delta) [1][2]
L3 — Hierarchical and contextual reasoning referencing macro and micro structural elements
L4 — Scheme reasoning involving visual logical flows (charts)
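
A minimal sketch of how results could be tagged with these tiers and aggregated per tier. The `Tier`, `TaskResult`, and `accuracy_by_tier` names are hypothetical, and the sample results are toy values, not benchmark data.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    L1 = "extraction"
    L2 = "table calculation"
    L3 = "hierarchical reasoning"
    L4 = "chart reasoning"


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    tier: Tier
    correct: bool


def accuracy_by_tier(results: list[TaskResult]) -> dict[Tier, float]:
    """Fraction of correct answers for each tier that appears in `results`."""
    totals: dict[Tier, int] = defaultdict(int)
    hits: dict[Tier, int] = defaultdict(int)
    for r in results:
        totals[r.tier] += 1
        hits[r.tier] += int(r.correct)
    return {tier: hits[tier] / totals[tier] for tier in totals}


# Toy results for illustration only.
toy = [
    TaskResult("q1", Tier.L1, True),
    TaskResult("q2", Tier.L1, True),
    TaskResult("q3", Tier.L2, True),
    TaskResult("q4", Tier.L2, False),
]
for tier, acc in accuracy_by_tier(toy).items():
    print(tier.name, f"{acc:.0%}")
```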

4. Dataset and Methodology

We designed a transparent and rigorous validation setup to evaluate these systems (with a detailed methodology provided in the Appendix).

Dataset Composition

Volume: 25 documents
Domain: Japanese financial/business sectors (securities reports, medium-term plans, briefing slides)
Tasks: 165 textual QA tasks distributed across L1 text retrieval, L2 tabular structure/calculations, and L3 cross-section reasoning; and 198 L4 multimodal reasoning tasks across 100 charts

Evaluation Process

Generation & Verification: Initial QA datasets ("Golden QA") were generated with several model architectures (including OpenAI GPT-5.4 Mini and Google Gemini 3 Flash) and cross-referenced to neutralize model-specific formatting bias. Human experts then fixed the Ground Truth values by checking them strictly against the original source pixels.
LLM Judge Prompts: The evaluation prompt passed the raw parsed Markdown as the sole context to Google Gemini 2.5 Flash, which produced the answers. Google Gemini 3 Flash served as the Judge model, scoring responses against the Golden QA on semantic accuracy rather than rigid string matching.
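
The answer-then-judge flow described above can be sketched as a small harness. Everything here is an assumption made for illustration: the prompt wording, the `GoldenQA` type, and the `evaluate` function are hypothetical, and the two models are passed in as plain callables so that no vendor API is implied. The stubs at the bottom stand in for the real answer and judge models.

```python
from dataclasses import dataclass
from typing import Callable

# Any function that maps a prompt string to a completion string.
Model = Callable[[str], str]


@dataclass(frozen=True)
class GoldenQA:
    question: str
    answer: str


def build_answer_prompt(markdown: str, question: str) -> str:
    # The parsed Markdown is the only context the answer model receives.
    return (
        "Answer using only the document below. "
        "If the answer is not in the document, say so.\n\n"
        f"<document>\n{markdown}\n</document>\n\n"
        f"Question: {question}"
    )


def build_judge_prompt(qa: GoldenQA, candidate: str) -> str:
    return (
        "Decide whether the candidate answer is semantically equivalent to "
        "the reference. Ignore formatting differences. "
        "Reply with CORRECT or INCORRECT.\n\n"
        f"Question: {qa.question}\n"
        f"Reference: {qa.answer}\n"
        f"Candidate: {candidate}"
    )


def evaluate(markdown: str, tasks: list[GoldenQA],
             answerer: Model, judge: Model) -> float:
    """Return the fraction of tasks the judge marks as correct."""
    if not tasks:
        raise ValueError("tasks must be non-empty")
    correct = 0
    for qa in tasks:
        candidate = answerer(build_answer_prompt(markdown, qa.question))
        verdict = judge(build_judge_prompt(qa, candidate))
        correct += verdict.strip().upper().startswith("CORRECT")
    return correct / len(tasks)


# Stub models so the sketch runs offline; real runs would call LLM APIs.
def stub_answerer(prompt: str) -> str:
    return "1200"


def stub_judge(prompt: str) -> str:
    fields = dict(line.split(": ", 1)
                  for line in prompt.splitlines() if ": " in line)
    same = (fields["Reference"].replace(",", "")
            == fields["Candidate"].replace(",", ""))
    return "CORRECT" if same else "INCORRECT"


tasks = [GoldenQA("What was FY2024 revenue?", "1,200")]
print(evaluate("| Revenue | 1,200 |", tasks, stub_answerer, stub_judge))
```

Passing the models in as callables keeps the parser under test as the only variable: the same answerer and judge are reused while the `markdown` argument changes from parser to parser.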

Benchmarked Models (March 2026)

Azure Document Intelligence (Enterprise API standard)
LlamaParse (Third-party API standard)
IBM Docling (State-of-the-art local OSS)
Ur-AI Parser API

5. Results

The task evaluation results indicate that models optimized for downstream reasoning show measurable differences in multimodal task performance, and that these differences are largely not reflected in traditional distance-based metrics.

5.1 Multimodal Evaluation of Infographics (L4 Category)

Handling complex visual data is a primary requirement for document processing pipelines. We evaluated 198 L4 queries across 100 charts.

Method	Accuracy	Rank
Google Gemini 3 Flash	78.8%	1
Ur-AI Parser	71.2%	2
Azure Document Intelligence	42.9%	3

IBM Docling and LlamaParse were excluded due to incompatible image analysis capabilities.

The Ur-AI Parser recorded 71.2% accuracy in this category, compared to 42.9% for Azure Document Intelligence, a gap of 28.3 percentage points. Dedicated context-heavy VLMs (Google Gemini 3 Flash) achieved the highest score (78.8%). However, replacing raw image inputs with the structured text representations generated by the Ur-AI Parser is intended to stabilize data extraction and mitigate visual hallucinations in chart interpretation [6].

5.2 AI Task Evaluation Performance (L1–L3 Categories)

We also evaluated accuracy across standard textual and structural reasoning tasks (165 queries).

Method	Overall Accuracy	Rank
Azure Document Intelligence	72.1%	1
Ur-AI Parser	70.3%	2
LlamaParse	70.3%	2
IBM Docling	68.5%	4

The Ur-AI Parser recorded a 70.3% overall accuracy, performing alongside Azure Document Intelligence (72.1%) and LlamaParse (70.3%). In the L2 domain (table understanding and calculation), the Ur-AI Parser maintained an accuracy range of 77%–78%, comparable to Azure Document Intelligence. This indicates that the normalization process preserves underlying structural integrity without restricting calculation capability. For L3 cross-section tasks, accuracy settled at 62%, suggesting that extracting header representations in deep hierarchies remains an architectural constraint across the tested models.

5.3 Traditional Metrics (CER/TEDS) vs. Task Capability

Evaluating these models using legacy alignment metrics shows a divergence from the task-based results above.

Method	CER (Error Rate) ↓	TEDS ↑
Google Gemini 3 Flash	0.0059	0.9368
LlamaParse	0.0502	0.4092
Azure Document Intelligence	0.0809	0.4076
IBM Docling	0.1498	0.4881
Ur-AI Parser	0.2141	0.4268

Using strict CER, the controlled normalization applied by the Ur-AI Parser incurs penalties, increasing the measured error rate. TEDS scores also show variance depending on the Markdown syntax used for the Ground Truth [3]. However, comparing these figures with the L1–L4 capability evaluations (Sections 5.1 & 5.2) demonstrates that traditional metrics do not consistently predict downstream LLM task proficiency, indicating the limitations of relying solely on visual replication scores for AI applications.
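
One way to make "do not consistently predict" concrete is a rank correlation between CER and L1–L3 accuracy, using the figures from the tables in Sections 5.2 and 5.3. The sketch below computes Spearman's coefficient by hand. With only four parsers it is purely illustrative and carries no statistical weight.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values ascending (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Order: LlamaParse, Azure Document Intelligence, IBM Docling, Ur-AI Parser.
cer_scores = [0.0502, 0.0809, 0.1498, 0.2141]  # Section 5.3, lower is better
accuracy = [70.3, 72.1, 68.5, 70.3]            # Section 5.2, percent

rho = spearman(cer_scores, accuracy)
print(f"Spearman rho = {rho:.3f}")
```

If CER were a reliable predictor of task accuracy, the coefficient would sit close to -1 (lower error, higher accuracy). On these four points it comes out near -0.32, a weak relationship.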

6. Error Analysis

A detailed classification of errors helps map systemic constraints and identifies clear requirements for future architectural iterations.

For instance, single-character misrecognition remains a fundamental limit of the underlying spatial recognition engines rather than a flaw in LLM reasoning. A typical case is reading the "△" that marks a negative value (as in "△161,921") as a "4". Furthermore, when dealing with dense objects or complex infographics, relying on pixel-area estimation inevitably generates a measurable degree of approximation drift.
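
The numerical impact of that single-character error can be sketched as follows. The `parse_jp_amount` helper is hypothetical, and it assumes a leading "△" or "▲" marks a negative amount, as is conventional in Japanese financial statements.

```python
def parse_jp_amount(text: str) -> int:
    """Parse an amount such as '△161,921' or '12,450' into a signed integer.

    Assumption: a leading △ or ▲ marks a negative value.
    """
    text = text.strip()
    negative = text[:1] in ("△", "▲")
    digits = text[1:] if negative else text
    value = int(digits.replace(",", ""))
    return -value if negative else value


correct = parse_jp_amount("△161,921")  # -161921
misread = parse_jp_amount("4161,921")  # 4161921, with "△" read as "4"

print(correct, misread, misread - correct)
# prints "-161921 4161921 4323842"
```

A loss of 161,921 becomes a gain of more than four million, so any downstream L2 calculation that uses the value is wrong regardless of how well the LLM reasons.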

7. Limitations

It is important to acknowledge the limitations of this benchmark. Primarily, the sample size of 165 L1–L3 queries is too small to establish statistically significant differences between the models. Consequently, the current dataset serves to indicate a comparable performance tier among the leading cohort, rather than proving absolute superiority over competitors.
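
A rough sketch of why 165 queries cannot separate the leading parsers. The counts 119 and 116 are assumptions back-computed from the reported 72.1% and 70.3% on 165 queries, not figures published in this report. The test treats the two samples as independent, which is a simplification: all parsers answered the same questions, so a paired test such as McNemar's would be more appropriate given per-question results.

```python
import math


def wilson_interval(successes: int, n: int,
                    z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def two_proportion_p_value(s1: int, n1: int, s2: int, n2: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (s1 / n1 - s2 / n2) / se
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))


N = 165
AZURE_CORRECT = 119  # assumed: 119 / 165 = 72.1%
URAI_CORRECT = 116   # assumed: 116 / 165 = 70.3%

low, high = wilson_interval(AZURE_CORRECT, N)
p_value = two_proportion_p_value(AZURE_CORRECT, N, URAI_CORRECT, N)
print(f"Azure 95% interval: {low:.3f} to {high:.3f}")
print(f"p-value, Azure vs Ur-AI: {p_value:.2f}")
```

Under these assumptions, the interval around the top score is wide enough to contain every other L1–L3 score in Section 5.2, and the p-value is far above 0.05. This is consistent with reading the results as a shared performance tier rather than a ranking.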

Error modes and targeted fixes

Failure Mode	Impact	Example	Planned Fix
OCR symbol confusion	Numeric calculation errors	Confusing "△" (minus) and "4"	Refine foundational OCR training data and rule sets
Dense chart approximation	Value drift	Approximating 145% as 140%	Upgrade chart structural extraction algorithms
Cross-section reasoning	Weaker L3 hierarchical tasks	Logic breakdown across pages	Deepen markup linking for structural headers

8. Why This Matters for AI Applications

Ultimately, document transformation is a core determinant of AI application reliability, not just a simple preprocessing step.

Enterprise AI workloads (including investment due diligence, auditing, financial copilots, agent-based enterprise workflows, and knowledge extraction systems) depend on the reliable ingestion of unstructured documents. The AI-readiness of this transformed text directly impacts:

Answer model (LLM) reasoning accuracy
Information-driven hallucination rates
Logical and computational stability over dense data arrays

Rather than focusing solely on visual transcription, the Ur-AI Parser API functions as an AI-ready transformation system. By passing normalized, semantically consistent data to downstream reasoning engines, it supports greater reliability across production AI workflows.

Appendix: Detailed Methodology

1. Creation of Golden QA

Candidate questions were generated across the 25 target documents using diverse prompts and differing LLM architectures.
Data annotators (domain experts) cross-referenced source pages to lock Ground Truth variables (numerics, logic, context).
Blind Protocol: QA dataset generation was completely blind to parser outputs. Questions were sourced exclusively from the raw PDFs/images. This structurally prevents "Potemkin bias" (cherry-picking questions targeting specific parser strengths) and so supports benchmark neutrality.

2. Evaluation Prompts and LLM Judge Design

Context prompts feeding output Markdown to Google Gemini models were tightly standardized.
Meta-prompts restricted LLM generation purely to document data, barring external knowledge.
Custom Judge scoring logic penalized practical semantic faults rather than superficial string deviations.

References


Frequently Asked Questions

What is AI-Readiness in document processing?

AI-Readiness defines how effectively a document's data representation can be interpreted and queried by LLM-based systems. It requires text clarity (absence of OCR noise), structural integrity (preservation of tables and hierarchy), numerical fidelity (especially for financial data), semantic continuity, and delimiter hygiene. Traditional metrics like CER and TEDS measure visual accuracy but do not reliably predict whether downstream LLMs can solve reasoning tasks using the extracted text.

How does the Ur-AI Parser compare to Azure Document Intelligence and LlamaParse?

On 165 textual and structural reasoning tasks (L1–L3), the Ur-AI Parser achieved 70.3% overall accuracy — on par with LlamaParse (70.3%) and close to Azure Document Intelligence (72.1%). On 198 multimodal L4 tasks across 100 charts, the Ur-AI Parser recorded 71.2% accuracy versus 42.9% for Azure Document Intelligence. Traditional CER/TEDS metrics do not reflect this task-level performance, demonstrating that visual accuracy scores alone are insufficient for evaluating AI application readiness.
