Benchmark Report: Ur-AI Parser API vs. Azure, LlamaParse, and IBM Docling on Japanese Enterprise and Financial Documents
The Ur-AI Parser API achieves parity with Azure Document Intelligence and LlamaParse on text and table reasoning — and outperforms Azure by 28 percentage points on chart understanding. Validated on 25 Japanese enterprise and financial documents across 363 tasks, with results that traditional OCR metrics (CER, TEDS) fail to capture.
Evaluation metric design conducted in collaboration with Hajime Hotta of the Hajime Institute.
Abstract
This report evaluates next-generation document transformation (Doc2Md) models designed for enterprise AI systems and LLM applications. Logical reasoning over unstructured financial data — such as business plans and presentation slides — remains a core challenge in modern AI architectures.
Static evaluation methods prioritizing visual or superficial formatting accuracy (CER, AST/TEDS) fail to reliably measure practical performance in downstream LLM reasoning tasks [3][4].
We propose an evaluation framework that adopts the LLM-as-a-Judge paradigm [5] to test directly whether downstream LLMs can solve tasks using the extracted text. Validation was conducted on complex Japanese enterprise and financial documents (25 documents, 165 QA tasks, and 198 reasoning tasks over 100 charts). The Ur-AI Parser API performs comparably to leading enterprise document parsers in semantic restoration capability (reasoning robustness), a prerequisite for real-world deployment. This report proposes "AI-readiness" as a standard for document processing and identifies the transformation properties that matter most for AI applications.
1. Why Traditional OCR Metrics Fail for AI Applications
Historically, OCR engines and parsers have been evaluated with CER (character error rate) for string matching and TEDS (tree-edit-distance-based similarity) for table structure alignment. Judging system reliability purely on these metrics introduces significant risk in production environments.
1.1 The Gap Between Visual Accuracy and Task-Solving Capability
The primary point of failure in document-fed AI is a data representation the LLM cannot interpret. Loss of semantic context during parsing, or a VLM misreading a graph (visual hallucination in chart understanding [6]), leads directly to incorrect outputs. Visually accurate Markdown and high TEDS scores have limited utility if the downstream LLM cannot logically parse the underlying dataset structure.
Distance-based metrics penalize minor formatting differences between semantically identical Markdown outputs [3], and recent studies [4][5] indicate minimal correlation between superficial string matching and actual LLM answering capability. Consequently, legacy form-based metrics can give a misleading forecast of performance for document-centric AI applications.
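To make the formatting penalty concrete, the sketch below computes CER as Levenshtein edit distance divided by reference length. This is a common definition, but the exact normalization used in this benchmark is not stated, so treat it as an assumption. The two Markdown tables are hypothetical: they carry identical cell values and differ only in padding and delimiter-row style.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits needed to turn hypothesis into reference, per reference char."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def table_cells(markdown_table: str) -> list[list[str]]:
    """Extract cell values from a pipe table, ignoring padding and the delimiter row."""
    rows = []
    for line in markdown_table.strip().splitlines():
        parts = [c.strip() for c in line.strip().strip("|").split("|")]
        if all(p and set(p) <= set("-:") for p in parts):
            continue  # delimiter row such as |---|---| or |:--|--:|
        rows.append(parts)
    return rows


# Hypothetical ground truth and parser output with the same content.
reference_md = "| Item | FY2024 |\n|---|---|\n| Revenue | 1,200 |"
hypothesis_md = "|Item|FY2024|\n|:--|--:|\n|Revenue|1,200|"

same_content = table_cells(reference_md) == table_cells(hypothesis_md)
score = cer(reference_md, hypothesis_md)
print(f"same cell content: {same_content}, CER: {score:.3f}")
```

The cell contents are identical, yet the CER is non-zero, which is the kind of penalty a downstream LLM reader would not care about.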
2. What Ur-AI Optimizes For: AI-Readiness
The enterprise data bottleneck is no longer a lack of AI models, but the quality of unstructured data fed into them. We are not building standard OCR, basic parsing tools, or isolated RAG infrastructure.
The Ur-AI Parser is an AI-ready document transformation system converting business documents into structured representations (Markdown, structured text, semantic representations) optimized for downstream reasoning. Operationally, it functions as the transformation layer between documents and AI systems.
2.1 Definition and Components of AI-Readiness
AI-Readiness defines how effectively a document's data representation can be interpreted and queried by LLM-based systems. It requires:
- Text clarity — absence of OCR noise or broken tokens
- Structural integrity — preservation of tables, hierarchy, and sections
- Numerical fidelity — mandatory for financial use cases
- Semantic continuity — unfragmented information
- Delimiter and formatting hygiene
2.2 Controlled Normalization
While generic OCR systems optimize strictly for visual accuracy, the Ur-AI Parser is engineered specifically for LLM readability.
To achieve this, we apply controlled normalization to improve machine readability without altering semantic meaning. In practice, this means inserting strategic delimiters to prevent table columns from bleeding into one another and standardizing excessive whitespace. By doing so, normalization substantially improves the utility of the data for AI environments while strictly preserving the underlying business data and numerical structures.
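As a rough illustration of the idea, the following sketch converts a whitespace-aligned table row into a pipe-delimited row and checks that the numeric tokens survive unchanged. It is not the Ur-AI Parser's implementation: the function names, the "two or more spaces means a column break" heuristic, and the sample row are all assumptions made for this example.

```python
import re


def split_columns(line: str) -> list[str]:
    """Split a row on tabs or runs of two or more whitespace characters (assumed heuristic)."""
    return [c for c in re.split(r"\s{2,}|\t", line.strip()) if c]


def to_delimited_row(line: str) -> str:
    """Insert explicit delimiters so adjacent columns cannot bleed into one another."""
    return "| " + " | ".join(split_columns(line)) + " |"


def numeric_tokens(text: str) -> list[str]:
    """Collect number-like tokens, including a leading negative marker and trailing percent sign."""
    return re.findall(r"[△▲-]?\d[\d,]*(?:\.\d+)?%?", text)


# Hypothetical raw OCR line with irregular spacing between columns.
raw_row = "Revenue      1,200     1,350"
normalized_row = to_delimited_row(raw_row)

print(normalized_row)
# Normalization should change layout only, never the figures themselves.
assert numeric_tokens(raw_row) == numeric_tokens(normalized_row)
```

Comparing numeric tokens before and after is one simple way to check that a normalization step has not touched the business data.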
3. Benchmark Design
To rigorously assess this capability, our benchmark design deliberately shifts away from surface-level metrics toward task-based evaluations. The central guiding question is: Can downstream LLM applications reliably read the parsed text and execute complex logical reasoning?
3.1 Four Levels of Semantic Reasoning Tasks (L1–L4)
Reasoning robustness is categorized into four difficulty tiers:
- L1 — Extraction of isolated numerical values or facts
- L2 — Spreadsheet calculation and structural reasoning (e.g., calculating profit margin delta) [1][2]
- L3 — Hierarchical and contextual reasoning referencing macro and micro structural elements
- L4 — Scheme reasoning involving visual logical flows (charts)
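As an illustration of what an L2 task asks of parsed output, the sketch below reads a small Markdown table and computes a profit margin delta in percentage points. The table, its figures, and the helper names are hypothetical and are not drawn from the benchmark dataset.

```python
def parse_markdown_table(md: str) -> dict[str, dict[str, float]]:
    """Parse a simple pipe table into {row label: {column: value}}."""
    lines = [line.strip() for line in md.strip().splitlines()]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    table: dict[str, dict[str, float]] = {}
    for line in lines[2:]:  # skip the header row and the delimiter row
        cells = [c.strip() for c in line.strip("|").split("|")]
        table[cells[0]] = {
            col: float(val.replace(",", ""))
            for col, val in zip(header[1:], cells[1:])
        }
    return table


def margin_percent(table: dict[str, dict[str, float]], year: str) -> float:
    """Operating profit as a percentage of revenue for one fiscal year."""
    return 100 * table["Operating profit"][year] / table["Revenue"][year]


# Hypothetical parsed output for a two-year income summary.
parsed_md = """
| Item | FY2023 | FY2024 |
|---|---|---|
| Revenue | 1,200 | 1,350 |
| Operating profit | 96 | 135 |
"""

table = parse_markdown_table(parsed_md)
delta_pp = margin_percent(table, "FY2024") - margin_percent(table, "FY2023")
print(f"margin delta: {delta_pp:.1f} percentage points")
```

If the parser drops a delimiter or merges two columns, the lookup or the arithmetic fails, which is why L2 tasks are sensitive to structural integrity.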
4. Dataset and Methodology
We designed a transparent and rigorous validation setup to evaluate these systems (with a detailed methodology provided in the Appendix).
Dataset Composition
- Volume: 25 documents
- Domain: Japanese financial/business sectors (securities reports, medium-term plans, briefing slides)
- Tasks: 165 textual QA tasks distributed across L1 text retrieval, L2 tabular structure/calculations, and L3 cross-section reasoning; and 198 L4 multimodal reasoning tasks across 100 charts
Evaluation Process
- Generation & Verification: Initial QA datasets ("Golden QA") were generated by cross-referencing multiple model architectures (including OpenAI GPT-5.4 Mini and Google Gemini 3 Flash) to neutralize model-specific formatting bias. Human experts then fixed the Ground Truth values by checking them against the original source page images.
- LLM Judge Prompts: The evaluation prompt injected the raw parsed Markdown as sole context into Google Gemini 2.5 Flash for reasoning execution. Google Gemini 3 Flash served as the definitive Judge model — scoring responses dynamically against the Golden QA via semantic accuracy logic rather than rigid string matching.
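The evaluation flow described above can be sketched as a two-stage loop: an answer model sees only the parsed Markdown, and a judge model compares its answer with the Golden QA reference. The sketch below is a minimal illustration under stated assumptions. The prompt wording, the CORRECT/INCORRECT verdict format, and the `ModelFn` interface are hypothetical, and the two stub functions stand in for the real Gemini calls, which are not shown in this report.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class QAItem:
    question: str
    gold_answer: str
    level: str  # "L1" .. "L4"


ModelFn = Callable[[str], str]  # prompt in, text out (hypothetical interface)


def build_answer_prompt(markdown: str, question: str) -> str:
    # The parsed Markdown is the sole context; outside knowledge is disallowed.
    return (
        "Answer using only the document below. "
        "If the answer is not in the document, say 'unknown'.\n"
        f"--- DOCUMENT ---\n{markdown}\n--- END ---\n"
        f"Question: {question}"
    )


def build_judge_prompt(item: QAItem, candidate: str) -> str:
    return (
        f"Question: {item.question}\n"
        f"Reference answer: {item.gold_answer}\n"
        f"Candidate answer: {candidate}\n"
        "Reply CORRECT or INCORRECT, judging meaning rather than exact wording."
    )


def evaluate(markdown: str, items: list[QAItem],
             answer_model: ModelFn, judge_model: ModelFn) -> dict[str, float]:
    """Return accuracy per difficulty level for one parser's output."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for item in items:
        candidate = answer_model(build_answer_prompt(markdown, item.question))
        verdict = judge_model(build_judge_prompt(item, candidate))
        totals[item.level] += 1
        hits[item.level] += verdict.strip().upper().startswith("CORRECT")
    return {level: hits[level] / totals[level] for level in totals}


# Toy stand-ins so the sketch runs without any API access.
def stub_answer_model(prompt: str) -> str:
    return "1,200" if "1,200" in prompt else "unknown"


def stub_judge_model(prompt: str) -> str:
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    ref = fields["Reference answer"].replace(",", "")
    cand = fields["Candidate answer"].replace(",", "")
    return "CORRECT" if ref == cand else "INCORRECT"


items = [
    QAItem("What was FY2024 revenue?", "1200", "L1"),
    QAItem("What was FY2024 headcount?", "350", "L1"),
]
scores = evaluate("| Revenue | 1,200 |", items, stub_answer_model, stub_judge_model)
print(scores)
```

Note that the stub judge accepts "1,200" for a reference of "1200": a toy version of scoring by meaning rather than by exact string match.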
Benchmarked Models (March 2026)
- Azure Document Intelligence (Enterprise API standard)
- LlamaParse (Third-party API standard)
- IBM Docling (State-of-the-art local OSS)
- Ur-AI Parser API
5. Results
The task evaluation results indicate that while models optimized for downstream reasoning show measurable differences in multimodal task performance, these capabilities are largely unreflected in traditional distance-based metrics.
5.1 Multimodal Evaluation of Infographics (L4 Category)
Handling complex visual data is a primary requirement for document processing pipelines. We evaluated 198 L4 queries across 100 charts.
| Method | Accuracy | Rank |
|---|---|---|
| Google Gemini 3 Flash | 78.8% | 1 |
| Ur-AI Parser | 71.2% | 2 |
| Azure Document Intelligence | 42.9% | 3 |
IBM Docling and LlamaParse were excluded from this category because their image analysis capabilities were not compatible with the L4 tasks.
The Ur-AI Parser recorded 71.2% accuracy in this category, compared with 42.9% for Azure Document Intelligence, a gap of 28.3 percentage points. Dedicated context-heavy VLMs (Google Gemini 3 Flash) achieved the highest score (78.8%). Even so, replacing raw image inputs with the structured text representations generated by the Ur-AI Parser can help stabilize data extraction and mitigate visual hallucinations in chart interpretation [6].
5.2 AI Task Evaluation Performance (L1–L3 Categories)
We also evaluated accuracy across standard textual and structural reasoning tasks (165 queries).
| Method | Overall Accuracy | Rank |
|---|---|---|
| Azure Document Intelligence | 72.1% | 1 |
| Ur-AI Parser | 70.3% | 2 |
| LlamaParse | 70.3% | 2 |
| IBM Docling | 68.5% | 4 |
The Ur-AI Parser recorded 70.3% overall accuracy, in the same range as Azure Document Intelligence (72.1%) and LlamaParse (70.3%). In the L2 domain (table understanding and calculation), the Ur-AI Parser scored between 77% and 78%, comparable to Azure Document Intelligence. This indicates that the normalization process preserves underlying structural integrity without restricting calculation capability. For L3 cross-section tasks, accuracy was 62%, suggesting that extracting header representations in deep hierarchies remains an architectural constraint across the tested models.
5.3 Traditional Metrics (CER/TEDS) vs. Task Capability
Evaluating these models using legacy alignment metrics shows a divergence from the task-based results above.
| Method | CER (Error Rate) ↓ | TEDS ↑ |
|---|---|---|
| Google Gemini 3 Flash | 0.0059 | 0.9368 |
| LlamaParse | 0.0502 | 0.4092 |
| Azure Document Intelligence | 0.0809 | 0.4076 |
| IBM Docling | 0.1498 | 0.4881 |
| Ur-AI Parser | 0.2141 | 0.4268 |
Using strict CER, the controlled normalization applied by the Ur-AI Parser incurs penalties, increasing the measured error rate. TEDS scores also show variance depending on the Markdown syntax used for the Ground Truth [3]. However, comparing these figures with the L1–L4 capability evaluations (Sections 5.1 & 5.2) demonstrates that traditional metrics do not consistently predict downstream LLM task proficiency, indicating the limitations of relying solely on visual replication scores for AI applications.
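One way to quantify this divergence is a rank correlation between CER and task accuracy. The sketch below computes Spearman's rho, with average ranks for ties, over the four parsers that appear in both the Section 5.2 and Section 5.3 tables. With only four data points this is purely illustrative and supports no statistical inference. The helper names are hypothetical.

```python
from math import sqrt


def average_ranks(values: list[float], descending: bool = False) -> list[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Values copied from the tables in Sections 5.2 and 5.3.
parsers = ["Azure Document Intelligence", "LlamaParse", "IBM Docling", "Ur-AI Parser"]
cer_scores = [0.0809, 0.0502, 0.1498, 0.2141]   # lower is better
l1_l3_accuracy = [72.1, 70.3, 68.5, 70.3]       # higher is better

cer_rank = average_ranks(cer_scores)                        # rank 1 = lowest CER
acc_rank = average_ranks(l1_l3_accuracy, descending=True)   # rank 1 = highest accuracy

# Spearman's rho is the Pearson correlation of the two rank vectors.
rho = pearson(cer_rank, acc_rank)
print(f"Spearman rho between CER rank and accuracy rank: {rho:.3f}")
```

A rho near 1 would mean CER ordering predicts task ordering. Here it comes out at roughly 0.32, which is consistent with the weak relationship described above.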
6. Error Analysis
A detailed classification of errors helps map systemic constraints and identifies clear requirements for future architectural iterations.
For instance, single-character misrecognition, such as reading the negative marker "△" in "△161,921" as the digit "4", is a limit of the underlying spatial recognition engine rather than a flaw in LLM reasoning. Furthermore, for dense objects or complex infographics, pixel-area estimation inevitably introduces a measurable degree of approximation drift.
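The effect of the "△" misread on a downstream calculation can be sketched as follows. The parsing helper and the digit-grouping check are hypothetical illustrations, not part of the Ur-AI Parser. They assume that "△" (or "▲") prefixes a negative amount and that commas separate groups of three digits.

```python
import re

NEGATIVE_MARKERS = ("△", "▲", "-", "−")


def parse_amount(text: str) -> int:
    """Convert a formatted amount such as '△161,921' into a signed integer."""
    s = text.strip()
    negative = s.startswith(NEGATIVE_MARKERS)
    if negative:
        s = s[1:]
    digits = s.replace(",", "")
    if not digits.isdecimal():
        raise ValueError(f"not a numeric amount: {text!r}")
    return -int(digits) if negative else int(digits)


def has_valid_grouping(text: str) -> bool:
    """True if commas split the digits into a leading group of 1-3 and groups of exactly 3."""
    return re.fullmatch(r"[△▲\-−]?\d{1,3}(,\d{3})*", text.strip()) is not None


correct = "△161,921"   # the figure as printed in the source
misread = "4161,921"   # the same figure after the marker is read as the digit 4

print(parse_amount(correct), has_valid_grouping(correct))
print(parse_amount(misread), has_valid_grouping(misread))
```

The misread turns a negative amount into a positive number more than twenty times larger in magnitude. In this particular case the broken thousands grouping makes the error detectable by a simple rule check.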
7. Limitations
It is important to acknowledge the limitations of this benchmark. Primarily, the sample size of 165 queries is too small to claim statistical significance for the differences observed between models. Consequently, the current dataset serves to indicate a comparable performance tier among the leading cohort, rather than proving absolute superiority over competitors.
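The sample-size caveat can be checked with a rough two-proportion z-test on the L1–L3 results. The correct-answer counts below (119 and 116 out of 165) are inferred from the reported percentages rather than taken from raw data, so they are an assumption. All parsers were also scored on the same questions, so a paired test such as McNemar's would be more appropriate if per-question results were available. The unpaired test here is only an approximation.

```python
from math import erfc, sqrt


def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z statistic and two-sided p-value (normal approximation)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


N_QUERIES = 165
azure_correct = 119   # inferred: 119 / 165 rounds to 72.1%
urai_correct = 116    # inferred: 116 / 165 rounds to 70.3%

z, p_value = two_proportion_z(azure_correct, N_QUERIES, urai_correct, N_QUERIES)
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

Under these assumptions the z statistic is well below the conventional 1.96 threshold, which agrees with reading the 72.1% and 70.3% results as the same performance tier.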
Error modes and targeted fixes (from the error analysis in Section 6)
| Failure Mode | Impact | Example | Planned Fix |
|---|---|---|---|
| OCR symbol confusion | Numeric calculation errors | Confusing "△" (minus) and "4" | Refine foundational OCR training data and rule sets |
| Dense chart approximation | Value drift | Approximating 145% as 140% | Upgrade chart structural extraction algorithms |
| Cross-section reasoning | Weaker L3 hierarchical tasks | Logic breakdown across pages | Deepen markup linking for structural headers |
8. Why This Matters for AI Applications
Ultimately, document transformation is a core determinant of AI application reliability, not just a simple preprocessing step.
Enterprise AI workloads (including investment due diligence, auditing, financial copilots, agent-based enterprise workflows, and knowledge extraction systems) depend on the reliable ingestion of unstructured documents. The AI-readiness of this transformed text directly impacts:
- Answer model (LLM) reasoning accuracy
- Information-driven hallucination rates
- Logical and computational stability over dense data arrays
Rather than focusing solely on visual transcription, the Ur-AI Parser API functions as an AI-ready transformation system. By passing normalized, semantically consistent data to downstream reasoning engines, it supports greater reliability across production AI workflows.
Appendix: Detailed Methodology
1. Creation of Golden QA
- Candidate questions were generated for the 25 target documents using diverse prompts and multiple, differing LLM architectures.
- Data annotators (domain experts) cross-referenced source pages to lock Ground Truth variables (numerics, logic, context).
- Blind Protocol: QA dataset generation was completely blind to parser outputs. Questions were sourced exclusively from the raw PDFs/images. This structurally guards against "Potemkin bias" (cherry-picking questions that target specific parser strengths) and supports benchmark neutrality.
2. Evaluation Prompts and LLM Judge Design
- Context prompts feeding output Markdown to Google Gemini models were tightly standardized.
- Meta-prompts restricted LLM generation purely to document data, barring external knowledge.
- Custom Judge scoring logic penalized practical semantic faults rather than superficial string deviations.
References
- [1] Chen et al. (2021). FinQA: A Dataset of Numerical Reasoning over Financial Data. EMNLP 2021.
- [2] Zhu et al. (2021). TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. ACL 2021.
- [3] Blecher et al. (2023). Nougat: Neural Optical Understanding for Academic Documents.
- [4] Peng et al. (2025). UniDoc-Bench: A Unified Benchmark for Document-Centric Multimodal RAG.
- [5] Zheng et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
- [6] Wang et al. (2025). ChartHal: A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding.
Frequently Asked Questions
What is AI-Readiness in document processing?
AI-Readiness defines how effectively a document's data representation can be interpreted and queried by LLM-based systems. It requires text clarity (absence of OCR noise), structural integrity (preservation of tables and hierarchy), numerical fidelity (especially for financial data), semantic continuity, and delimiter hygiene. Traditional metrics like CER and TEDS measure visual accuracy but do not reliably predict whether downstream LLMs can solve reasoning tasks using the extracted text.
How does the Ur-AI Parser compare to Azure Document Intelligence and LlamaParse?
On 165 textual and structural reasoning tasks (L1–L3), the Ur-AI Parser achieved 70.3% overall accuracy — on par with LlamaParse (70.3%) and close to Azure Document Intelligence (72.1%). On 198 multimodal L4 tasks across 100 charts, the Ur-AI Parser recorded 71.2% accuracy versus 42.9% for Azure Document Intelligence. Traditional CER/TEDS metrics do not reflect this task-level performance, demonstrating that visual accuracy scores alone are insufficient for evaluating AI application readiness.
