Insights · Research · 6 min read · June 21, 2026

Why Frontier LLMs Fail at Parsing Japanese Documents (and What Makes Japanese Unique)

Frontier LLMs fail at Japanese documents because Japanese mixes three scripts, omits spaces between words, and is often written vertically, which pushes models’ character error rates up roughly tenfold. Here is what makes Japanese documents unique, where the models break, and what actually works.

Sandeep Yella

Founder, CEO & CTO

Frontier large language models fail at parsing Japanese documents because Japanese stacks together features that no other major language combines all at once: three scripts intermixed in a single sentence — kanji, hiragana, and katakana — alongside Latin letters and Arabic numerals, no spaces between words, and text that is frequently written top-to-bottom in vertical columns (tategaki). On vertically written Japanese, the character error rate of models like GPT-4.1 and GPT-5 rises roughly tenfold compared with the same text written horizontally. Layer in the dense reality of business documents — nested tables, business charts, red hanko seals, and tiny reading aids printed beside characters — and accuracy degrades further still.

If you have ever pasted a Japanese annual report, a real-estate registry, or a scanned contract into a frontier chatbot and watched it confidently return text that is subtly — or wildly — wrong, you have met the problem first-hand. These models read Japanese fluently when it looks like the web text they were trained on: clean, horizontal, modern. Real documents rarely cooperate.

What makes Japanese documents unique

Japanese has one of the most orthographically complex writing systems in active use. A single sentence routinely mixes three scripts: kanji (logographic characters borrowed from Chinese; the standard jōyō list taught in schools has 2,136 characters, and the full inventory runs to tens of thousands), plus two syllabaries of 46 basic characters each, hiragana for native words and grammar and katakana for foreign loanwords and emphasis. Latin letters (rōmaji) and Arabic numerals appear freely alongside them.

Three intermixed scripts — kanji, hiragana, and katakana — often within a single word, with no visual delimiter between them.
No spaces between words. Japanese text wraps from line to line without regard for word boundaries; the reader infers where one word ends and the next begins.
Two writing directions. Modern documents use horizontal yokogaki (left-to-right), but contracts, newspapers, novels, and many official forms use vertical tategaki — columns read top-to-bottom, ordered right-to-left.
Thousands of visually similar characters. Dense kanji can differ by a single stroke and share components, making character-level precision unforgiving.

Japanese is not “English in a different alphabet.” It is three scripts, two reading directions, and no spaces — nearly every assumption an OCR model makes about Western text is wrong.

Why vertical writing breaks frontier LLMs

Vertical writing is where the gap becomes measurable. A November 2025 study evaluating multimodal LLMs on vertically written Japanese found that every frontier model tested performed dramatically worse on tategaki than on the identical content laid out horizontally. The reason is structural: these models are trained overwhelmingly on horizontal, left-to-right text, so when characters run top-to-bottom in columns ordered right-to-left, the model frequently reads them in the wrong order — or reverts to scanning horizontally and produces nonsense.
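
The reading-order failure can be made concrete with a toy example. The sketch below lays a short string out as tategaki on a character grid, then reads it back two ways: row by row, as a horizontally biased reader would, and column by column from the right. The grid model and function names are simplifying assumptions for illustration and do not describe how any specific model processes images.

```python
def layout_vertical(text: str, rows: int) -> list[list[str]]:
    """Place text on a grid as tategaki: fill each column top to bottom,
    starting from the rightmost column. Unused cells stay empty."""
    cols = -(-len(text) // rows)  # ceiling division
    grid = [[""] * cols for _ in range(rows)]
    for i, ch in enumerate(text):
        col_from_right, row = divmod(i, rows)
        grid[row][cols - 1 - col_from_right] = ch
    return grid


def read_horizontal(grid: list[list[str]]) -> str:
    """Read the grid row by row, left to right (the wrong order for tategaki)."""
    return "".join("".join(row) for row in grid)


def read_vertical(grid: list[list[str]]) -> str:
    """Read the grid column by column, rightmost column first, top to bottom."""
    cols = len(grid[0])
    return "".join(row[c] for c in range(cols - 1, -1, -1) for row in grid)


# Hypothetical example: "check the contract", set in two columns of four characters.
text = "契約書を確認する"
grid = layout_vertical(text, rows=4)
print(read_vertical(grid))    # 契約書を確認する
print(read_horizontal(grid))  # 確契認約す書るを
```

Every character is recognized correctly in both readings. Only the order differs, yet the horizontal reading is unusable, which is why reading-order errors inflate character error rate so sharply.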

Model	Horizontal CER	Vertical CER
GPT-4.1	1.88%	18.2%
GPT-5	2.09%	21.3%
InternVL3-38B	0.89%	22.1%
Gemma 3 27B	2.13%	7.62%

Character error rate (CER) on single-column Japanese text — lower is better. Source: Evaluating Multimodal LLMs on Vertically Written Japanese Text (arXiv:2511.15059, 2025).
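
The "roughly tenfold" figure can be checked against the table with a few lines of arithmetic. The values below are copied from the table; the variable names are illustrative.

```python
# (horizontal CER %, vertical CER %) per model, copied from the table above.
cer_percent = {
    "GPT-4.1": (1.88, 18.2),
    "GPT-5": (2.09, 21.3),
    "InternVL3-38B": (0.89, 22.1),
    "Gemma 3 27B": (2.13, 7.62),
}

# Ratio of vertical to horizontal error rate for each model.
ratios = {
    model: vertical / horizontal
    for model, (horizontal, vertical) in cer_percent.items()
}

for model, ratio in ratios.items():
    print(f"{model}: {ratio:.1f}x")
```

The two GPT models land close to 10x (about 9.7x and 10.2x). The spread across models is wide: Gemma 3 27B degrades by about 3.6x, while InternVL3-38B, the most accurate model on horizontal text, degrades by about 24.8x.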

Character error rate (CER) measures the share of characters wrongly inserted, deleted, or substituted, so lower is better. On horizontal text the frontier models are near-flawless, at roughly 2% error or less. Set the same text in vertical columns and GPT-4.1 jumps to 18.2% and GPT-5 to 21.3%, about one character in five wrong. No one can trust a contract clause when a fifth of it may be garbled and there is no easy way to tell which fifth.
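
For readers who want to compute CER on their own documents, here is a minimal sketch using the standard definition: Levenshtein edit distance divided by the length of the reference text. The cited study may normalize text differently before scoring, so treat this as an approximation of the metric rather than a reproduction of the benchmark. The example strings are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions
    needed to turn the reference into the hypothesis."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,             # character missing from the hypothesis
                curr[j - 1] + 1,         # extra character in the hypothesis
                prev[j - 1] + (r != h),  # substitution (cost 0 if they match)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical OCR output: 書 misread as 害, and す dropped.
reference = "契約書を確認する"
hypothesis = "契約害を確認る"
print(f"{cer(reference, hypothesis):.2%}")  # 25.00%
```

Because CER is normalized by reference length, it can exceed 100% when the output contains many spurious characters.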

Real business documents are harder than test sentences

Those benchmark numbers come from relatively clean text. Actual Japanese business documents add layers that compound the problem:

Dense, nested tables. Japanese financial statements and registries pack multi-level header tables with merged cells; general-purpose models lose row–column alignment, so values land in the wrong cell.
Business charts and graphs. Visual reasoning over charts is where the gap is widest — established OCR and document-AI tools score only in the low-40 percent range on chart understanding, misreading axes, legends, and the values they encode.
Hanko seals. Red cinnabar stamps overlap printed text on contracts and approvals, occluding the very characters that carry legal weight.
Furigana and ruby text. Tiny kana printed beside or above kanji as reading aids get merged into the main text stream, corrupting the result.
Mixed directions on one page. A vertical body can sit alongside horizontal table captions, page numbers, or stamps.
Handwriting and historical forms. Older filings and handwritten annotations sit far outside modern training data.
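
To make the furigana problem concrete, the sketch below separates ruby glyphs from base text using glyph height, relying on the convention that furigana is typically set at about half the size of the text it annotates. The `Glyph` structure, the 0.6 threshold, and the sample data are all assumptions for illustration; a real pipeline would also use glyph position and would need a sturdier estimate of body size on pages where ruby is dense.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    text: str
    height: float  # height of the glyph's bounding box, in pixels


def split_ruby(glyphs: list[Glyph], ratio: float = 0.6) -> tuple[str, str]:
    """Split glyphs into (base text, ruby text) by comparing each glyph's
    height with the median height on the line."""
    body_height = median(g.height for g in glyphs)
    base = [g.text for g in glyphs if g.height >= ratio * body_height]
    ruby = [g.text for g in glyphs if g.height < ratio * body_height]
    return "".join(base), "".join(ruby)


# Hypothetical OCR stream in which furigana has been merged into the main text.
merged = [
    Glyph("漢", 40), Glyph("か", 18), Glyph("ん", 18),
    Glyph("字", 40), Glyph("じ", 18),
    Glyph("を", 40),
    Glyph("読", 40), Glyph("よ", 18),
    Glyph("む", 40),
]
base_text, ruby_text = split_ruby(merged)
print(base_text)  # 漢字を読む
print(ruby_text)  # かんじよ
```

Without this separation the stream reads 漢かん字じを読よむ, which is the kind of corrupted output the furigana item above describes.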

Worse than a visible error is an invisible one. Frontier LLMs are probabilistic: the same document can return slightly different output across runs even at temperature zero, and a model under pressure can fabricate plausible field values that were never in the document. In casual use that is an annoyance. In a high-stakes workflow — a financial close, a compliance filing, a contract review — a confidently invented number is a liability.
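
One inexpensive guard against run-to-run variation is to extract the same document several times and flag every field on which the runs disagree. The sketch below assumes each run has already been parsed into a flat dictionary of field names to string values; the field names and values are hypothetical.

```python
from collections import Counter


def unstable_fields(runs: list[dict[str, str]]) -> dict[str, Counter]:
    """Return the fields whose extracted value differs across runs,
    with a count of each value seen."""
    fields = set().union(*runs)
    report = {}
    for field in sorted(fields):
        values = Counter(run.get(field, "<missing>") for run in runs)
        if len(values) > 1:
            report[field] = values
    return report


# Three hypothetical extraction runs over the same scanned invoice.
runs = [
    {"invoice_total": "1,280,000", "due_date": "2026-07-31"},
    {"invoice_total": "1,280,000", "due_date": "2026-07-31"},
    {"invoice_total": "1,230,000", "due_date": "2026-07-31"},
]
report = unstable_fields(runs)
for field, values in report.items():
    print(field, dict(values))
```

Agreement across runs is not proof of correctness, since a model can be consistently wrong. Disagreement is still a cheap and reliable signal that a field needs human review.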

On a clean horizontal paragraph, a frontier model looks brilliant. On a stamped, vertically-set, multi-table Japanese contract, it looks brilliant and is wrong — which is worse.

Why this matters for any high-stakes AI workflow

Across construction, logistics, retail, manufacturing, and financial services, much of Japan’s business still runs on exactly the documents frontier models handle worst: vertically-set contracts, hanko-stamped approvals, dense financial tables, inspection sheets, and registry extracts. Whether you are feeding a RAG assistant, reconciling invoices, extracting figures for a decision, or routing a claim, a general-purpose chatbot that is 80% right on a tategaki page is not a time-saver — it is a hidden risk, because the 20% it misreads is indistinguishable, to a non-Japanese reader, from the 80% it gets right. In any high-stakes workflow, accuracy you cannot see is accuracy you cannot trust.

What actually works: document intelligence built for Japanese

The fix is not a bigger general-purpose model — it is a document layer engineered for real Japanese business documents. That is what Nebula, our document-intelligence engine, is focused on: the complex tables, business charts, hanko seals, and dense, mixed layouts where general-purpose models quietly fall apart. Nebula turns them into layout-preserved Markdown and structured JSON, with every value traceable to its place on the page — and it already performs strongly on Japanese business documents and on Japanese text in general. Harder cases such as fully vertical (tategaki) text are an active area we are advancing — not a solved problem for anyone in the field. The result is a dependable, auditable input layer beneath any downstream AI system — RAG assistants, analytics, agents, and review workflows — for teams in construction, logistics, retail, finance, and beyond. The principle is the same one behind everything we build: AI you control, insights you trust — with 100% auditability, not a black box you have to take on faith.

A practical test for any AI document tool you are evaluating on Japanese: hand it a vertically-written, hanko-stamped page with a nested table, then check the output character-by-character against the source. If it cannot show you where each value came from, it cannot be trusted with decisions that matter.
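
The character-by-character check can be partly automated once a trusted transcription of the test page exists, which in practice usually means typing it out by hand. This sketch uses Python's standard `difflib` module to list every position where a tool's output departs from that transcription. The function name and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def mismatches(source: str, output: str) -> list[tuple[str, int, str, str]]:
    """List the differences between a trusted transcription and a tool's output
    as (operation, position in source, source text, output text)."""
    matcher = SequenceMatcher(None, source, output, autojunk=False)
    return [
        (tag, i1, source[i1:i2], output[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical contract clause: "The contract amount shall be 1,280,000 yen."
source = "契約金額は1,280,000円とする"
output = "契約金額は1,230,000円とする"
diffs = mismatches(source, output)
for tag, position, expected, got in diffs:
    print(f"{tag} at {position}: expected {expected!r}, got {got!r}")
```

A single wrong digit is a character error rate of under 6% on this clause, yet it changes the contract amount by 50,000 yen. That is why an aggregate accuracy score is no substitute for locating each error.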

Frequently Asked Questions

Why do AI models struggle to read Japanese?

AI models struggle with Japanese because it combines three intermixed scripts (kanji, hiragana, and katakana), uses no spaces between words, and is often written vertically (top-to-bottom, in columns ordered right-to-left) rather than horizontally. Most models are trained mainly on horizontal Western and web text, so vertical Japanese, dense kanji, and complex document layouts push them well outside their comfort zone — error rates on vertical Japanese text can be roughly ten times higher than on horizontal text.

Can ChatGPT or Gemini read vertical Japanese text?

They can attempt it, but accuracy drops sharply. In a 2025 benchmark, GPT-4.1 and GPT-5 achieved about 2% character error rate on horizontal Japanese but rose to roughly 18–21% on the same text written vertically (tategaki), because models tend to read top-to-bottom columns in the wrong order. For casual reading this may be acceptable; for contracts, financial statements, or anything legally binding, the error rate is too high to rely on without verification against the source.

What is tategaki?

Tategaki is the traditional Japanese vertical writing format, in which characters are written in columns from top to bottom, and the columns are read from right to left. It remains standard in novels, newspapers, contracts, and many official forms, while horizontal left-to-right writing (yokogaki) is common in technical, scientific, and digital contexts. Many Japanese documents mix both directions on a single page.

How accurate is AI OCR on Japanese business documents?

It varies enormously with the document. On clean, horizontal printed text, frontier models can exceed 98% character accuracy. But on real business documents with dense tables, business charts, hanko seals, furigana, and vertical text, accuracy falls sharply. In benchmarks of Japanese enterprise and financial documents, established OCR and document-AI tools score only in the low-40 percent range on complex chart and table understanding, and traditional metrics like CER and TEDS (tree-edit-distance-based similarity, a table-structure metric) miss the gap entirely: a page can look right and still be wrong. Reliable results require a document-intelligence system built for Japanese layout, tables, and charts, with output you can audit against the source.
