The Evolution of OCR-Free Visual Document Understanding: From Heuristic OCR to End-to-End Multimodal Transformers

Aarti Ahirwar; Ritu Tandon

doi:10.71143/gaza7e33

Authors

Aarti Ahirwar, Student, Department of Computer Science and Engineering, CSE-IET, SAGE University, Indore, Madhya Pradesh, India
Ritu Tandon, Associate Professor, Department of Computer Science and Engineering, CSE-IET, SAGE University, Indore, Madhya Pradesh, India

DOI:

https://doi.org/10.71143/gaza7e33

Keywords:

Visual Document Understanding, Optical Character Recognition, End-to-End Transformer, Document Understanding Transformer

Abstract

Visual Document Understanding (VDU) has shifted from multi-stage pipelines built on Optical Character Recognition (OCR) to end-to-end, pixel-to-text multimodal transformers. Traditional methods depended on the accuracy of off-the-shelf OCR engines to supply textual input for downstream NLP models, so recognition mistakes carried through to every later stage, a systemic vulnerability known as OCR error propagation. This paper provides a comprehensive review of the emerging OCR-free paradigm. We dissect the architectural transition from layout-aware Transformers to pure Vision Transformers (ViT), hierarchical architectures such as Swin, and generative frameworks such as Donut and Pix2Struct. We analyze the core technical challenges, including the high-resolution bottleneck and cross-modal alignment, and evaluate state-of-the-art performance across industry-standard benchmarks. Finally, we propose future research trajectories in the context of Large Multimodal Models (LMMs).

Published

02-04-2026

Issue

Vol. 3 No. 2 (2026): IJRASHT: Vol 3, Issue 2, April-June 2026

Section

Articles

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

How to Cite

Aarti Ahirwar, & Ritu Tandon. (2026). The Evolution of OCR-Free Visual Document Understanding: From Heuristic OCR to End-to-End Multimodal Transformers. International Journal of Research and Review in Applied Science, Humanities, and Technology, 3(2), 85-93. https://doi.org/10.71143/gaza7e33
