The Evolution of OCR-Free Visual Document Understanding: From Heuristic OCR to End-to-End Multimodal Transformers

Authors

  • Aarti Ahirwar, Student, Department of Computer Science and Engineering, CSE-IET, SAGE University, Indore, Madhya Pradesh, India
  • Ritu Tandon, Associate Professor, Department of Computer Science and Engineering, CSE-IET, SAGE University, Indore, Madhya Pradesh, India

DOI:

https://doi.org/10.71143/gaza7e33

Keywords:

Visual Document Understanding, Optical Character Recognition, End-to-End Transformer, Document Understanding Transformer.

Abstract

Visual Document Understanding (VDU) has undergone a seismic shift from multi-stage pipelines involving Optical Character Recognition (OCR) to end-to-end, pixel-to-text multimodal transformers. Traditional methods relied heavily on the accuracy of off-the-shelf OCR engines to provide textual inputs for downstream NLP models, creating a systemic vulnerability known as OCR error propagation. This paper provides a comprehensive review of the emerging OCR-free paradigm. We dissect the architectural transition from Layout-aware Transformers to pure Vision Transformers (ViT), hierarchical structures like Swin, and generative frameworks such as Donut and Pix2Struct. We analyze the core technical challenges, including the high-resolution bottleneck and cross-modal alignment, while evaluating state-of-the-art performance across industry-standard benchmarks. Finally, we propose future research trajectories in the context of Large Multimodal Models (LMMs).

Published

02-04-2026

How to Cite

Aarti Ahirwar, & Ritu Tandon. (2026). The Evolution of OCR-Free Visual Document Understanding: From Heuristic OCR to End-to-End Multimodal Transformers. International Journal of Research and Review in Applied Science, Humanities, and Technology, 3(2), 85-93. https://doi.org/10.71143/gaza7e33
