The Evolution of OCR-Free Visual Document Understanding: From Heuristic OCR to End-to-End Multimodal Transformers
DOI: https://doi.org/10.71143/gaza7e33
Keywords: Visual Document Understanding, Optical Character Recognition, End-to-End Transformer, Document Understanding Transformer
Abstract
Visual Document Understanding (VDU) has undergone a seismic shift from multi-stage pipelines involving Optical Character Recognition (OCR) to end-to-end, pixel-to-text multimodal transformers. Traditional methods relied heavily on the accuracy of off-the-shelf OCR engines to provide textual inputs for downstream NLP models, creating a systemic vulnerability known as OCR error propagation. This paper provides a comprehensive review of the emerging OCR-free paradigm. We dissect the architectural transition from layout-aware transformers to pure Vision Transformers (ViTs), hierarchical variants such as the Swin Transformer, and generative frameworks such as Donut and Pix2Struct. We analyze the core technical challenges, including the high-resolution bottleneck and cross-modal alignment, and evaluate state-of-the-art performance across industry-standard benchmarks. Finally, we propose future research trajectories in the context of Large Multimodal Models (LMMs).
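The pixel-to-text idea and the high-resolution bottleneck mentioned above can be illustrated with a small sketch. A ViT-style encoder first cuts the page image into non-overlapping patches, each flattened into one input vector, and the resulting sequence length grows quadratically with resolution. The helper below is a toy illustration of this patch arithmetic, not the actual Donut or Pix2Struct implementation; the function name and shapes are assumptions for exposition.

```python
# Toy illustration of the OCR-free "pixel-to-text" front end: a
# ViT-style encoder patchifies the document image, and a decoder
# later generates text tokens directly, with no OCR stage in between.

def patchify(height, width, channels, patch):
    """Return (num_patches, patch_dim) for a ViT patch embedding:
    the image is cut into non-overlapping patch x patch squares,
    each flattened into a single input vector for the encoder."""
    assert height % patch == 0 and width % patch == 0
    num_patches = (height // patch) * (width // patch)
    patch_dim = patch * patch * channels
    return num_patches, patch_dim

# A 2560x1920 RGB document page with 16x16 patches yields a very
# long input sequence -- the "high-resolution bottleneck":
n, d = patchify(2560, 1920, 3, 16)
print(n, d)  # 19200 patches, each a 768-dimensional vector
```

Because self-attention cost scales with the square of this sequence length, such page resolutions motivate hierarchical encoders (e.g., Swin's windowed attention) and Pix2Struct's variable-resolution patching.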
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.