📝 HindiOCR-VLM

Adapting Vision-Language Models for Hindi OCR

Shaon Bhattacharyya*, Souvik Ghosh*, Prantik Deb*, Ajoy Mondal, C.V. Jawahar
CVIT, IIIT Hyderabad
* Equal Contribution.
To be presented at ICDAR 2025 (September 16-21, Wuhan, China).
Abstract

Optical Character Recognition (OCR) for Indian languages presents unique challenges due to the diversity and complexity of scripts, which include intricate characters, diacritics, and various writing styles. This study introduces HindiOCR-VLM, an initiative to adapt Vision-Language Models (VLMs) specifically for OCR in Hindi. We build on a pre-trained model initially developed for documents in Chinese and English and utilize Low-Rank Adaptation (LoRA) to fine-tune it effectively for multi-domain applications. Inspired by human learning processes, we propose a progressive learning approach, a training strategy to enhance language acquisition and accelerate convergence. Furthermore, we leverage the rich representations of the vision encoder to support multi-domain training across printed, handwritten, and scene text. Our experiments demonstrate how VLMs tackle the complexities of Indian scripts, such as Devanagari, leading to improved character and word recognition accuracy. Comparative evaluations against existing benchmarks reveal that HindiOCR-VLM outperforms domain-specific models, establishing a unified, generalized multi-domain model that showcases superior performance across all domains. This work marks a significant advancement in OCR technology for the Hindi language.
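
As a rough illustration of the LoRA adaptation mentioned above, the sketch below attaches low-rank adapters to the attention projections of a generic Hugging Face vision-language checkpoint. The checkpoint name, target modules, and hyperparameters are placeholders chosen for the example, not the configuration reported in the paper.

```python
# Minimal LoRA fine-tuning sketch using Hugging Face PEFT.
# The base checkpoint, target modules, and hyperparameters are illustrative
# placeholders, not the values used to train HindiOCR-VLM.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

base_model_id = "Qwen/Qwen2-VL-2B-Instruct"   # placeholder VLM checkpoint
processor = AutoProcessor.from_pretrained(base_model_id)
model = AutoModelForVision2Seq.from_pretrained(base_model_id)

lora_config = LoraConfig(
    r=16,                     # rank of the low-rank update matrices
    lora_alpha=32,            # scaling applied to the low-rank update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```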

Proposed Pipeline
Proposed Pipeline Diagram

The proposed pipeline leverages a modern Vision-Language Model, fine-tuned progressively for Hindi OCR tasks. The architecture integrates image and text modalities, enabling robust recognition across diverse document types.
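
One plausible reading of the progressive fine-tuning described above is a curriculum over increasingly long targets, moving from word crops to lines, blocks, and finally full pages (the granularities listed in Table 1). The sketch below illustrates that staging; the actual stage order, schedule, and hyperparameters used in the paper may differ.

```python
# Sketch of a progressive (curriculum-style) fine-tuning loop over OCR targets
# of increasing length: word -> line -> block -> page. The staging and the
# per-stage epoch counts are assumptions for illustration only.
STAGES = ["word", "line", "block", "page"]
EPOCHS_PER_STAGE = {"word": 2, "line": 2, "block": 1, "page": 1}  # placeholders

def train_progressively(model, optimizer, loaders_by_stage):
    """loaders_by_stage maps a stage name to a DataLoader yielding model inputs."""
    for stage in STAGES:
        loader = loaders_by_stage[stage]
        for _ in range(EPOCHS_PER_STAGE[stage]):
            for batch in loader:
                # Standard teacher-forced language-modelling loss on the transcription.
                outputs = model(
                    pixel_values=batch["pixel_values"],
                    input_ids=batch["input_ids"],
                    labels=batch["labels"],
                )
                outputs.loss.backward()
                optimizer.step()
                optimizer.zero_grad()
    return model
```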

Training Dataset
Training Dataset Example

The in-house training dataset for HindiOCR-VLM comprises a diverse collection of printed, handwritten, and scene text images in Hindi. It covers the variability in fonts, writing styles, and real-world conditions necessary for robust OCR model training. For mixed-modality training, we used 400k word-level images per modality (1.2M total), with both setups evaluated on a challenging test set. We have released the test set so the community can evaluate model performance on it.

Training set (Printed Text)
Level   Train        Val
Words   11,924,480   19,654
Line    1,016,000    1,016
Block   352,737      7,054
Page    36,340       181

Challenging test set
Modality      Pages   Words
Printed       150     47,587
Handwritten   100     4,347
Scene Text    50      378

Table 1: Dataset statistics. Left: training set for printed text; right: challenging test set.
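
The mixed-modality setup described above draws an equal number of word-level samples from each modality. The sketch below shows one simple way such a balanced pool could be assembled; the sampling procedure and the argument format are assumptions for illustration, not the paper's data pipeline.

```python
# Illustrative balanced sampling of word-level images across the three
# modalities (printed, handwritten, scene text), mirroring the 400k-per-modality
# mixed-modality setup described above. The procedure itself is an assumption.
import random

SAMPLES_PER_MODALITY = 400_000

def build_mixed_pool(printed, handwritten, scene, n=SAMPLES_PER_MODALITY, seed=0):
    """Each argument is a list of (image_path, transcription) pairs."""
    rng = random.Random(seed)
    pool = []
    for subset in (printed, handwritten, scene):
        pool.extend(rng.sample(subset, min(n, len(subset))))
    rng.shuffle(pool)  # interleave modalities within each epoch
    return pool
```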
Results
Qualitative results on Printed Text, Handwritten Text, and Scene Text.
Single Stage OCR Results
Performance comparison with commercial and non-commercial OCRs on Printed Text. WRR and CRR denote word recognition rate and character recognition rate (in %). † and ‡ indicate two-step and single-step approaches, respectively.
Method          WRR     CRR
Google OCR      87.35   96.16
Azure OCR       87.17   96.94
CRNN            87.32   96.22
Surya OCR       83.69   80.66
HindiOCR-VLM    90.77   95.05
HindiOCR-VLM    84.74   92.63
Two Stage OCR Results
Performance comparison of OCR methods on different text modalities. A '-' indicates that the model is not applicable to that modality.
Method                 Printed Text       Handwritten Text   Scene Text
                       WRR      CRR       WRR      CRR       WRR      CRR
Google OCR             87.35    96.16     73.27    80.88     74.38    87.56
Lipikar                -        -         -        -         65.81    74.16
Azure OCR              87.17    96.94     -        -         -        -
Domain Specific OCR    87.32    96.22     73.69    69.63     71.21    85.54
Surya OCR              83.69    80.66     -        -         -        -
HindiOCR-VLM           91.28    97.12     75.21    89.78     67.15    84.85
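
For reference, WRR and CRR are computed below in the way they are usually defined for OCR evaluation: WRR as the percentage of exactly matched words and CRR from the character-level edit distance. The paper's own evaluation script may normalise text differently.

```python
# Common definitions of the reported metrics (assumed, not taken from the paper's code):
#   WRR: percentage of predicted words that exactly match the reference.
#   CRR: 100 * (1 - total character edit distance / total reference characters).

def edit_distance(a: str, b: str) -> int:
    """Levenshtein (character-level) edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def wrr(references, predictions):
    correct = sum(r == p for r, p in zip(references, predictions))
    return 100.0 * correct / len(references)

def crr(references, predictions):
    errors = sum(edit_distance(r, p) for r, p in zip(references, predictions))
    total = sum(len(r) for r in references)
    return 100.0 * (1.0 - errors / total)
```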
Bibtex