VietConizer, Vietnamese OCR with NVIDIA DALI

Using DALI to speed up data processing in OCR

For a detailed exploration of the code and methods, see this GitHub repo. You can see how we perform inference with our pretrained model in the inference notebook.

Accelerating OCR Training with NVIDIA DALI: A Practical Guide and Case Study

1. Introduction

Training Deep Learning models for Optical Character Recognition (OCR) often involves complex data loading and augmentation pipelines. These preprocessing steps, if not optimized, can become a significant bottleneck, leaving expensive GPU resources underutilized and prolonging training times.

This document outlines our approach to leveraging the NVIDIA Data Loading Library (DALI) to accelerate the training process for our ResNet + BiLSTM + Attention OCR model built with PyTorch Lightning. We demonstrate substantial speedups compared to standard data loading methods and showcase the importance of hardware-aware pipeline configuration.

2. Data Processing and Dataset Details

Character Normalization

To create a uniform character set for the OCR model, the following text normalizations are applied to the labels:

Original Character(s) Description Normalized Character
, Smart Quotes "
Typographical Apostrophe '
, , Various Dashes -
Ellipsis ...
Ð Icelandic Eth (Uppercase) Đ
ð Icelandic Eth (Lowercase) đ
Ö, Ō O with accents O
Ü, Ū U with accents U
Ā A with macron A
ö, ō o with accents o
ü, ū u with accents u
ā a with macron a

This normalization simplifies the vocabulary the model needs to learn.
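As a sketch, this normalization can be implemented with a single translation table; the character choices follow the table above, while the helper names are our own:

```python
# Map typographically fancy or accented characters to a canonical form,
# mirroring the normalization table above.
NORMALIZATION_MAP = {
    "\u201c": '"', "\u201d": '"',                  # smart quotes -> "
    "\u2019": "'",                                 # typographical apostrophe -> '
    "\u2013": "-", "\u2014": "-", "\u2212": "-",   # dashes / minus sign -> -
    "\u2026": "...",                               # ellipsis -> ...
    "\u00d0": "\u0110", "\u00f0": "\u0111",        # Ð/ð -> Đ/đ
    "\u00d6": "O", "\u014c": "O",                  # Ö, Ō -> O
    "\u00dc": "U", "\u016a": "U",                  # Ü, Ū -> U
    "\u0100": "A",                                 # Ā -> A
    "\u00f6": "o", "\u014d": "o",                  # ö, ō -> o
    "\u00fc": "u", "\u016b": "u",                  # ü, ū -> u
    "\u0101": "a",                                 # ā -> a
}
_TABLE = str.maketrans(NORMALIZATION_MAP)

def normalize_label(text: str) -> str:
    """Normalize an OCR label to the reduced character set."""
    return text.translate(_TABLE)
```

`str.translate` handles one-to-many mappings (e.g. `…` to `...`) in a single pass, so labels can be normalized cheaply at dataset-build time.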

Datasets Used

The training leverages a combined dataset from the following sources:

| Dataset | Train Samples | Validation Samples | Notes |
|---|---|---|---|
| vietocr | 441,025 | 110,257 | Random word images removed |
| Paper (Deep Text Rec. Benchmark) | 3,287,346 | 6,992 | |
| Synth90k | 7,224,612 | 802,731 | |
| Cinnamon AI (Handwritten) | 1,470 | 368 | |
| Combined Total | ~11.0 M | ~0.9 M | |

Vietnamese Data: Note that Vietnamese samples constitute only 1.76% (209,120 images) of this combined dataset: 207,282 from VietOCR and 1,838 from Cinnamon AI. This reflects the limited availability of public Vietnamese OCR data.

3. Training Strategy

Due to the significant imbalance in the dataset, with English samples heavily outweighing Vietnamese ones, a direct training approach proved challenging. Training exclusively on the VietOCR dataset resulted in instability. Therefore, we adopted a two-stage training strategy:

  1. Initial Pre-training: The model was first trained on the entire combined dataset of approximately 11 million images. This allowed the model to learn general text recognition features from a large and diverse dataset, even though it was predominantly English.
  2. Fine-tuning on VietOCR: The checkpoint obtained from the initial pre-training phase was then used to fine-tune the model specifically on the VietOCR dataset. This step adapted the learned features to the nuances of Vietnamese text and improved performance on the target language.
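A minimal sketch of the second stage in PyTorch Lightning; the module names, checkpoint path, and hyperparameters here are illustrative placeholders, not our exact code:

```python
import pytorch_lightning as pl

# Hypothetical names for illustration only.
from model import OCRModel           # ResNet + BiLSTM + Attention LightningModule
from data import VietOCRDataModule   # VietOCR-only data module

# Stage 2: warm-start from the stage-1 checkpoint trained on the combined
# ~11M-image dataset, then fine-tune on Vietnamese data only
# (typically with a lower learning rate).
model = OCRModel.load_from_checkpoint("pretrain_combined.ckpt",
                                      learning_rate=1e-4)
trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=20)
trainer.fit(model, datamodule=VietOCRDataModule())
```

`load_from_checkpoint` restores the pretrained weights while letting you override hyperparameters such as the learning rate for the fine-tuning run.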

The model we release is the one obtained after this fine-tuning process on the VietOCR data. You can see an example of how we perform inference with this model in our inference notebook.

4. Why NVIDIA DALI?

GPU utilization comparison. Left (PyTorch DataLoader): the GPU frequently idles or is underutilized, indicating a data bottleneck. Right (NVIDIA DALI): the GPU maintains consistently high utilization. DALI keeps the L4 GPU working hard, reducing wasted cycles and speeding up training.

DALI is designed specifically to address data pipeline bottlenecks in deep learning workloads. Its key advantages include:

  - GPU-accelerated preprocessing: image decoding (via nvJPEG) and augmentations can run on the GPU instead of the CPU.
  - Pipelined, asynchronous execution: batches are prefetched in the background, so the GPU rarely waits for data.
  - Flexible operator placement: each operator can run on the CPU, the GPU, or in "mixed" mode, letting the pipeline be tuned to the hardware.
  - Drop-in framework integration: ready-made iterators for PyTorch (and thus PyTorch Lightning), TensorFlow, and JAX.

5. When and How to Integrate DALI for Faster OCR Training

Before integrating DALI, it’s crucial to determine if it’s the right tool for your specific bottleneck. As the DALI documentation suggests:

“Q: How do I know if DALI can help me? A: You need to check our docs first and see if DALI operators cover your use case. Then, try to run a couple of iterations of your training with a fixed data source - generating the batch once and reusing it over the test run to see if you can train faster without any data processing. If so, then the data processing is a bottleneck, and in that case, DALI may help.”

Following this guidance:

  1. Verify Operator Coverage: Check the official DALI documentation for supported operators. Ensure DALI provides the necessary functions for your OCR preprocessing pipeline (e.g., image decoding, resizing, padding, rotation, color adjustments, noise addition). Our pipeline utilizes operators like fn.decoders.image, fn.resize, fn.rotate, fn.color_twist, fn.warp_affine, and fn.noise.gaussian, all readily available in DALI.
  2. Identify the Bottleneck: Perform the suggested test. Modify your existing training loop (without DALI) to load and preprocess one single batch of data, then repeatedly feed this same batch to the model for several training steps. Compare the training speed (e.g., iterations/second or time per step) in this fixed-data scenario to your normal training speed. If the fixed-data training is significantly faster, it strongly indicates that your data loading and preprocessing pipeline is limiting overall performance, and DALI is likely to provide a speedup.
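The fixed-batch test in step 2 can be sketched framework-agnostically. Here `load_batch` and `train_step` are hypothetical stand-ins for your real loader and model step; we simulate loading cost with a short sleep:

```python
import time

def measure_steps(get_batch, train_step, n_steps=30):
    """Time n_steps of training, fetching each batch via get_batch()."""
    start = time.perf_counter()
    for _ in range(n_steps):
        batch = get_batch()
        train_step(batch)
    return time.perf_counter() - start

def load_batch():
    # Stand-in for real decoding + augmentation: simulate ~5 ms of work.
    time.sleep(0.005)
    return [0.1] * 256

def train_step(batch):
    # Stand-in for a forward/backward pass.
    sum(x * x for x in batch)

# Normal training: load a fresh batch every step.
t_fresh = measure_steps(load_batch, train_step)

# Fixed-data test: load once, reuse the same batch every step.
cached = load_batch()
t_cached = measure_steps(lambda: cached, train_step)

# If t_fresh is much larger than t_cached, the data pipeline is the bottleneck.
print(f"fresh: {t_fresh:.3f}s  cached: {t_cached:.3f}s")
```

In a real experiment you would substitute your actual DataLoader and model; the decision rule is the ratio between the two timings, not their absolute values.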

How DALI Delivers Performance Gains:

If the above checks suggest DALI is suitable, here’s how it achieves acceleration:

  - Optimized native operators replace per-sample Python preprocessing, eliminating interpreter and GIL overhead.
  - Decoding and augmentation can be offloaded to the GPU, freeing CPU cores for I/O.
  - The pipeline executes asynchronously with prefetch queues, overlapping data preparation with model computation.

6. Practical Integration and Experimental Results

Integrating DALI into our PyTorch Lightning workflow involves a few key components, demonstrated in our codebase. We define DALI pipelines using the @pipeline_def decorator, specifying data loading, augmentation, and processing steps using nvidia.dali.fn operators.

For data loading from our custom format (images in a folder, labels in CSV), we utilize fn.external_source coupled with a Python callable (ExternalInputCallable) that reads image bytes and encodes labels.

To feed data into the PyTorch Lightning training loop, we wrap the DALI pipeline in a DALIGenericIterator, which handles batch collation and transfer to the GPU. This setup replaces the standard PyTorch DataLoader. For the specific implementation details, please refer to the DALI_OCRDataModule and associated classes within our project’s source code.
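A condensed sketch of this wiring is below. The class name `ExternalInput`, the file paths, and the operator parameters are illustrative, not our exact implementation; labels are assumed to be fixed-length (padded) int64 arrays so they batch cleanly:

```python
import numpy as np
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

class ExternalInput:
    """Per-sample callable for fn.external_source: reads raw image bytes
    and returns the encoded label for one sample."""
    def __init__(self, image_paths, encoded_labels):
        self.image_paths = image_paths
        self.encoded_labels = encoded_labels

    def __call__(self, sample_info):
        idx = sample_info.idx_in_epoch % len(self.image_paths)
        with open(self.image_paths[idx], "rb") as f:
            jpeg = np.frombuffer(f.read(), dtype=np.uint8)
        return jpeg, self.encoded_labels[idx]

@pipeline_def
def ocr_pipeline(source):
    jpegs, labels = fn.external_source(
        source=source, num_outputs=2, batch=False,
        dtype=[types.UINT8, types.INT64])
    # "mixed" decodes on the GPU via nvJPEG; switch to device="cpu"
    # if the GPU is the scarcer resource on your machine.
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, resize_x=128, resize_y=32)
    images = fn.crop_mirror_normalize(images, dtype=types.FLOAT,
                                      output_layout="CHW")
    return images, labels

# Illustrative inputs: one image file and one padded label array.
image_paths = ["data/img_000.jpg"]
encoded_labels = [np.array([5, 12, 3, 0, 0], dtype=np.int64)]

pipe = ocr_pipeline(source=ExternalInput(image_paths, encoded_labels),
                    batch_size=64, num_threads=4, device_id=0)
loader = DALIGenericIterator(pipe, output_map=["images", "labels"])

# Each element is a list (one dict per pipeline); epoch-size handling
# (StopIteration in the source, or size=/reader_name=) is omitted here.
batch = next(iter(loader))
images, targets = batch[0]["images"], batch[0]["labels"]
```

The iterator yields GPU-resident tensors keyed by `output_map`, which is what lets it slot in where a PyTorch DataLoader would normally sit in the Lightning loop.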

We tested our DALI implementation across different hardware setups against a baseline PyTorch DataLoader (No DALI).

(Note: Dataset size and specifics impact absolute times, but relative speedups are indicative.)

Case 1: High-End GPU with a Strong CPU (NVIDIA A6000)

Case 2: Cloud GPU (NVIDIA L4)

Case 3: Mid-Range GPU with Slow Storage (NVIDIA 3060 + HDD)

7. Analysis and Conclusion

Our experiments clearly demonstrate that:

  1. DALI significantly reduces OCR training time when the data pipeline is a bottleneck (up to ~25% speedup observed on L4).
  2. The optimal placement of DALI operations (CPU vs. GPU) is hardware-dependent. Tuning the device parameter for operators is crucial for maximizing performance.
  3. I/O is critical. On systems with slow storage, optimizing the dataset format and reading method (e.g., using WebDataset with DALI’s readers) is essential before DALI’s processing speedups can be fully realized.
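For the slow-storage case, a hedged sketch of point 3 using DALI's built-in WebDataset reader; the shard path and component extensions are illustrative placeholders:

```python
from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def webdataset_pipeline():
    # Reading sequentially from tar shards turns many small random reads
    # into large streaming reads, which is what a slow HDD needs.
    jpegs, labels = fn.readers.webdataset(
        paths=["shards/train-000.tar"],      # illustrative shard path
        ext=["jpg", "cls"],                  # components stored per sample
        missing_component_behavior="error")
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, resize_x=128, resize_y=32)
    return images, labels
```

Packing the dataset into shards has to happen before training, but once done, DALI reads and decodes directly from the tar files without touching millions of individual image files.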

By correctly identifying bottlenecks and leveraging DALI’s optimized kernels, parallelism, and hardware-adaptive execution, we can significantly accelerate OCR model training, enabling faster experimentation and development.