Using DALI to speed up data processing in OCR
For a detailed exploration of the code and methods, you can view this GitHub repo. You can see how we perform inference with our pretrained model in the inference notebook.
Training Deep Learning models for Optical Character Recognition (OCR) often involves complex data loading and augmentation pipelines. These preprocessing steps, if not optimized, can become a significant bottleneck, leaving expensive GPU resources underutilized and prolonging training times.
This document outlines our approach to leveraging the NVIDIA Data Loading Library (DALI) to accelerate the training process for our ResNet + BiLSTM + Attention OCR model built with PyTorch Lightning. We demonstrate substantial speedups compared to standard data loading methods and showcase the importance of hardware-aware pipeline configuration.
To create a uniform character set for the OCR model, the following text normalizations are applied to the labels:
Original Character(s) | Description | Normalized Character |
---|---|---|
“ , ” | Smart Quotes | " |
’ | Typographical Apostrophe | ' |
– , — , − | Various Dashes | - |
… | Ellipsis | ... |
Ð | Icelandic Eth (Uppercase) | Đ |
ð | Icelandic Eth (Lowercase) | đ |
Ö , Ō | O with accents | O |
Ü , Ū | U with accents | U |
Ā | A with macron | A |
ö , ō | o with accents | o |
ü , ū | u with accents | u |
ā | a with macron | a |
This normalization simplifies the vocabulary the model needs to learn.
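As a small sketch, the whole table can be applied with a translation map like the one below (an illustrative helper, not code from the repo):

```python
# Mapping derived from the normalization table above.
CHAR_MAP = str.maketrans({
    "“": '"', "”": '"',             # smart quotes -> straight quote
    "’": "'",                        # typographical apostrophe -> straight apostrophe
    "–": "-", "—": "-", "−": "-",    # en dash, em dash, minus sign -> hyphen
    "…": "...",                      # ellipsis -> three dots
    "Ð": "Đ", "ð": "đ",             # Icelandic eth -> D/d with stroke
    "Ö": "O", "Ō": "O", "Ü": "U", "Ū": "U", "Ā": "A",
    "ö": "o", "ō": "o", "ü": "u", "ū": "u", "ā": "a",
})

def normalize_label(text: str) -> str:
    """Apply the character normalizations from the table above."""
    return text.translate(CHAR_MAP)

print(normalize_label("“Độc–lập”…"))  # -> "Độc-lập"...
```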
The training leverages a combined dataset from the following sources:
Dataset | Train Samples | Validation Samples | Notes |
---|---|---|---|
vietocr | 441,025 | 110,257 | Random word images removed |
Paper (Deep Text Rec. Benchmark) | 3,287,346 | 6,992 | |
Synth90k | 7,224,612 | 802,731 | |
Cinnamon AI (Handwritten) | 1,470 | 368 | |
Combined Total | ~11.0 M | ~0.9 M |
Vietnamese Data: Please note that Vietnamese samples constitute only 1.76% (209,120 images) of this combined dataset, from VietOCR (207,282) and Cinnamon AI (1,838). This reflects the limited availability of public Vietnamese OCR data.
Due to the significant imbalance in the dataset, with English samples heavily outweighing Vietnamese ones, a direct training approach proved challenging, and training exclusively on the VietOCR dataset resulted in instability. Therefore, we adopted a two-stage training strategy: first pretraining on the full combined dataset, then fine-tuning on the Vietnamese (VietOCR) data.
The model we release is the one obtained after this fine-tuning process on the VietOCR data. You can see an example of how we perform inference with this model in our inference notebook.
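In outline, the two stages map onto two Lightning `Trainer.fit` calls, as in the sketch below; the class names, module path, and epoch counts are placeholders rather than the repo's actual configuration:

```python
import pytorch_lightning as pl

# OCRModel, CombinedDataModule and VietOCRDataModule are placeholder names for
# illustration; they are not the actual classes in the repository.
from ocr_project import OCRModel, CombinedDataModule, VietOCRDataModule

model = OCRModel()

# Stage 1: pretrain on the full combined dataset (dominated by English samples).
pl.Trainer(max_epochs=10, accelerator="gpu", devices=1).fit(
    model, datamodule=CombinedDataModule())

# Stage 2: fine-tune the pretrained weights on the Vietnamese (VietOCR) data.
pl.Trainer(max_epochs=5, accelerator="gpu", devices=1).fit(
    model, datamodule=VietOCRDataModule())
```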
GPU utilization during training. Left (PyTorch DataLoader): the GPU frequently idles or is underutilized, indicating data bottlenecks. Right (NVIDIA DALI): the GPU maintains consistently high utilization. DALI keeps the L4 GPU working hard, reducing wasted cycles and speeding up training.
DALI is designed specifically to address data pipeline bottlenecks in deep learning workloads. Its key advantages are described below.
Before integrating DALI, it’s crucial to determine if it’s the right tool for your specific bottleneck. As the DALI documentation suggests:
“Q: How do I know if DALI can help me? A: You need to check our docs first and see if DALI operators cover your use case. Then, try to run a couple of iterations of your training with a fixed data source - generating the batch once and reusing it over the test run to see if you can train faster without any data processing. If so, then the data processing is a bottleneck, and in that case, DALI may help.”
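A minimal version of that test caches a single preprocessed batch and reuses it for every step; if the run gets dramatically faster, data processing is the bottleneck. The helper below is illustrative, not code from our repo:

```python
import time
import torch

def time_epoch(model, optimizer, criterion, batches, device="cuda"):
    """Time one training pass over `batches`, where each batch is (images, targets)."""
    model.train()
    start = time.time()
    for images, targets in batches:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    return time.time() - start

# Real pipeline: every batch goes through the full loading/augmentation path.
# real_time = time_epoch(model, optimizer, criterion, train_loader)

# Fixed data source: generate one batch once and reuse it for the whole run.
# fixed_batch = next(iter(train_loader))
# fixed_time = time_epoch(model, optimizer, criterion, [fixed_batch] * len(train_loader))

# If fixed_time is much smaller than real_time, data processing is the bottleneck.
```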
Following this guidance, we first checked operator coverage: our augmentation pipeline maps directly onto `fn.decoders.image`, `fn.resize`, `fn.rotate`, `fn.color_twist`, `fn.warp_affine`, and `fn.noise.gaussian`, all readily available in DALI.

How DALI Delivers Performance Gains:
If the above checks suggest DALI is suitable, here’s how it achieves acceleration:

- Optimized native kernels: DALI replaces Python-based preprocessing (such as the transforms inside a PyTorch `Dataset`) with highly optimized C++ and CUDA kernels. This drastically reduces the overhead associated with Python execution for common data manipulation tasks.
- Flexible operator placement: most operators can run on either the CPU (`device='cpu'`) or the GPU (`device='gpu'`). This is critical for optimization. As our Case Studies show, on some setups keeping the work on the CPU (`device='cpu'`) might be faster, preventing contention on the main training GPU, while on others moving it to the GPU (`device='gpu'`) frees up the CPU and leads to better overall throughput. This flexibility allows tuning the pipeline for optimal performance on your specific hardware.
- Optimized readers for packed formats: while we use `fn.external_source` for flexibility with individual files, DALI also offers highly optimized readers for various packed dataset formats (TFRecord, RecordIO, Caffe LMDB, WebDataset). If your bottleneck test reveals slow performance even with minimal augmentations, or if you are dealing with slow storage (like the HDD in Case 3), switching to one of these formats and using the corresponding DALI reader (`fn.readers.*`) can dramatically improve data ingestion speed before the processing steps.

Integrating DALI into our PyTorch Lightning workflow involves a few key components, demonstrated in our codebase. We define DALI pipelines using the `@pipeline_def` decorator, specifying data loading, augmentation, and processing steps using `nvidia.dali.fn` operators.
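As a toy illustration (not our actual pipeline), the sketch below shows the shape of a `@pipeline_def` function and how the CPU/GPU placement discussed above is expressed; it uses DALI’s standard file reader purely for brevity, whereas our pipeline uses `fn.external_source` as described next:

```python
from nvidia.dali import pipeline_def, fn

@pipeline_def
def placement_demo(image_dir):
    # Standard file reader used here only to keep the example short.
    jpegs, _ = fn.readers.file(file_root=image_dir)
    images = fn.decoders.image(jpegs, device="cpu")                    # decode on the CPU
    resized_cpu = fn.resize(images, resize_x=128, resize_y=32)         # stays on the CPU
    resized_gpu = fn.resize(images.gpu(), resize_x=128, resize_y=32)   # same operator, GPU execution
    return resized_cpu, resized_gpu

# Free parameters such as batch size, CPU threads and the target GPU are fixed at instantiation:
# pipe = placement_demo(image_dir="/data/images", batch_size=16, num_threads=4, device_id=0)
```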
For data loading from our custom format (images in a folder, labels in CSV), we utilize `fn.external_source` coupled with a Python callable (`ExternalInputCallable`) that reads image bytes and encodes labels.
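A stripped-down version of this idea could look like the sketch below; the real `ExternalInputCallable` differs, and the CSV layout, label padding, and class name here are assumptions made for illustration:

```python
import csv
import numpy as np
from nvidia.dali import pipeline_def, fn


class SimpleOCRSource:
    """Per-sample callable for fn.external_source: yields (encoded image bytes, encoded label)."""

    def __init__(self, image_dir, label_csv, char_to_idx, max_label_len=32):
        # Assumed CSV layout: one "filename,text" row per sample.
        with open(label_csv, newline="", encoding="utf-8") as f:
            self.samples = [(row[0], row[1]) for row in csv.reader(f)]
        self.image_dir = image_dir
        self.char_to_idx = char_to_idx
        self.max_label_len = max_label_len

    def __call__(self, sample_info):
        if sample_info.idx_in_epoch >= len(self.samples):
            raise StopIteration  # tells DALI the epoch is over
        filename, text = self.samples[sample_info.idx_in_epoch]
        with open(f"{self.image_dir}/{filename}", "rb") as f:
            image_bytes = np.frombuffer(f.read(), dtype=np.uint8)
        # Encode the text and pad to a fixed length so all labels share one shape.
        label = [self.char_to_idx.get(c, 0) for c in text][: self.max_label_len]
        label += [0] * (self.max_label_len - len(label))
        return image_bytes, np.array(label, dtype=np.int64)


@pipeline_def
def ocr_pipeline(source):
    jpegs, labels = fn.external_source(source=source, num_outputs=2, batch=False)
    images = fn.decoders.image(jpegs, device="mixed")  # resize/augment steps would follow here
    return images, labels
```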
To feed data into the PyTorch Lightning training loop, we wrap the DALI pipeline using `DALIGenericIterator`, which handles batch collation and transfer to the GPU. This setup replaces the standard PyTorch `DataLoader`. For those interested in the specific implementation details, please refer to the `DALI_OCRDataModule` and associated classes within our project’s source code.
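In skeleton form, the wrapping can look like this (a sketch, not the actual `DALI_OCRDataModule`; it reuses the hypothetical `ocr_pipeline` and source callable from the previous snippet):

```python
import pytorch_lightning as pl
from nvidia.dali.plugin.pytorch import DALIGenericIterator


class DALIOCRDataModuleSketch(pl.LightningDataModule):
    def __init__(self, source, batch_size=64, num_threads=4, device_id=0):
        super().__init__()
        self.source = source            # e.g. the SimpleOCRSource sketched above
        self.batch_size = batch_size
        self.num_threads = num_threads
        self.device_id = device_id

    def train_dataloader(self):
        # ocr_pipeline is the @pipeline_def function from the previous sketch.
        pipe = ocr_pipeline(source=self.source, batch_size=self.batch_size,
                            num_threads=self.num_threads, device_id=self.device_id)
        pipe.build()
        # output_map names the pipeline outputs; auto_reset restarts the iterator each epoch.
        return DALIGenericIterator(pipe, output_map=["images", "labels"], auto_reset=True)
```

With this setup, each batch arrives in `training_step` as a list with one dict per pipeline, so the tensors are read as `batch[0]["images"]` and `batch[0]["labels"]` rather than unpacked from a tuple.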
We tested our DALI implementation across different hardware setups against a baseline PyTorch DataLoader (`No DALI`).
(Note: Dataset size and specifics impact absolute times, but relative speedups are indicative.)
For the HDD-bound setup (Case 3), we would recommend switching to a packed dataset format and its dedicated reader (e.g., `fn.readers.webdataset`) to address the data loading bottleneck first.
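If we went that route, a WebDataset-based pipeline might look roughly like the sketch below; the shard paths, index files, and component extensions ("jpg"/"cls") are assumptions, not our actual data layout:

```python
from nvidia.dali import pipeline_def, fn, types

@pipeline_def
def webdataset_pipeline(shard_paths, index_paths):
    # Each sample in the tar shards is assumed to hold a "jpg" image and a "cls" label component.
    jpegs, labels = fn.readers.webdataset(paths=shard_paths, index_paths=index_paths,
                                          ext=["jpg", "cls"], random_shuffle=True,
                                          missing_component_behavior="error")
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    images = fn.resize(images, resize_x=128, resize_y=32)
    return images, labels

# pipe = webdataset_pipeline(shard_paths=["train-0000.tar"], index_paths=["train-0000.idx"],
#                            batch_size=64, num_threads=4, device_id=0)
```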
Our experiments clearly demonstrate that:

- DALI provides substantial speedups over the standard PyTorch DataLoader across the tested hardware setups.
- Choosing the right `device` parameter for operators is crucial for maximizing performance.

By correctly identifying bottlenecks and leveraging DALI’s optimized kernels, parallelism, and hardware-adaptive execution, we can significantly accelerate OCR model training, enabling faster experimentation and development.