[Technical Article] OCR Upgrade
A few months ago, Deepin OCR was released, offering a text recognition function.
Although it met basic text recognition needs, this Tesseract-based solution had two obvious shortcomings. First, recognition was not very accurate, especially on images mixing complex graphics and text: spurious whitespace appeared between Chinese characters. Second, recognition was relatively slow, mainly on long pages and large images.
Therefore, even as the previous version of Deepin OCR was being released, an alternative OCR solution was already under investigation. After repeated research, adaptation, and verification, and considering support for the hardware architectures of all domestic platforms, we finally chose PaddleOCR-NCNN as the backend of Deepin OCR.
1. Performance Improvement
First of all, only the backend model of Deepin OCR has been replaced; no changes have been made to the front-end interaction, so the user experience stays consistent.
Testing text recognition on the same material before and after the replacement, we find the new backend is about five times faster, more accurate, and far better at complex scenes (such as low-contrast images and images with cluttered backgrounds).
Here is an extreme comparison:
Secondly, the size of the new model's deb package, and its size after deployment, have also been significantly optimized:
a. Before replacement: the deb package exceeded 50 MB, and the model alone occupied as much as 64 MB after deployment;
b. After replacement: the deb package is only 18 MB, and the total size after deployment is only about 30 MB.
In short, the replacement makes Deepin OCR smaller and stronger!
2. Technical Implementation
As mentioned above, this Deepin OCR upgrade mainly replaces the backend with PaddleOCR-NCNN. If that name means nothing to you yet, don't panic: we will introduce it step by step.
2.1 Structure of Deepin OCR
First, let's look at the structure of Deepin OCR:
As shown in the figure, the front-end part is very simple: a DBus interface provided for applications, plus a graphical interface used to display the recognition results. This time we keep the front end unchanged and modify only the backend algorithm module responsible for text recognition. Next, we will dig into that backend module to see exactly what has been upgraded in Deepin OCR.
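As a rough illustration, an application might invoke such a DBus interface with Qt roughly like this (the service, path, and method names below are hypothetical placeholders, not the actual Deepin OCR interface):

```cpp
// Minimal sketch of calling an OCR service over DBus with Qt.
// com.example.Ocr and recognizeImage are hypothetical names.
#include <QDBusInterface>
#include <QDBusReply>
#include <QString>

QString recognizeViaDBus(const QString &imagePath)
{
    QDBusInterface ocr("com.example.Ocr",   // hypothetical service name
                       "/com/example/Ocr",  // hypothetical object path
                       "com.example.Ocr");  // hypothetical interface name
    QDBusReply<QString> reply = ocr.call("recognizeImage", imagePath);
    return reply.isValid() ? reply.value() : QString();
}
```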
2.2 Introduction to PaddleOCR
In this part, we introduce PaddleOCR, the deep learning project that replaces the previous backend. Its core algorithm is officially named PP-OCR; we use the v2 version.
Project address: github.com/PaddlePaddle/PaddleOCR
Paper reference: arxiv.org/pdf/2109.03144.pdf
First, we need to know the basic workflow of OCR (sketched in code right after this list):
a. Detect text and output the positions of the text blocks;
b. Crop the picture according to those positions;
c. Perform text recognition on each cropped image and output the recognition result.
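In code form, the whole workflow is just these three stages chained together. Below is a schematic sketch; detectText, crop, and recognize are hypothetical names standing in for the real detection and recognition modules:

```cpp
// Schematic OCR pipeline: detect -> crop -> recognize.
// All types and functions here are hypothetical placeholders.
#include <string>
#include <vector>

struct Image { /* pixel data */ };
struct Box { int x, y, width, height; };  // position of one text block

std::vector<Box> detectText(const Image &img);  // stage a (hypothetical)
Image crop(const Image &img, const Box &box);   // stage b (hypothetical)
std::string recognize(const Image &textLine);   // stage c (hypothetical)

std::vector<std::string> runOcr(const Image &img)
{
    std::vector<std::string> lines;
    for (const Box &box : detectText(img))           // a. detect text blocks
        lines.push_back(recognize(crop(img, box)));  // b. crop, c. recognize
    return lines;
}
```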
For the text detection step, PP-OCR adopts DBNet (DB stands for Differentiable Binarization; paper reference: arxiv.org/pdf/1911.08947.pdf).
In brief: a backbone extracts the basic features of the picture; the features extracted at each backbone stage are cascaded into the final feature maps; from these features the network predicts both a probability map and a threshold map; the DB function then fuses the two maps into an approximate binary map, from which usable text boxes are generated.
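For reference, the differentiable binarization step from the DBNet paper fuses the probability map $P$ and the threshold map $T$ into an approximate binary map:

$$\hat{B}_{i,j} = \frac{1}{1 + e^{-k\,(P_{i,j} - T_{i,j})}}$$

where $k$ is an amplifying factor (the paper sets $k = 50$). Because this soft binarization is differentiable, unlike a hard threshold, it can be trained end to end with the rest of the network.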
In the lightweight PP-OCR model, the backbone of the DB algorithm is MobileNetV3 Large x0.5, whose design effectively improves detection speed on the CPU. The input matrix of the backbone in the training phase has size 1 × 3 × 960 × 960, which is basically sufficient for common pictures.
Once text detection has output the text boxes, the text recognition step begins. It receives the text boxes from the detection step and recognizes the image inside each box separately. The text recognition algorithm of PP-OCR v2 adopts a relatively traditional strategy based on the RNN (Recurrent Neural Network):
First, a backbone extracts the basic features of the picture; then LSTM (Long Short-Term Memory), an RNN variant, further processes those features; finally, a Softmax over the processed feature matrix yields a probability distribution, which the CTC (Connectionist Temporal Classification) algorithm decodes into the final recognition result.
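To make the last step concrete, here is a minimal sketch of greedy CTC decoding, the simplest decoding strategy: take the most probable class at each time step, collapse consecutive repeats, and drop the blank symbol. This is an illustrative implementation, not the code used in Deepin OCR:

```cpp
// Greedy CTC decoding over a T x C matrix of per-step class probabilities.
// Convention assumed here: class index 0 is the CTC "blank" symbol, and
// charset[i] is the character for class index i + 1.
#include <string>
#include <vector>

std::string ctcGreedyDecode(const std::vector<std::vector<float>> &probs,
                            const std::vector<std::string> &charset)
{
    std::string result;
    int prev = 0;  // previously emitted class, 0 = blank
    for (const std::vector<float> &step : probs) {
        // argmax over the classes at this time step
        int best = 0;
        for (int c = 1; c < static_cast<int>(step.size()); ++c)
            if (step[c] > step[best])
                best = c;
        // emit only if not blank and not a repeat of the previous class
        if (best != 0 && best != prev)
            result += charset[best - 1];
        prev = best;
    }
    return result;
}
```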
In the lightweight PP-OCR model, the backbone of the recognition algorithm comes in two types:
Type 1: the v2.0 version, using MobileNetV3 Small x0.5, which produces a small model file and runs very fast on the CPU;
Type 2: the v2.1 version, using MobileNetV1 Enhance. The resulting model file is slightly larger and recognition is slightly slower, but accuracy is higher. This version is applied only to the Chinese and English models; the other official inference models come only in the v2.0 structure. In addition, since the recognition network processes cropped text lines rather than whole pictures, its input matrix is relatively small: 1 × 3 × 32 × W, where the width W is variable (written as -1).
That covers the main algorithm, PP-OCR. Next, we introduce our improvements over the official demo.
2.3 Our improvements compared to the official demo
In the official C++ deployment demo of PaddleOCR, multithreaded computing is applied only within a single text recognition. For RNN networks such as LSTM, however, this cannot effectively improve overall computing efficiency, because the RNN computes as follows:
[Figure: RNN computation sequence. The (n-1)th computation happens at time t-1, the nth at time t, and the (n+1)th at time t+1, strictly in order.]
As the figure shows, RNN is a very unfriendly algorithm for multithreaded computing: the calculation for time t cannot start until the calculation for time t-1 has completed, and likewise the calculation for time t+1 must wait for time t. Therefore, a single recognition should use as few threads as possible.
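This is a direct consequence of the recurrence that defines an RNN: each hidden state depends on the previous one,

$$h_t = f(W x_t + U h_{t-1} + b),$$

so $h_t$ simply cannot be computed until $h_{t-1}$ is available, no matter how many threads are waiting.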
So, compared with the official demo, we made one change during deployment: instead of using multiple threads inside a single recognition, we run multiple recognition tasks at the same time, improving the overall speed of the recognition process as much as possible.
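A minimal sketch of this idea follows, assuming a hypothetical single-threaded recognizeBox helper: each cropped text box becomes its own task, so the parallelism lives across boxes rather than inside the LSTM:

```cpp
// Task-level parallelism across text boxes. Image and recognizeBox are
// hypothetical placeholders for the real types and the single-threaded
// recognition routine; this is an illustration, not the Deepin OCR code.
#include <future>
#include <string>
#include <vector>

struct Image { /* pixel data of one cropped text box */ };

std::string recognizeBox(const Image &box);  // single-threaded recognition

std::vector<std::string> recognizeAll(const std::vector<Image> &boxes)
{
    std::vector<std::future<std::string>> tasks;
    tasks.reserve(boxes.size());
    for (const Image &box : boxes)
        tasks.push_back(std::async(std::launch::async,
                                   [&box] { return recognizeBox(box); }));

    std::vector<std::string> results;
    results.reserve(tasks.size());
    for (std::future<std::string> &task : tasks)
        results.push_back(task.get());  // keeps results in text-box order
    return results;
}
```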
2.4 From PaddlePaddle to NCNN
If the algorithm introduction above left you wanting more, stay seated: we will now explain how we deployed PaddleOCR.
First of all, it is necessary to clarify the relationship between the frameworks:
As is well known, a deep learning project is divided into two major phases, training and deployment:
In the training phase, a training framework, which aims to provide rich functionality, is used to build and train models. Well-known training frameworks from abroad include TensorFlow, PyTorch, MXNet, DarkNet, Caffe, etc., while training frameworks in China mainly include PaddlePaddle, MegEngine, etc.;
In the deployment phase, an inference framework, which aims to provide high performance, is used for the final implementation. Generally, a training framework ships with its own inference framework, such as Paddle Inference under PaddlePaddle. However, these built-in inference frameworks are often unsatisfactory in size and efficiency, so a number of small, highly efficient inference frameworks have emerged. Common lightweight inference frameworks in China include NCNN, MNN, Paddle Lite, etc.; in this area, China is at the forefront of the world.
The models of the main algorithm in this upgrade, PaddleOCR, are trained with the PaddlePaddle training framework, and official inference models based on Paddle Inference and Paddle Lite are also provided. We ultimately used conversion tools to turn the Paddle Inference models into NCNN models and completed the final deployment with NCNN.
We chose NCNN, a leading open-source inference framework in China and worldwide, for its almost zero-cost cross-platform capability, high inference speed, and minimal deployment footprint. (Project address: github.com/Tencent/ncnn)
Paddle Inference, by contrast, strongly depends on Intel's MKL library under the x86 architecture, a closed-source software library covered by U.S. patents, so we ultimately decided against it.
PaddleOCR is a subproject of PaddlePaddle, and officially only the original PaddlePaddle models, Paddle Inference models, and Paddle Lite models are provided. At present, there are two paths for converting from PaddlePaddle to NCNN:
PaddlePaddle -> Paddle Inference -> ONNX -> NCNN
PaddlePaddle -> Paddle Inference -> PyTorch -> TorchScript -> NCNN
Path №1 is the route officially recommended by PaddlePaddle: first convert the models to ONNX with the officially maintained Paddle2ONNX tool; then merge and optimize the operators of the resulting models with the ONNX Simplifier tool provided by the ONNX community; and finally convert the models to NCNN with the ONNX2NCNN tool maintained by the NCNN community.
Path №2 uses the tools provided by github.com/frotms/PaddleOCR2Pytorch to convert the Paddle Inference models into PyTorch PTH files, then PyTorch's own tooling to convert the PTH files into TorchScript files in Trace mode, and finally PNNX from the NCNN toolbox to export PNNX- and NCNN-format files, of which the NCNN-format model files are taken for deployment.
Here we use path №1.
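At the end of that pipeline we get a .param file (network structure) and a .bin file (weights) for each model. Loading them with NCNN's C++ API looks roughly like this (the file names are placeholders for the converted detection model):

```cpp
// Minimal sketch of loading a converted model with NCNN's C++ API.
// "det.param" / "det.bin" are placeholder file names.
#include "net.h"  // ncnn

int main()
{
    ncnn::Net det;
    det.opt.num_threads = 1;  // one thread per task, as explained in 2.3

    if (det.load_param("det.param") != 0)  // network structure
        return -1;
    if (det.load_model("det.bin") != 0)    // weights
        return -1;

    // ... create an ncnn::Extractor from `det`, feed it an input blob,
    // and read back the detection output ...
    return 0;
}
```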
The above covers the main process of our deployment, but NCNN is not the only thing we use to keep it lightweight.
2.5 Trimming OpenCV
As a well-known image processing library, OpenCV plays an important auxiliary role in the PaddleOCR algorithm pipeline. However, the original OpenCV library is far too large: shipping it wholesale would certainly displease community enthusiasts. After some investigation, we found OpenCV-Mobile, a project that conveniently trims it down. (Project address: github.com/nihui/opencv-mobile)
The project minimizes the size of OpenCV's main library by such means as disabling RTTI, replacing the image read/write code with the lightweight stb_image library, removing the zlib component, and disabling almost every optional compilation flag. Even so, it satisfies the functional and performance needs of most deep learning projects.
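The nice part is that, to calling code, nothing changes: opencv-mobile keeps the familiar OpenCV interface, with image decoding backed by stb_image underneath. A minimal sketch:

```cpp
// Reading an image with opencv-mobile: the usual cv::imread call, now
// implemented on top of stb_image instead of OpenCV's full imgcodecs.
#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>

int main()
{
    cv::Mat img = cv::imread("input.png", cv::IMREAD_COLOR);
    if (img.empty())
        return -1;  // decoding failed or file not found
    // ... hand `img` to the detection / recognition pipeline ...
    return 0;
}
```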
2.6 Handling of privacy issues
After the technical upgrade above, our OCR software has improved greatly in speed and accuracy. It nevertheless remains an offline recognition scheme: all recognition work is still completed locally, fully safeguarding users' data and privacy.
Ending
That is the full technical introduction to the new Deepin OCR. You are welcome to upgrade and try it, and to discuss any questions or suggestions with us!
Project address of Deepin OCR: github.com/linuxdeepin/deepin-ocr
The open-source projects involved in this upgrade are listed below, in no particular order; every one of them matters:
PaddleOCR: github.com/PaddlePaddle/PaddleOCR
Paddle2ONNX: github.com/PaddlePaddle/paddle2onnx
OpenCV: opencv.org
OpenCV-Mobile: github.com/nihui/opencv-mobile
ONNX Simplifier: github.com/daquexian/onnx-simplifier
Clipper: angusj.com/delphi/clipper.php
A million thanks to everyone in the open-source community!