Development of OCR mobile application including miss-recognized proofreading system using database search algorithm

In this paper, we introduce a misrecognition proofreading algorithm for optical character recognition (OCR) based on a database search method, implement the algorithm as an Android application, and verify its operation. We expect the proposed algorithm to be robust against misrecognition because it searches the database using only part of the character recognition result. The proposed algorithm randomly selects several characters from the OCR result, searches a database prepared in advance several times using the selected characters, and outputs the character string retrieved most often as the proofreading result. We also implement this algorithm as an Android OCR application consisting of five components: (1) image input, (2) pre-processing (cropping, denoising, and binarization), (3) character recognition (using the Tesseract OCR software developed by Google and Hewlett-Packard), (4) misrecognition correction (using the proposed algorithm), and (5) text output. We report the recognition accuracy with and without the proposed algorithm. Although it depends on the recognition accuracy of the OCR itself, our results show that the recognition rate is improved by the proposed method.


Introduction
In recent years, with advances in internet technology, logistics systems have changed considerably and international trade has spread widely. As a result, information printed on paper needs to be converted to digital data on the receiver side.
Therefore, the use of character recognition in industrial applications is increasing. There are two usual ways to convert printed characters to character codes: entering them into a computer manually, or using a technology called optical character recognition (OCR). In some cases OCR software does an excellent job, but problems and challenges can occur as well. For example, OCR on images produced by scanners usually gives high accuracy and excellent performance. However, acquiring images with a scanner requires paying the cost of introducing such a device. In addition, in the logistics process it is usually difficult to read a delivery slip attached to a product or package with a non-portable scanner. For this reason, there is high demand for a character recognition system implemented as a smartphone application. However, images produced by smartphone cameras are usually not as good an input for OCR as scanned images, due to environmental or camera-related factors (1): (1) character settings (fonts and font sizes), (2) print conditions (blurring, resolution), and (3) recognition conditions (lighting, rotation). Currently, no existing OCR software achieves 100% recognition. Misrecognition is therefore an unavoidable problem when OCR software is used for a character recognition task. For this reason, reducing OCR misrecognition in such environments is an active research area.
There are two approaches to reducing misrecognition. One is the prevention of misrecognition, and the other is the proofreading of misrecognition. The former reduces misrecognition by retraining the OCR on the characters it could not recognize correctly. This approach provides some improvement in recognition accuracy, but someone has to verify what kinds of misrecognition occur, and it requires knowledge of the OCR software as well. The latter, on the other hand, outputs the most probable character string based on the recognized characters and their arrangement, even if the recognition result includes misrecognized characters. Although this method requires the user to register the correct character strings, the user does not need to verify the misrecognition. This approach can improve the recognition accuracy if a good proofreading algorithm is implemented.
Therefore, in this research, we propose a verification algorithm using a database, implement an OCR system as an Android application, and verify its recognition accuracy.

Application Structure
Figure 1 shows the process flow of the application we created in this research. The application consists of five parts.

(c) OCR
We use the OCR software Tesseract. Tesseract (2) is open-source software for character recognition developed by Google and Hewlett-Packard. In this process, we obtain the recognition results.

(d) Proofreading
With the algorithm shown in 2.3, the recognition results are corrected. This algorithm uses a pre-created database for proofreading.

(e) Output
Finally, we obtain the output in text format.

Pre-Processing
To eliminate unnecessary information and noise from an input image, we need to convert the image to a suitable format. Pre-processing reduces redundant data and noise, which makes the image easier to process effectively in the OCR phase. In this research, we convert an input image using cropping, non-local means denoising (3), and adaptive Gaussian thresholding (4).
Cropping is one of the most basic image processing operations: it removes unwanted outer areas from an input image. After this process, we obtain an image that contains only the area to be recognized. Figure 2 shows a screenshot of our application while cropping an image. It crops the area enclosed by the pink rectangle.

Fig. 2. Cropping process
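As a minimal sketch of the cropping step, assuming the image is held as a list of pixel rows and the rectangle is given by its top-left corner and size (the function name and parameters are hypothetical, not taken from the application), cropping reduces to slicing rows and columns:

```python
def crop(img, top, left, height, width):
    """Return the sub-image inside the given rectangle.

    `img` is a list of rows; each row is a list of pixel values.
    The rectangle parameters are illustrative assumptions.
    """
    return [row[left:left + width] for row in img[top:top + height]]


# A 4x5 test image whose pixel value encodes its (row, column) position.
image = [[r * 10 + c for c in range(5)] for r in range(4)]
region = crop(image, top=1, left=2, height=2, width=3)
print(region)
```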
To remove the noise contained in the input image, we use a technique called non-local means denoising. The method is based on a simple principle: replacing the color of a pixel with an average of the colors of similar pixels. However, similar pixels are not necessarily close to each other. Therefore, a large portion of the image must be scanned in search of all pixels that resemble the pixel to be denoised. Denoising is then done by computing the average color of these most similar pixels. The similarity is evaluated by comparing a whole window around each pixel, not just the color of the pixel itself.
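The principle can be sketched in pure Python for a small grayscale image. This is an illustration only, not the implementation used in the application: the function name, the window sizes, the filtering strength `h`, and the exponential weighting are all assumptions chosen for the example.

```python
import math


def nlm_denoise(img, patch_radius=1, search_radius=2, h=10.0):
    """Non-local means on a 2-D list of grayscale values (illustrative sketch)."""
    rows, cols = len(img), len(img[0])

    def pixel(r, c):
        # Clamp coordinates so windows near the border stay inside the image.
        return img[min(max(r, 0), rows - 1)][min(max(c, 0), cols - 1)]

    def patch_distance(r1, c1, r2, c2):
        # Mean squared difference between the windows around two pixels.
        total, count = 0.0, 0
        for dr in range(-patch_radius, patch_radius + 1):
            for dc in range(-patch_radius, patch_radius + 1):
                diff = pixel(r1 + dr, c1 + dc) - pixel(r2 + dr, c2 + dc)
                total += diff * diff
                count += 1
        return total / count

    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            weight_sum = 0.0
            value_sum = 0.0
            # Compare the pixel with every pixel inside the search window.
            for rr in range(max(r - search_radius, 0), min(r + search_radius, rows - 1) + 1):
                for cc in range(max(c - search_radius, 0), min(c + search_radius, cols - 1) + 1):
                    # Similar windows get a weight close to 1, dissimilar ones close to 0.
                    w = math.exp(-patch_distance(r, c, rr, cc) / (h * h))
                    weight_sum += w
                    value_sum += w * img[rr][cc]
            out[r][c] = value_sum / weight_sum
    return out
```

A larger `search_radius` corresponds to scanning a larger portion of the image for similar pixels, at a higher computational cost.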
After denoising, we binarize the input image. Creating a binarized image from a grayscale image requires threshold processing. However, binarization with a single threshold can be suboptimal, particularly if the page background is of uneven darkness. Thus, we use adaptive Gaussian thresholding, in which the threshold for each target pixel is the weighted average of the pixel values in its neighborhood. A Gaussian function is used for the weighting.
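The following pure-Python sketch illustrates this thresholding rule on a small grayscale image. The function name, neighborhood radius, `sigma`, and the small constant `offset` subtracted from the local average are assumptions made for the example; the paper does not state the parameters used in the application.

```python
import math


def adaptive_gaussian_threshold(img, radius=2, sigma=1.0, offset=2.0):
    """Binarize a 2-D list of grayscale values with a per-pixel threshold."""
    rows, cols = len(img), len(img[0])
    # Precompute Gaussian weights for the (2 * radius + 1)^2 neighborhood.
    kernel = {
        (dr, dc): math.exp(-(dr * dr + dc * dc) / (2.0 * sigma * sigma))
        for dr in range(-radius, radius + 1)
        for dc in range(-radius, radius + 1)
    }
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            weight_sum = 0.0
            value_sum = 0.0
            for (dr, dc), w in kernel.items():
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:  # skip pixels outside the image
                    weight_sum += w
                    value_sum += w * img[rr][cc]
            # Threshold = Gaussian-weighted local average minus a small constant.
            threshold = value_sum / weight_sum - offset
            out[r][c] = 255 if img[r][c] > threshold else 0
    return out
```

Because the threshold follows the local average, a dark stroke on a bright background and the same stroke on a darker background are both separated from their surroundings, which a single global threshold cannot guarantee.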

Database Search Algorithm
To proofread recognition results that include misrecognition, we propose a database search algorithm. Figure 4 shows the process flow of the proposed algorithm. The process flow consists of three parts. First, it randomly extracts several characters from the recognition result. At this time, the algorithm stores not only the characters but also their arrangement as features of the string. Second, it searches the database for strings that contain the extracted characters in order. It also counts how many times each string is retrieved. The proposed algorithm repeats these steps several times and finally outputs the most frequently retrieved string. Figure 5 shows a specific example of the correction process.
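The steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the database is a plain list of strings, "arrangement" is interpreted as the relative order of the sampled characters, and the names `contains_in_order`, `proofread`, `sample_size`, and `trials` are hypothetical.

```python
import random
from collections import Counter


def contains_in_order(candidate, chars):
    """True if `chars` occur in `candidate` in the same order (gaps allowed)."""
    it = iter(candidate)
    return all(ch in it for ch in chars)


def proofread(ocr_result, database, sample_size=3, trials=50, rng=None):
    """Return the database string retrieved most often by random partial searches."""
    rng = rng or random.Random()
    votes = Counter()
    k = min(sample_size, len(ocr_result))
    for _ in range(trials):
        # Sample positions, then sort them so the characters keep their order.
        positions = sorted(rng.sample(range(len(ocr_result)), k))
        chars = [ocr_result[i] for i in positions]
        # Every database string that contains the sample in order gets one vote.
        for entry in database:
            if contains_in_order(entry, chars):
                votes[entry] += 1
    if not votes:
        return ocr_result  # nothing matched; fall back to the raw OCR output
    return votes.most_common(1)[0][0]


database = ["INDICATOR", "INDUCTOR", "RADIATOR", "CAPACITOR"]  # hypothetical entries
print(proofread("1NDICAT0R", database, trials=100, rng=random.Random(0)))
```

Samples that happen to include a misrecognized character ("1" or "0" here) match nothing and simply contribute no votes, which is why the search tolerates a few wrong characters.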

Fig. 5. An example of proofreading
Here, "1NDICAT0R" is the OCR result. First, the algorithm randomly extracts some characters from the OCR result ("N, A, R" and "1, D, T"). Second, it searches the database with the extracted characters. In this case, the database is searched for strings that contain the characters in the order "N, A, R" or "1, D, T".

Experimental Result and Discussion
In this research, we captured an image containing the character strings of Figure 3 with a camera, input it into our application, and recognized the character strings in the obtained image. Figure 6 shows the OCR results (left) and the proofreading results (right) without denoising and binarization.

Fig. 6. OCR and proofreading results
Among the 17 character strings included in Figure 6, 4 strings were recognized correctly without any mistakes. After the proofreading process, the number of correct strings increased from 4 to 15.
Figure 7 shows the OCR results (left) and the proofreading results (right) with the proposed denoising and binarization.

Fig. 7. OCR and proofreading results
Among the 17 character strings included in Figure 7, 5 strings were recognized correctly without any mistakes. After the proofreading process, the number of correct strings increased from 5 to 15.
In the results stated above, there was little difference in recognition accuracy with and without denoising and binarization. We think this is because the denoising and binarization built into the OCR software worked well. These results also show that our proofreading algorithm can correct misrecognition.
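Expressed as string-level recognition rates over the 17 test strings, the counts reported above work out as in the short calculation below (the helper name is ours, introduced only for this illustration):

```python
def string_accuracy(correct, total):
    """Percentage of strings recognized without any mistake."""
    return 100.0 * correct / total


TOTAL = 17  # number of character strings in the test image
cases = [
    ("without denoising/binarization", 4, 15),
    ("with denoising/binarization", 5, 15),
]
for label, before, after in cases:
    print(
        f"{label}: {string_accuracy(before, TOTAL):.1f}% before proofreading, "
        f"{string_accuracy(after, TOTAL):.1f}% after"
    )
```

That is roughly 23.5% and 29.4% before proofreading, and 88.2% after proofreading in both settings.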

Conclusions and Future Work
In this paper, we have presented a misrecognition verification algorithm for optical character recognition (OCR) based on a database search method, implemented the algorithm as an Android OCR application, and verified its operation. Although it depends on the recognition accuracy of the OCR itself, our results showed that the recognition accuracy is improved by the proposed method.
In the near future, we plan to improve the algorithm to achieve higher recognition accuracy.

Fig. 1. Flow chart of the application

Fig. 4. Process flow of the proposed algorithm