Development of training data for Optical Character Recognition using deformed printing characters

In recent years, advances in internet technology have changed logistics systems considerably, and international trade has spread widely. As a result, information printed on paper often needs to be converted to digital data on the receiving side, and the use of character recognition in industrial applications is increasing. Printed characters are usually converted to character codes in one of two ways: by typing them into a computer manually, or by using a technology called Optical Character Recognition (OCR). In some cases OCR software does this job well even if a few characters are misconverted, but for a legally binding document a misconversion can cause large losses if a number or a product name is wrong, which is what we want to avoid in this research. Therefore, we propose a denoising and binarization system for OCR and create training data from deformed printed characters to improve the character recognition rate. The simulation results show reasonable character recognition accuracy when the training data is used.


Introduction
Most B2C businesses transmit order information directly over the internet, but in B2B markets order information is, for security reasons, still often sent in a traditional manner, such as a paper business letter or a facsimile. Converting a document transmitted on paper to digital text has several difficulties. For example, noise can make the text image harder to read, and a non-uniform background color in the printed material can make binarization difficult. Character deformation can also occur. Deformation creates a difference between the training data for character recognition included in OCR software and the printed characters that are actually read, and this difference causes incorrect character recognition.
To convert printed characters to character codes, we use Optical Character Recognition (OCR). In recent years various algorithms for OCR have been developed, and character recognition accuracy has improved dramatically as a result. However, even now no OCR software achieves a character recognition accuracy of 100%. Factors that make character recognition difficult include noise and non-uniform background color. Most OCR software applies denoising and binarization before starting character recognition, but certain types of noise cannot be removed. Binarization then cannot be done correctly, which lowers the accuracy rate. In addition, printed characters are often distorted by the transmission system; a facsimile, for example, can lower the resolution of the image and make characters in a small font size unreadable. These distorted characters may be recognized incorrectly.
In this paper, we introduce a denoising and binarization system for OCR, together with OCR training data. The proposed system is built using non-local means denoising and adaptive Gaussian thresholding. The OCR training data is created using deformed printed characters.

Non-Local Means Denoising
To remove the noise contained in the input image, we use a technique called non-local means denoising [1]. The method is based on a simple principle: replacing the color of a pixel with an average of the colors of similar pixels. However, the pixels most similar to a given pixel are not necessarily close to it. We therefore scan a large portion of the image in search of all pixels that resemble the pixel to be denoised, and denoising is done by computing the average color of these most similar pixels. The similarity is evaluated by comparing a whole window around each pixel, not just the color of the pixel itself. This filter is called non-local means and is expressed as
$$NL\,u(p) = \frac{1}{C(p)} \int f\big(d(B(p), B(q))\big)\, u(q)\, dq$$
where $d(B(p), B(q))$ is a Euclidean distance between image patches centered respectively at p and q, f is a decreasing function, and C(p) is the normalizing factor.
The denoising of a color image u = (u1, u2, u3) at a certain pixel p is expressed as
$$\hat{u}_i(p) = \frac{1}{C(p)} \sum_{q \in B(p,r)} u_i(q)\, w(p,q), \qquad C(p) = \sum_{q \in B(p,r)} w(p,q)$$
where i = 1, 2, 3 and B(p, r) indicates a neighborhood centered at p with size (2r+1) × (2r+1) pixels. The weight w(p, q) depends on the squared Euclidean distance $d^2 = d^2(B(p,f), B(q,f))$ of the (2f+1) × (2f+1) color patches centered respectively at p and q:
$$d^2(B(p,f), B(q,f)) = \frac{1}{3(2f+1)^2} \sum_{i=1}^{3} \sum_{j \in B(0,f)} \big(u_i(p+j) - u_i(q+j)\big)^2$$
We use an exponential kernel to compute the weights w(p, q):
$$w(p,q) = \exp\left(-\frac{\max(d^2 - 2\sigma^2,\, 0)}{h^2}\right)$$
where σ denotes the standard deviation of the noise and h is the filtering parameter, set depending on the value of σ. The weight function is chosen so as to average similar patches up to noise. That is, the weights of patches with squared distances smaller than $2\sigma^2$ are set to 1, while the weights for larger distances decrease rapidly according to the exponential kernel. The weight of the reference pixel p in the average is set to the maximum of the weights in the neighborhood B(p, r). This setting avoids excessive weighting of the reference point in the average.
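The pixelwise averaging described in this section can be sketched in plain Python. This is only an illustrative sketch under several assumptions, not the implementation used in the paper: the image is grayscale (so the sum over the three color channels and the factor 3 in the patch distance are dropped), it is stored as a list of rows, patches near the border are completed by clamping coordinates, and the function name `nl_means_gray` and the default choice h = σ are hypothetical.

```python
import math

def nl_means_gray(img, r=3, f=1, sigma=10.0, h=None):
    """Pixelwise non-local means for a grayscale image (list of rows).

    r: half-size of the search window B(p, r)
    f: half-size of the comparison patch B(p, f)
    sigma: assumed standard deviation of the noise
    h: filtering parameter; defaults to sigma (an illustrative assumption)
    """
    if h is None:
        h = sigma
    rows, cols = len(img), len(img[0])

    def at(y, x):
        # Clamp coordinates so patches near the border stay inside the image.
        return img[min(max(y, 0), rows - 1)][min(max(x, 0), cols - 1)]

    def patch_dist2(py, px, qy, qx):
        # Squared Euclidean distance between patches, normalized per pixel.
        total = 0.0
        for dy in range(-f, f + 1):
            for dx in range(-f, f + 1):
                diff = at(py + dy, px + dx) - at(qy + dy, qx + dx)
                total += diff * diff
        return total / (2 * f + 1) ** 2

    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            acc = wsum = wmax = 0.0
            for qy in range(max(0, y - r), min(rows, y + r + 1)):
                for qx in range(max(0, x - r), min(cols, x + r + 1)):
                    if qy == y and qx == x:
                        continue  # the reference pixel is handled below
                    d2 = patch_dist2(y, x, qy, qx)
                    # Weight is 1 for d2 <= 2*sigma^2, then decays exponentially.
                    w = math.exp(-max(d2 - 2 * sigma ** 2, 0.0) / h ** 2)
                    wmax = max(wmax, w)
                    acc += w * img[qy][qx]
                    wsum += w
            # The reference pixel receives the largest neighbor weight.
            acc += wmax * img[y][x]
            wsum += wmax
            out[y][x] = acc / wsum if wsum > 0 else float(img[y][x])
    return out
```

Calling `nl_means_gray(image, r=3, f=1, sigma=10.0)` returns a new list of rows with floating-point values. The nested loops make this sketch slow on full-size scans; optimized library routines such as OpenCV's `fastNlMeansDenoising` exist for practical use, although the text does not state which implementation was used.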

Adaptive Gaussian Threshold
To improve the accuracy of character recognition, we binarize the input image. Creating a binarized image from a grayscale image requires threshold processing. However, binarization with a single threshold can be suboptimal, particularly if the page background is of uneven darkness. We therefore use an adaptive Gaussian threshold, in which the threshold for each target pixel is the weighted average of the pixel values in the region around it. A Gaussian function is used for the weighting:
$$G(x, y) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$
where (x, y) is the offset from the target pixel and σ denotes the standard deviation. Figure 2 and Figure 3 depict threshold processing with a single threshold and with the Gaussian threshold, respectively. Figure 4 shows the image after denoising and binarization.
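The following plain-Python sketch illustrates adaptive thresholding with Gaussian weights. It is a sketch under stated assumptions rather than the paper's implementation: the function names, the window radius, the value of σ, and the constant `offset` subtracted from the local weighted mean are all hypothetical choices (the text does not mention an offset; a similar constant appears in common library routines such as OpenCV's `adaptiveThreshold`). The kernel is normalized to sum to 1, so the constant factor 1/(2πσ²) cancels out.

```python
import math

def gaussian_kernel(radius, sigma):
    """Return a (2*radius+1) x (2*radius+1) Gaussian kernel normalized to sum to 1."""
    k = [[math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
          for dx in range(-radius, radius + 1)]
         for dy in range(-radius, radius + 1)]
    total = sum(sum(row) for row in k)
    return [[v / total for v in row] for row in k]

def adaptive_gaussian_threshold(img, radius=2, sigma=1.5, offset=5.0):
    """Binarize a grayscale image (list of rows) with a per-pixel threshold.

    The threshold at each pixel is the Gaussian-weighted mean of its
    neighborhood minus `offset` (an assumed tuning constant).
    """
    rows, cols = len(img), len(img[0])
    kernel = gaussian_kernel(radius, sigma)
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            local = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    # Clamp coordinates at the image border.
                    yy = min(max(y + dy, 0), rows - 1)
                    xx = min(max(x + dx, 0), cols - 1)
                    local += kernel[dy + radius][dx + radius] * img[yy][xx]
            # White (255) if brighter than the local threshold, else black (0).
            out[y][x] = 255 if img[y][x] > local - offset else 0
    return out
```

Because the threshold follows the local brightness, a dark stroke on a background that brightens gradually across the page can still be separated, even when no single global threshold would work.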

Experimental Results and Discussion
In this paper we created OCR training data for Tesseract from Figure 1 and from Figure 4, and compared the recognition results obtained with the two sets of training data.
Figure 6 shows the results of character recognition without denoising and binarization. Among the 17 character strings included in Figure 6, only one, "INDICATOR 800*1100", was recognized without any mistakes. In Figure 7, on the other hand, "INDICATOR 800*900", "INDICATOR 800*1100", and "INDICATOR 275(25)" were recognized correctly. We believe this is because denoising with non-local means and binarization with the Gaussian threshold make the characters in the input image clearer, so that the OCR software can recognize them more accurately.
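To put the two counts on a common scale, the short sketch below converts them to string-level exact-match rates. It assumes that Figure 7 contains the same 17 character strings as Figure 6, which the text implies but does not state; the function name is hypothetical.

```python
from fractions import Fraction

def exact_match_accuracy(n_correct, n_total):
    """Fraction of character strings recognized without any mistake."""
    return Fraction(n_correct, n_total)

# Counts taken from the text; the total of 17 for Figure 7 is an assumption.
without_preprocessing = exact_match_accuracy(1, 17)
with_preprocessing = exact_match_accuracy(3, 17)

print(f"without denoising/binarization: {float(without_preprocessing):.1%}")  # 5.9%
print(f"with denoising/binarization:    {float(with_preprocessing):.1%}")     # 17.6%
```

Under that assumption the exact-match rate rises from about 5.9% to about 17.6%. With only 17 strings, the difference of two strings should be read as indicative rather than conclusive.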

Conclusions and Future Work
In this paper, we employed denoising and binarization techniques to improve OCR accuracy, and we created training data using deformed printed characters. The results show that recognition accuracy improves with the proposed approach. However, the results were evaluated only under constrained conditions, and more example data are needed to improve its effectiveness.
In the future, we plan to implement a system that corrects misrecognized character strings. In particular, we are considering a correction system that applies a feature matching technique to the character images.

Figure 5: Select character area
Second, we decide which character is contained in the selected area and create a box file. A box file consists of the characters together with their coordinates, heights, and widths. We then create the training data using the box file.
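As an illustration of the box file step, the sketch below turns selected character areas into lines in the Tesseract box file layout, where each line holds the character followed by the left, bottom, right, and top pixel coordinates and a page number, with the origin at the bottom-left corner of the image. The helper names and the input convention (top-left corner plus width and height, as typically produced by an area-selection tool) are assumptions for the example, not details taken from the paper.

```python
def box_line(char, x, y, width, height, image_height, page=0):
    """Format one character area as a Tesseract box file line.

    (x, y) is the top-left corner in image coordinates (origin at top-left);
    the box file uses an origin at the bottom-left, so y values are flipped.
    """
    left = x
    right = x + width
    top = image_height - y
    bottom = image_height - (y + height)
    return f"{char} {left} {bottom} {right} {top} {page}"

def make_box_file(entries, image_height):
    """Build box file content from (char, x, y, width, height) tuples."""
    lines = [box_line(c, x, y, w, h, image_height) for c, x, y, w, h in entries]
    return "\n".join(lines) + "\n"
```

For example, `box_line("A", 10, 20, 8, 12, image_height=100)` yields `A 10 68 18 80 0`. The resulting text would be saved next to the training image under the same base name with a `.box` extension.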