Optical Character Recognition using Global and Local Features with Modified K-means and Neural Network

This paper presents a hierarchical classification approach for Optical Character Recognition (OCR) using global and local features with modified k-means and a neural network. A hierarchical classification strategy, comprising training and recognition phases, was designed to recognize characters precisely. In the training phase, the global features of sample characters and modified k-means clustering are employed to quickly categorize the characters into several clusters. Each cluster and the corresponding local features of its sample characters are then fed into a Back-Propagation Neural Network (BPN) to learn the optimal weights. In the recognition phase, both global and local features are extracted from the test characters. Each character is then recognized using the trained cluster centers for coarse recognition and the trained neural network for fine recognition. Experiments on three datasets yielded classification rates of 97.22%, 100% and 90.28%, respectively.


Introduction
Optical character recognition (OCR) systems are widely used in industrial applications (1), and OCR has become one of the most essential applications of technology in the field of pattern recognition and artificial intelligence. It contributes significantly to the advancement of automation and improves the interface between human and machine. In general, a character recognition algorithm consists of two main stages: feature extraction and classification.
Both feature extraction and feature selection are very important in achieving high recognition performance in an OCR system. Several feature extraction techniques for character recognition have been reported (2,3). Identifiable features extracted from the segmented characters improve the recognition rate and reduce misclassification. Feature selection also has a strong effect on the identification result; the selected features must therefore discriminate sufficiently between segmented characters and be powerful enough to classify all relevant classes correctly. In addition, the classification technique is of vital importance in an OCR system. Many classification techniques have been introduced (4), including statistical methods, Artificial Neural Networks (ANNs), Support Vector Machines (SVM) and multiple classifier combination, and the strengths and weaknesses of each have been discussed. Among the methods proposed, the artificial neural network has been the most frequently used tool for classification problems.
Sophisticated neural network classifiers and novel feature extraction techniques have been proposed for achieving high recognition performance (5-8). In (5), different classifiers, including multiple Multilayer Perceptron (MLP), Hidden Markov Model (HMM) and structure adaptive self-organizing map (SOM), are employed to solve difficult problems such as handwritten numeral recognition. The results show that the multiple MLP classifier is preferable to the others. To improve recognition accuracy, novel feature extraction techniques have been proposed (6,7). In these studies, features are obtained from three different orientations (horizontal, vertical and diagonal), and they increase the recognition rate of handwritten character recognition systems. Hybrid features are introduced to recognize English handwritten characters in (8); they are obtained by combining the features extracted using diagonal, directional, Principal Component Analysis (PCA) and geometry feature extraction techniques, and are then fed into a feed-forward neural network for classification. Other methods use different classification strategies. A hierarchical classification was performed in (9). In this approach, the features are extracted at different resolution levels, and at higher levels only the sub-regions of the image that meet certain acceptance criteria are considered. In (10,11), another hierarchical classification approach involving new structural features was introduced, in which features are extracted using recursive subdivisions of the character image. Lower levels are used for preliminary discrimination, whereas higher levels help distinguish between characters of similar shape that are confused when only lower levels are used.
As mentioned above, various features are computed and extracted in an OCR system, although some of the extracted features may contribute nothing to the recognition stage. Redundant features may therefore degrade the recognition results and increase the time complexity of the recognition process (12). This paper proposes a hierarchical classification approach to OCR using global and local features with modified k-means and a neural network. The approach combines four different features, namely aspect ratio, skeleton area, geometrical distance and diagonal feature. It is implemented as a two-stage, coarse-to-fine scheme in which characters are classified into a candidate group in the first stage and the test characters are then recognized in the second stage. Although the proposed approach uses only a few features as inputs to the neural network, high recognition accuracy is achieved.
The remaining sections of this paper are organized as follows. In Section 2, the proposed hierarchical OCR algorithm is presented. Section 3 describes the feature extraction methodology. Section 4 presents the classification methods, while experimental results are discussed in Section 5. Finally, conclusions are drawn in Section 6.

Proposed hierarchical OCR algorithm
The main procedure of the proposed algorithm is

Feature extraction
The most critical aspect of character recognition is how to select suitable features for feature extraction. The selected features should be distinct and reasonably invariant with respect to variations in character shape caused by the imaging environment. This study used four different approaches for extracting global and local features. In the recognition phase, global features are employed to achieve the preliminary discrimination in the first stage of the proposed classification strategy. In contrast, local features are employed to precisely recognize the test characters in the second stage.

Global feature
The global features are extracted using three methods, namely aspect ratio, skeleton area and geometrical distance, as shown in Fig. 2 and described below.
Aspect Ratio: The aspect ratio describes the proportional relationship between the width and height of the bounding rectangle of the character, and it is invariant to character size. The feature is computed from these two dimensions, as in Eq. (1).
Skeleton Area: Skeletons are important shape descriptors in object representation and recognition. Typically, the skeletons of volumetric models are computed using iterative thinning. Fig. 2(b) shows the skeleton of a character, whose area is the number of skeleton pixels.
Geometrical Distance: This property is an average distance between the object pixels of the image and the corners of the bounding rectangle. There are four corner points ($P_1$ to $P_4$) of the bounding rectangle, as seen in Fig. 2(c). The feature formula is defined in Eq. (2), where $n$ is the number of points labeled as object and $S_{ij}$ refers to the distance between the point $i$ located in the object and the corner point $j$.
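The three global features can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the aspect ratio is assumed to be width divided by height, the geometrical distance is assumed to be the plain mean of $S_{ij}$ over all object pixels and the four corners, and the skeleton is assumed to have been produced already by some thinning routine.

```python
import math

def bounding_box(img):
    """Return (top, left, bottom, right) of the foreground (non-zero) pixels."""
    rows = [r for r, row in enumerate(img) if any(row)]
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    return rows[0], cols[0], rows[-1], cols[-1]

def aspect_ratio(img):
    """Width / height of the bounding rectangle (orientation is an assumption)."""
    top, left, bottom, right = bounding_box(img)
    return (right - left + 1) / (bottom - top + 1)

def skeleton_area(skeleton):
    """Number of foreground pixels in an already-thinned skeleton image."""
    return sum(1 for row in skeleton for v in row if v)

def geometrical_distance(img):
    """Mean Euclidean distance from object pixels to the four box corners
    (assumed form; the paper's exact normalization is not reproduced here)."""
    top, left, bottom, right = bounding_box(img)
    corners = [(top, left), (top, right), (bottom, left), (bottom, right)]
    points = [(r, c) for r, row in enumerate(img) for c, v in enumerate(row) if v]
    total = sum(math.hypot(r - cr, c - cc) for r, c in points for cr, cc in corners)
    return total / (len(corners) * len(points))

# Toy glyph: a one-pixel-wide vertical bar, four pixels tall.
glyph = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
print(aspect_ratio(glyph), skeleton_area(glyph), geometrical_distance(glyph))
```

Together the three values form the three-dimensional global feature vector used for coarse classification.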

Local feature
The character image is resized to 30 × 40 pixels and the image is divided into

Modified k-means clustering
Cluster analysis is a useful technique for classifying large amounts of data into subsets. Large data sets are thereby organized into an efficient and useful representation that characterizes each cluster being sampled. K-means is one of the most popular and efficient clustering methods for minimizing clustering error. Classic k-means requires the number of clusters and the initial cluster centers to be specified in advance; hence, its performance depends heavily on the initial conditions. To overcome these problems, the max-min principle has been proposed to determine the initial conditions [13]. To determine the number of clusters, the maximum variation of the standard deviation over a range of cluster numbers is used to obtain the optimal number of clusters [14]. The implementation procedure of the modified k-means clustering used in this paper is shown in Fig. 4. The modified k-means clustering combines the advantages of both methods and provides a more robust clustering result. In the modified k-means clustering procedure, the initial and maximum numbers of clusters are 2 and 9, respectively. The classification performance deteriorates if the maximum number of clusters is less than 2 or more than 9.
Once the optimal clustering result is obtained, the boundary layer of each cluster is extended. This process is vital to the correct recognition of test characters. The clustering method yields the original boundaries of each cluster. However, misclassification occurs in the recognition phase when the global features of a test character lie near a decision boundary. Therefore, the original boundary is spread further by a boundary-layer parameter, as shown in Fig. 5, and the clusters are re-defined by the extended boundaries. Hence, if a data point is located in the overlapping region, it belongs to both neighboring clusters simultaneously.
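The effect of the extended boundary can be sketched as follows. This is only one possible reading: the paper does not state here how the boundary is spread, so the additive margin `delta` on the nearest-center distance, and the function name `candidate_clusters`, are assumptions made for illustration.

```python
import math

def candidate_clusters(x, centers, delta):
    """Return the indices of all clusters that a feature vector x falls into
    once every decision boundary is relaxed by the margin delta."""
    distances = [math.dist(x, c) for c in centers]
    nearest = min(distances)
    # A cluster is kept when its center is at most delta farther away
    # than the nearest center, so points near a boundary get two labels.
    return [j for j, d in enumerate(distances) if d <= nearest + delta]

centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(candidate_clusters((0.1, 0.0, 0.0), centers, delta=0.25))   # [0]
print(candidate_clusters((0.45, 0.0, 0.0), centers, delta=0.25))  # [0, 1]
```

A point that receives two labels would be passed to the networks of both clusters in the fine stage, which avoids committing to the wrong cluster at the coarse stage.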
Global feature extraction (Section 3.1) yields three features for each character, and this three-dimensional information is fed into the modified k-means clustering to perform the coarse classification in the first stage. The details are as follows.
Step 1: Set a range of k. In this study, k ranges from 2 to 9.
Step 2: Obtain the initial center of each cluster using the max-min clustering algorithm. This method uses the farthest distances between data points to choose the centers of the k clusters.
Step 3: Assign each data point to the cluster whose center is at the minimum distance from it.
Step 4: Recalculate the center of each cluster as the average of the data points in the cluster. Repeat Steps 3 and 4 until the center values become constant.
Step 5: Compute and record the standard deviation for each value of k. The standard deviation $S_k$ is given by Eq. (5), where $S_k$ is the standard deviation for k clusters, k represents the number of clusters, x is a feature point, v is the cluster center and n is the number of feature points. The standard deviation between k and l clusters, $S_{kl}$, is calculated using Eq. (6), where l = k + 1.
The optimal number of clusters is selected as the value of l that maximizes $S_{kl}$, as in Eq. (7).
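The steps above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' code: the first max-min center is taken as the point farthest from the data mean, $S_k$ is approximated by the root-mean-square distance of points to their own cluster center, and $S_{kl}$ is assumed to be the drop $S_k - S_l$. All function names are hypothetical.

```python
import math

def max_min_centers(points, k):
    """Step 2: choose k initial centers by the max-min (farthest-point) rule."""
    mean = [sum(col) / len(points) for col in zip(*points)]
    centers = [max(points, key=lambda p: math.dist(p, mean))]  # assumed start
    while len(centers) < k:
        # Next center: the point whose nearest existing center is farthest away.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def k_means(points, k, max_iter=100):
    """Steps 3 and 4: assign to the nearest center, then recompute the means."""
    centers = max_min_centers(points, k)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

def spread(points, centers, labels):
    """Step 5 stand-in: RMS distance of the points to their own cluster center."""
    sq = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return math.sqrt(sq / len(points))

def choose_k(points, k_min=2, k_max=9):
    """Select l = k + 1 with the largest drop S_k - S_l (assumed form of S_kl)."""
    s = {k: spread(points, *k_means(points, k)) for k in range(k_min, k_max + 1)}
    return max(range(k_min + 1, k_max + 1), key=lambda l: s[l - 1] - s[l])

# Three well-separated groups of three-dimensional feature vectors.
points = [(0, 0, 0), (0, 1, 0), (1, 0, 0),
          (10, 0, 0), (10, 1, 0), (11, 0, 0),
          (0, 10, 0), (0, 11, 0), (1, 10, 0)]
best_k = choose_k(points, k_min=2, k_max=5)
centers, labels = k_means(points, best_k)
print(best_k, centers)
```

Because the max-min rule is deterministic, repeated runs on the same data give the same clusters, which is the main practical benefit over random initialization.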

Neural network (NN)
In this paper, the Multilayer Perceptron (MLP) neural network model is adopted. MLP is known as a supervised network because it requires a desired output to learn the optimal weights. A two-layered neural network with a single hidden layer is shown in Fig. 6. The network is fully connected between adjacent layers. The operation of this network can be regarded as a nonlinear decision-making process. Error back-propagation (BPN) is a common method for training the weights between layers. In this study, gradient-descent back-propagation is employed to determine the optimal weights. Given a known input, the network output is computed as in Eq. (8), where $w$ is a weight from the $k$th hidden node to the $j$th class output and $f$ is an activation function.
The results are passed through a nonlinear activation function; in this study, the sigmoid function in Eq. (9) is used for both the hidden and output units. The node having the maximum value is selected as the corresponding class.
$$f(x) = \frac{1}{1 + e^{-x}} \qquad (9)$$
In general, the number of available features and the number of classes are both large in an OCR system. However, a network of finite size often cannot learn a particular mapping completely. Therefore, each single network is assigned only a specific part of the complete mapping, which it can learn better. A hybrid coarse-to-fine recognition method that considers both global and local features is developed. In the first stage, the modified k-means clustering performs a coarse classification that routes the test characters to specific networks, and the output of the selected network then gives the final recognition result. The network architecture is shown in Fig. 7.
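The forward pass of such a network, with sigmoid units and selection of the maximum output node, can be sketched as below. The weights are hand-set and purely hypothetical, and the bias terms are an assumption since the text does not mention them; training by back-propagation is not shown.

```python
import math

def sigmoid(x):
    """Logistic activation used for both hidden and output units."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One fully connected layer followed by the sigmoid activation."""
    return [sigmoid(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def classify(x, hidden_w, hidden_b, out_w, out_b):
    """Forward pass; the output node with the maximum value gives the class."""
    hidden = layer(x, hidden_w, hidden_b)
    output = layer(hidden, out_w, out_b)
    return max(range(len(output)), key=output.__getitem__)

# Hypothetical hand-set weights: 2 inputs, 2 hidden nodes, 2 classes.
hidden_w = [[4.0, -4.0], [-4.0, 4.0]]
hidden_b = [0.0, 0.0]
out_w = [[3.0, -3.0], [-3.0, 3.0]]
out_b = [0.0, 0.0]

print(classify([1.0, 0.0], hidden_w, hidden_b, out_w, out_b))  # 0
print(classify([0.0, 1.0], hidden_w, hidden_b, out_w, out_b))  # 1
```

In the coarse-to-fine scheme, one such network would be kept per cluster, and the cluster chosen in the first stage decides which network receives the local features.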

Experimental results
Two fonts, Times New Roman and Arial, were tested. The neural network was trained using three datasets of 26 letters and 10 numbers (108 samples) for each font. In the recognition experiment, the test data comprise four datasets (144 samples) for each font, generated by morphological operations. As shown in Table 1, the optimal number of clusters is 3 for Times New Roman and 6 for Arial. Table 2 lists the parameters of the neural network, which are kept fixed during testing. The recognition rates obtained with the proposed classification are shown in Table 3. The cluster boundary layers are 0.25, 0.1 and 0.25 for Times New Roman, Arial, and Times New Roman with Gaussian blur, respectively. As seen in Table 3, the proposed method achieves a recognition rate of 100% for Arial characters, 97.22% for Times New Roman characters and 90.28% for Times New Roman characters with Gaussian blur. The misclassified Times New Roman characters, shown in Fig. 9, are recognized as 'H', '1', '3' and '1', respectively. The misclassified characters for Times New Roman with Gaussian blur are shown in Fig. 10. As can be seen, these test characters are incomplete and distorted, which significantly affects the extraction of both global and local features.
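The reported rates are consistent with a test set of 144 characters per case, as a short check shows. The count of four errors for Times New Roman follows from the four misrecognized characters listed above; the count of 14 errors for the blurred case is inferred from the reported 90.28% and is an assumption, not a figure stated in the text.

```python
def recognition_rate(correct, total):
    """Recognition rate in percent, rounded to two decimals."""
    return round(100.0 * correct / total, 2)

total = 4 * 36  # four test datasets of 26 letters and 10 numbers

print(recognition_rate(total - 4, total))   # Times New Roman: 97.22
print(recognition_rate(total, total))       # Arial: 100.0
print(recognition_rate(total - 14, total))  # Gaussian blur (inferred): 90.28
```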

Conclusions
In this paper, a hierarchical classification strategy using global and local features with modified k-means and a neural network was designed to recognize characters precisely. In the training phase, the number of clusters and the optimal weights of the BPN are determined from the global and local features, respectively. In the recognition phase, the coarse-to-fine classification is employed to improve the recognition rate for test characters with changes in thickness and deformation. The experimental results show that the proposed method yields recognition accuracies of 97.22%, 100% and 90.28% for Times New Roman, Arial and blurred Times New Roman characters, respectively.

$N_L$ rectangular sub-images, as shown in Fig. 3. Each sub-image is m × n pixels and has (m + n) - 1 diagonal lines. The sub-feature of each diagonal line, $D(i)$, is obtained from the sum of all pixels along the corresponding diagonal line. Thus, (m + n) - 1 sub-features are obtained from each sub-image, and these sub-feature values are averaged to obtain a single feature value for the sub-image. This step is sequentially repeated for all the sub-images, with r running from 1 to $N_L$. For sub-images whose diagonals are empty of foreground pixels, the corresponding feature values are zero. Finally, $N_L$ features are extracted from each character. In this paper, the value of $N_L$ is 12 and both m and n are 10.
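The diagonal feature described above can be sketched as follows. The function name and the row-major ordering of the sub-images are assumptions, as is the choice of diagonal direction; the direction does not change the averaged value, because every pixel of a sub-image lies on exactly one diagonal either way.

```python
def diagonal_features(img, m=10, n=10):
    """Return one averaged diagonal feature per m x n sub-image (row-major)."""
    height, width = len(img), len(img[0])
    features = []
    for top in range(0, height, m):
        for left in range(0, width, n):
            # Pixels sharing the same (row + col) offset lie on one diagonal,
            # which gives (m + n) - 1 diagonal lines per sub-image.
            sums = [0] * (m + n - 1)
            for r in range(m):
                for c in range(n):
                    sums[r + c] += img[top + r][left + c]
            # Average the diagonal sums into a single feature value.
            features.append(sum(sums) / len(sums))
    return features

# Binary image of 40 rows by 30 columns with only the first sub-image filled.
image = [[0] * 30 for _ in range(40)]
for r in range(10):
    for c in range(10):
        image[r][c] = 1

feats = diagonal_features(image)
print(len(feats), feats[0])
```

With m = n = 10 on a 30 × 40 image, this yields the 12 local features per character, and empty sub-images contribute zeros.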

For recognition, a coarse-to-fine scheme is employed. Characters with similar shapes (e.g., '0' and 'O', or '2' and 'Z') often cause confusion, and the proposed two-stage classification is designed to solve this problem. In the first stage, the modified k-means clustering

Table 2. Details of the proposed neural network.

Table 3. Performance of the proposed classification.

Table 1. Results of the modified k-means.
Font               k
Times New Roman    3
Arial              6

Fig. 10. Misclassification of Times New Roman with Gaussian Blur.