Multi-maxpooling Convolutional Neural Network for Medicinal Herb Leaf Recognition

This paper evaluates leaf recognition performance using a basic Convolutional Neural Network (CNN) and two popular pre-trained CNNs, namely AlexNet and GoogLeNet. CNN is known to depend heavily on a large amount of training data, but the results it achieves are highly accurate. This research seeks an alternative that obtains excellent results with a small amount of training data. A publicly available leaf dataset, Leafsnap, is used in this experiment, and data augmentation is applied to increase the number of training images. The dataset consists of images of different sizes, orientations and shapes. Thus, a robust approach is adopted by adding multi-maxpooling layers before the fully-connected layers of the basic CNN. The experimental results indicate that the performance of the basic CNN can be improved by applying multi-maxpooling layers, producing accuracy nearly as good as AlexNet or GoogLeNet but at a faster rate.


Introduction
Medicinal plants are available throughout the world, including Malaysia, but not many people recognize them or know about their uses, especially for medical purposes. Preserving the knowledge of these medicinal plants is important since it enables the general public to gain useful knowledge which they can apply when necessary. Various studies have been conducted to recognize plants, and the part of the plant most popularly used for plant recognition is the leaf (1)(2)(3)(4)(5). Leaf identification can be performed using two approaches, namely the handcrafted approach and deep learning.
In the handcrafted approach, suitable features and classifiers need to be determined prior to applying them for object recognition. From our previous research 5, texture features such as Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP) and Speeded-Up Robust Features (SURF) produce very promising results for leaf recognition. A combination of texture, color and shape features for leaf recognition also achieves very good accuracy 3. However, identifying a good feature and classifier for any object recognition task may be time consuming, since various experiments on the features need to be conducted in order to achieve the desired accuracy. Furthermore, prior knowledge of the images, such as shape, colors, variation in shapes due to shape deformation or different categories, and variation in colors due to illumination or lighting changes, also needs to be considered so that these factors do not reduce the recognition results.
In recent years, the Convolutional Neural Network (CNN) has gained popularity in computer vision applications, including plant and leaf recognition 1,6,7. The application of deep layers in CNN is possible due to the availability of vast amounts of data and powerful computing devices. Information on features and classifiers is not needed when applying CNN. On the other hand, a large amount of data consumes a long training time.
To overcome this limitation, pre-trained CNN models such as AlexNet 11 and GoogLeNet 12 can be utilized. A basic CNN model consists of three basic layers, namely a convolutional layer, a ReLu layer and a pooling layer. These three layers are called a stack. Having more stacks in a basic CNN may increase the accuracy but also the processing time. Thus, to avoid having a deep CNN with a long processing time, a multi-pooling approach has been proposed for multi-font printed Chinese character recognition, applied before the classification layer, since this approach is robust to spatial layout variations and deformation in the shape of the characters 9. This paper applies a similar multi-pooling 9 approach, since the leaf images also vary in spatial layout and shape.
This paper is organized as follows. Some related work on leaf and plant recognition is explained in Section 2. Experimental results are discussed in Section 3, while Section 4 concludes this research.

Related Work

Convolutional Neural Network (CNN)
The architecture of a typical CNN is structured as a series of layers. A stack of CNN layers consists of a convolutional layer, a Rectified Linear Unit (ReLu) layer and a pooling layer, followed by a fully connected layer (classification layer) 8. The convolutional layer extracts features of an image by using a filter that strides over the input image and produces a feature map. Different filters produce different feature maps that act as feature detectors. Multiple convolutional layers can form different feature maps to ensure full extraction of various features. The ReLu layer replaces all negative pixel values in the feature map with zero. A pooling layer down-sizes the feature map after the ReLu layer to reduce its dimensionality. A typical pooling layer is max-pooling, which computes the maximum of a local region of the feature map; an average pooling layer takes the average instead. Neighboring pooling takes input from feature maps that are shifted, or strided, by more than one row or column. This operation reduces the dimension of the feature maps and provides invariance to distortion or small shifts. The fully connected layer performs the classification process. A simple illustration of the CNN layers is shown in Fig. 1.
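The ReLu and max-pooling operations described above can be sketched in a few lines. The snippet below is an illustrative NumPy implementation (not the MATLAB code used in the paper): ReLu zeroes out negative values, and a 2x2 max-pooling window with stride 2 halves each spatial dimension of the feature map.

```python
import numpy as np

def relu(fmap):
    # ReLu layer: replace all negative values in the feature map with zero
    return np.maximum(fmap, 0)

def maxpool2d(fmap, size=2, stride=2):
    # Max-pooling layer: take the maximum of each size x size local window
    h, w = fmap.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            win = fmap[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = win.max()
    return out

# A toy 4x4 feature map, as might come out of a convolutional layer
fmap = np.array([[ 1., -2.,  3.,  0.],
                 [-1.,  5., -6.,  2.],
                 [ 0.,  1.,  2., -3.],
                 [ 4., -1.,  0.,  1.]])
pooled = maxpool2d(relu(fmap))  # 4x4 -> 2x2
print(pooled)  # [[5. 3.] [4. 2.]]
```

Note how the pooled output keeps the strongest activation in each neighborhood, which is what gives pooling its tolerance to small shifts.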
A pre-trained CNN model like AlexNet, also called a transfer learning model, is one where knowledge is learnt from training on a large dataset. It won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. AlexNet consists of 25 layers that combine a few stacks of convolutional layers and fully connected layers 11. An illustration of the AlexNet layers is shown in Fig. 2. GoogLeNet is the winner of the ILSVRC 2014 competition. It achieved a top-5 error rate of 6.67% 12, which was very close to human-level performance, as the organizers of the challenge were compelled to evaluate. It was realized that image recognition is actually a difficult task even for a human, and that some training was required for a human to perform it. The human expert (Andrej Karpathy) was able to achieve a top-5 error rate of 5.1% (single model) and 3.6% (ensemble). The network is a CNN inspired by LeNet but implements a novel element dubbed the inception module. It used batch normalization, image distortions and RMSprop. The model is based on several very small convolutions in order to drastically reduce the number of parameters. Its architecture consists of 22 layers, but the number of parameters is reduced from 60 million (AlexNet) to 4 million (GoogLeNet). Fig. 3 illustrates the different layers of GoogLeNet 12.

Multi-maxpooling Shallow CNN
The architecture of the CNN used in this study is shown in Fig. 4. It consists of two stacks of layers followed by additional maxpooling layers, a fully-connected layer, a softmax layer and a classification layer. Each stack consists of one convolutional layer, one ReLu layer and one maxpooling layer. The first convolutional layer in the first stack filters the 227x227 input images with 16 kernels of size 9x9 and a stride of 3 pixels. The second convolutional layer in the second stack takes the output of the first maxpooling layer and filters it with 10 kernels of size 9x9 and a stride of 2 pixels. Additional maxpooling layers are added after the second stack, and the accuracy performance is recorded.

Fig. 4. Illustration of the CNN architecture used in this study.
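The spatial size produced by each layer above follows the standard formula floor((n + 2p - f) / s) + 1 for input size n, filter size f, stride s and padding p. The text does not state the maxpooling parameters, so the 2x2/stride-2 pooling in the sketch below is an assumption for illustration only; the convolution parameters are those given for the two stacks.

```python
def output_size(n, f, stride, pad=0):
    # Spatial output size of a convolutional or pooling layer:
    # floor((n + 2*pad - f) / stride) + 1
    return (n + 2 * pad - f) // stride + 1

conv1 = output_size(227, 9, 3)    # 16 kernels of 9x9, stride 3 -> 73
pool1 = output_size(conv1, 2, 2)  # assumed 2x2 maxpool, stride 2 -> 36
conv2 = output_size(pool1, 9, 2)  # 10 kernels of 9x9, stride 2 -> 14
print(conv1, pool1, conv2)  # 73 36 14
```

Each extra maxpooling layer applied after the second stack shrinks this map further, which is why only a few additional pooling layers can be stacked before the map becomes too small.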

Leaf Dataset
There are a few publicly available leaf datasets that have been tested by researchers, such as Flavia 10 and Leafsnap 2. The Flavia dataset consists of 33 species, and its leaf images were captured against a white background. Fig. 5 illustrates some sample images from the Flavia dataset. Fig. 6 illustrates some sample images from the Leafsnap dataset, which consists of 52 species. This research tests the leaf images from the Leafsnap dataset since it has a larger number of species and is more challenging, as there are variations among images even within the same species. The Leafsnap dataset consists of images of different sizes and orientations, with some deformed images and shadows.
Fig. 7 shows some sample images from the Leafsnap dataset from two different species. By looking at Fig. 7, we can see that a robust approach is highly needed to perform the leaf recognition, since leaf images from the same species have different shapes. The images were captured against different background colors, representing natural backgrounds that may vary in color due to illumination changes. Some of the images also contain slight shadows. Data augmentation is performed on each species, where the images are transformed into various sizes and orientations to produce a total of 5200 images.
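The exact augmentation transforms are not specified in the text; one plausible scheme is a fixed grid of rotations and rescalings applied to each source image. The grid below is purely illustrative (the angle and scale values, and the count of five source images per species, are assumptions), chosen so that the total matches the stated 5,200 images for the 52 Leafsnap species.

```python
import itertools

# Hypothetical augmentation grid: each source image is rotated and rescaled
angles = [0, 90, 180, 270]           # orientation variants (assumed)
scales = [0.8, 1.0, 1.2, 1.4, 1.6]   # size variants (assumed)
params = list(itertools.product(angles, scales))  # 20 variants per image

species, images_per_species = 52, 5  # images per species is an assumption
total = species * images_per_species * len(params)
print(total)  # 5200
```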

Experimental Results and Discussions
The experiment is conducted using Matlab 2018a, where eighty percent of the images were randomly selected for training and the remainder for testing. The leaf images were resized to 227x227 pixels. Since AlexNet has fewer layers than GoogLeNet, it produces results faster; however, GoogLeNet achieves slightly higher accuracy than AlexNet. The results are shown in Table 1.
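The 80/20 random split can be sketched as follows. This is an illustrative Python version (the paper's split was done in MATLAB); with 5,200 images it yields 4,160 training and 1,040 test images.

```python
import random

def split_dataset(images, train_frac=0.8, seed=0):
    # Randomly select train_frac of the images for training,
    # keeping the remainder for testing
    rng = random.Random(seed)  # fixed seed for a reproducible sketch
    shuffled = images[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset(list(range(5200)))
print(len(train), len(test))  # 4160 1040
```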
Table 1 also presents the experimental results for the basic CNN with the addition of maxpooling layers. The numbers in square brackets represent the parameter values of the convolutional layer and maxpooling layer. The values outside the square brackets are the parameters of the extra maxpooling layers added to the basic CNN. A few filter sizes and stride values were tested, but the values in Table 1 are the ones that produced significant results based on our experiments and the memory constraints of our computer. The accuracy is computed as the number of correctly predicted images divided by the total number of tested images.
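The accuracy metric just defined is simply the fraction of correct predictions; the species names below are placeholders for illustration.

```python
def accuracy(predicted, actual):
    # Accuracy = number of correctly predicted images / total tested images
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Two of three hypothetical test images classified correctly -> 0.667
print(round(accuracy(["oak", "maple", "oak"],
                     ["oak", "maple", "elm"]), 4))
```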
From Table 1, we can see that the results for 1 stack show an increase in accuracy as maxpooling layers are added to the CNN, but accuracy begins to decrease when the number of maxpooling layers reaches 5. When another stack is added along with the additional maxpooling layers, the accuracy increases further. The best accuracy obtained is 0.9702, which is relatively close to the results produced by AlexNet and GoogLeNet, yet achieved at a much faster rate. This shows that the additional maxpooling layers can recognize leaf images with various orientations and deformations as well as AlexNet and GoogLeNet can. The results also illustrate that a small pooling size is robust to various shapes and layouts.
This research also captured ten different images from each of 42 different species of Malaysian medicinal herb plants. Then, data augmentation was performed, where the images were transformed into various sizes and orientations to produce 4200 images. Fig. 8 illustrates some sample images of Malaysian medicinal herb leaves from three species.
AlexNet, GoogLeNet and the basic CNN with multi-maxpooling layers were applied to our own leaf dataset, and the results produced were 0.91, 0.99 and 0.90 respectively.
This shows that the additional multi-maxpooling layers are robust and can achieve results nearly as accurate as the pre-trained CNN models.

Conclusions
In this paper, we propose adding multi-maxpooling layers before the fully-connected layer, which makes the basic CNN robust to images with different orientations, shapes or deformations, and even with shadows. The results indicate that very good accuracy can still be achieved with fewer training images and shallow layers by utilizing a basic CNN with multi-maxpooling. Future work is to test with other datasets and to apply this approach on mobile devices.

Fig. 2 .
Fig. 2. An illustration of the architecture of AlexNet 11.

Fig. 7 .
Fig. 7. Some sample leaf images with different orientations and shapes from two different species from the Leafsnap dataset 2.

Fig. 8 .
Fig. 8. Some sample images of Malaysian medicinal herb leaves from three species.

Table 1 .
Accuracy results for leaf recognition for Leafsnap dataset.