Performance Analysis of Classifiers with Feature Selection and Optimization in CBIR System for Biological Images

The core objective of this paper is to improve the performance of a Content Based Image Retrieval (CBIR) system for biological images through intelligent selection of the most discriminative features from a canonical feature set. The performance of the CBIR system can be further enhanced by proper choice of classifier and fine tuning of its model parameters to improve classification accuracy. We extracted the canonical set of feature vectors from biological images using a popular tool called Weighted Neighbor Distance using Compound Hierarchy of Algorithms Representing Morphology (Wndchrm). We adopted a two-step approach for the selection of features. The first step partitions the canonical set of feature vectors into four distinct sets based on the methodology and algorithms used for feature extraction. The second step performs Principal Component Analysis (PCA) and Fisher score based selection of features from the partitioned sets. The optimum set of feature vectors thus obtained is applied as training patterns to different classifier implementations, namely Bayesian and Support Vector Machine (SVM) classifiers. We used the IICBU-2008 benchmark dataset [5, 7] of biological images for our experiments and compared our results with those available for the wndchrm classifier. The results show that careful selection of features and optimization of classifier performance lead to efficient implementation of CBIR systems.


Introduction
Content based image retrieval (CBIR) systems are becoming popular these days. These systems help in retrieving desired images from a database containing a fairly large number of images of multiple categories. The schematic of a typical CBIR system is shown in Figure 1. Most CBIR systems make use of relevance feedback to improve system performance. We consider the design of a CBIR system for biological images. An exclusive CBIR system for biological images would greatly help the scientific community in analyzing, storing and retrieving biological images. We consider a two-step process for the optimization of classifier performance in the CBIR system. The first step is to select features from a canonical set of features to obtain partitioned sets of feature vectors. The selection is based on the methodology and algorithms used for feature extraction. The second step is to select optimum features using principal component analysis (PCA) and a feature selection methodology based on the Fisher score. After performing the above mentioned feature extraction and selection processes, we applied these patterns to different classifier implementations, namely a Bayesian classifier and a support vector machine (SVM) based classifier. We analyzed the performance of these classifiers with proper selection of the model parameters. The remaining part of this paper is organized as follows: Section 2 describes the various methodologies adopted for feature extraction and selection. In Section 3 we explain the basic theory associated with the implementation of the Bayesian and SVM based classifiers, and also give an overview of the wndchrm [2,3] classifier. In Section 4 we present the details of the experiments and the subsequent observations. We conclude our discussion in Section 5.

Canonical Set of Feature Vectors
Wndchrm [2,3] is an open source utility for biological image analysis. The software enables the user to extract features from an image. Wndchrm offers two options, allowing the user to extract either a smaller set of 1025 features or a larger set of 2659 features per image.

Figure 1. Schematic of a general CBIR system
The canonical set of feature vectors extracted by wndchrm contains Radon transform features [4], Chebyshev transform features, Gabor filter features, multi-scale histogram features, the first four moments, the Tamura texture features of contrast, directionality and coarseness, edge and object statistics features, Zernike features, Haralick features and Chebyshev-Fourier features.

Selection of Features
In this paper we select 4 sets of feature vectors for each image from the original canonical feature vector set. They are:
1) Set 1: Basic features, including the Radon transform, Chebyshev transform, Zernike and Haralick features.
2) Set 2: Transforms of features, such as the Wavelet transform of Chebyshev features and the Fourier transform of Haralick features.
3) Set 3: Transforms of transforms of features, such as the Fourier transform of the Wavelet transform of Haralick features.
4) Set 4: Edge features and transforms of edge features.
After partitioning the canonical feature vector set into 4 distinct feature vector sets, we further optimize the number of features using principal component analysis (PCA) and Fisher score based methods.

Principal Component Analysis
Principal Component Analysis is a methodology for finding low dimensional representations (projections) of high dimensional data that still capture most of the variability and discriminant information of the original representation. We summarize the steps for finding the principal components:
1) Find the mean µ of the feature set {x_i}, i = 1, ..., N, where N is the total number of examples in the training set.
2) Subtract the mean from each example to center the data.
3) Find the covariance matrix of the entire training data.
4) Find the eigenvalues and eigenvectors of the covariance matrix.
5) Sort the eigenvalues in descending order.
6) Set a threshold on the eigenvalues and retain the eigenvectors whose eigenvalues exceed this threshold. Project the data onto the retained eigenvectors to obtain the reduced dimensional representation.
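The steps above can be sketched in a few lines of numpy. This is a minimal illustration, not the implementation used in our experiments; the function name and the explained-variance form of the eigenvalue threshold are our own choices.

```python
import numpy as np

def pca_reduce(X, var_threshold=0.95):
    """Project N x d data onto the leading principal components,
    keeping enough components to explain `var_threshold` of the
    total variance (the eigenvalue threshold of step 6)."""
    mu = X.mean(axis=0)                        # step 1: mean of the feature set
    Xc = X - mu                                # step 2: center the data
    cov = np.cov(Xc, rowvar=False)             # step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # step 4: eigenvalues/eigenvectors
    order = np.argsort(eigvals)[::-1]          # step 5: sort descending
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, var_threshold)) + 1  # step 6: threshold
    # project the data in the direction of the retained eigenvectors
    return Xc @ eigvecs[:, :k], mu, eigvecs[:, :k]
```

A new feature vector can then be reduced with `(x - mu) @ W`, where `W` is the returned eigenvector matrix.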

Fisher Score based Feature Selection
The objective of the Fisher score is to obtain a subset of features such that, in the data space spanned by the selected features, the distances between data points in different classes are maximized, while the distances between data points in the same class are kept as small as possible [10]. The Fisher score of the j-th feature is evaluated as [10]:

F_j = Σ_{k=1}^{C} n_k (µ_{j,k} − µ_j)² / σ_j²

where µ_{j,k} is the mean of the j-th feature for the examples of the k-th class, and n_k is the number of examples in the k-th class. µ_j and σ_j denote the mean and standard deviation of the j-th feature over the whole data set. The total variance of the j-th feature is given by σ_j² = Σ_{k=1}^{C} n_k σ_{j,k}², where σ_{j,k} is the standard deviation of the j-th feature for the examples of the k-th class. The total number of classes is denoted by C.
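Under this definition the score can be computed per feature with a short numpy routine. This is a sketch of the formula from [10]; the function names are ours.

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher score of each feature. X: (N, d) feature matrix,
    y: (N,) integer class labels. Higher = more discriminative."""
    classes = np.unique(y)
    mu = X.mean(axis=0)                      # overall mean of each feature
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for k in classes:
        Xk = X[y == k]
        nk = len(Xk)
        num += nk * (Xk.mean(axis=0) - mu) ** 2  # between-class scatter
        den += nk * Xk.var(axis=0)               # within-class scatter
    return num / den

def select_top(X, scores, m):
    """Keep the m highest-scoring features."""
    idx = np.argsort(scores)[::-1][:m]
    return X[:, idx], idx
```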

Bayesian Classifiers
Approaches based on statistical pattern classification make use of Bayesian decision theory [6]. This approach makes decisions using probability while minimizing the risk involved. Bayes formula can be expressed as

P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x)

Pattern classifiers can be implemented using different methodologies. An efficient way of building a classifier is to use a set of discriminant functions g_i(x) for i = 1, ..., C. The classifier assigns a feature vector x to class ω_i if g_i(x) > g_j(x) for all j ≠ i [6].
We assume that the density distribution of the data set is multivariate Gaussian. Under this assumption, the discriminant functions yield minimum-error-rate classification. The discriminant function can be evaluated by substituting the multivariate normal density for p(x | ω_i). Then

g_i(x) = −(1/2)(x − µ_i)ᵀ Σ_i⁻¹ (x − µ_i) − (d/2) ln 2π − (1/2) ln |Σ_i| + ln P(ω_i)
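As a small numeric illustration of Bayes' formula, consider a hypothetical one-dimensional problem with two classes (the priors, means and standard deviations below are invented for illustration, not values from our experiments):

```python
import numpy as np

# Hypothetical 1-D example: two classes with known priors P(w_i) and
# Gaussian class-conditional densities p(x | w_i).
priors = np.array([0.6, 0.4])
means = np.array([0.0, 3.0])
sigmas = np.array([1.0, 1.0])

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def posterior(x):
    """Bayes formula: P(w_i | x) = p(x | w_i) P(w_i) / p(x)."""
    joint = gaussian_pdf(x, means, sigmas) * priors  # p(x | w_i) P(w_i)
    return joint / joint.sum()                       # divide by evidence p(x)
```

A sample near one of the class means receives a posterior close to 1 for that class, so taking the argmax over the posteriors implements the decision rule.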

Case 1: Σ_i = σ²I
In case 1 the features are assumed to be statistically independent, and under this assumption each feature has the same variance σ². The covariance matrix is diagonal, being merely σ² times the identity matrix I. Geometrically, this situation corresponds to the samples falling in equal-size hyper-spherical clusters, the cluster for the i-th class being centered about the mean vector µ_i. Since both |Σ_i| and the (d/2) ln 2π term are independent of i, they can be ignored. We can deduce the simple linear discriminant function

g_i(x) = w_iᵀx + w_{i0}, where w_i = µ_i / σ² and w_{i0} = −µ_iᵀµ_i / (2σ²) + ln P(ω_i)

If the prior probabilities P(ω_i) are the same for all C classes, then the ln P(ω_i) term can also be ignored. The optimum decision rule can then be stated as: to classify a feature vector x, measure the Euclidean distance ‖x − µ_i‖ from x to each of the C mean vectors, and assign x to the class of the nearest mean. This type of classifier is called a minimum distance classifier [6].
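The resulting minimum distance classifier is nearly a one-liner in numpy (an illustrative sketch; the function name is ours):

```python
import numpy as np

def min_distance_classify(x, class_means):
    """Case 1 rule (equal priors, covariance sigma^2 I): assign x to
    the class whose mean vector is nearest in Euclidean distance."""
    dists = np.linalg.norm(class_means - x, axis=1)
    return int(np.argmin(dists))
```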

Case 2: Σ_i = Σ
Case 2 deals with the condition that all the classes have the same covariance matrix, which is otherwise arbitrary. This leads to the geometrical situation that the samples fall in hyper-ellipsoidal clusters of equal size and shape, the cluster for the i-th class being centered about the mean vector µ_i. Since both |Σ| and the (d/2) ln 2π term are independent of i, they can be ignored, which leads to the discriminant functions

g_i(x) = −(1/2)(x − µ_i)ᵀ Σ⁻¹ (x − µ_i) + ln P(ω_i)

If the prior probabilities P(ω_i) are the same for all C classes, then the ln P(ω_i) term can be ignored. In this case, the optimal decision rule can once again be stated as: to classify a pattern x, measure the squared Mahalanobis distance (x − µ_i)ᵀ Σ⁻¹ (x − µ_i) from x to each of the C mean vectors, and assign x to the category of the nearest mean [6]. The discriminant functions are again linear and given as follows:

g_i(x) = w_iᵀx + w_{i0}, where w_i = Σ⁻¹µ_i and w_{i0} = −(1/2)µ_iᵀΣ⁻¹µ_i + ln P(ω_i)

In the most general case for the assumption of multivariate normal densities, the covariance matrices are different for each category. In this situation, only the (d/2) ln 2π term can be dropped, which leads to quadratic discriminant functions:

g_i(x) = −(1/2)(x − µ_i)ᵀ Σ_i⁻¹ (x − µ_i) − (1/2) ln |Σ_i| + ln P(ω_i)

The decision surfaces are of complex shapes, which include hyper-quadrics, hyper-spheres, hyper-ellipsoids, hyper-paraboloids, and hyper-hyperboloids of various types [6].
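A minimal sketch of the case 2 rule, assuming a shared covariance matrix estimated elsewhere (the function name and the optional priors argument are ours):

```python
import numpy as np

def mahalanobis_classify(x, class_means, cov, priors=None):
    """Case 2 (shared covariance): assign x to the class minimizing the
    squared Mahalanobis distance (x - mu_i)^T Sigma^{-1} (x - mu_i),
    with an optional ln P(w_i) correction for unequal priors."""
    cov_inv = np.linalg.inv(cov)
    scores = []
    for i, mu in enumerate(class_means):
        d = x - mu
        g = -0.5 * d @ cov_inv @ d       # discriminant up to constants
        if priors is not None:
            g += np.log(priors[i])
        scores.append(g)
    return int(np.argmax(scores))
```

Note that with an anisotropic covariance this rule can assign a point to a class other than its Euclidean nearest mean, since distances are scaled per direction.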

Support Vector Machine based Classifiers
SVM is a popular binary learning machine with some highly desirable properties. The concept of the SVM is based on two mathematical operations [9]: (1) the mapping of the training patterns, using a non-linear mapping, into a high dimensional feature space that is hidden from both input and output; (2) the construction of a maximum margin hyper-plane separating the patterns obtained from the non-linear transformation in step 1. The optimal separating hyper-plane is completely determined by the support vectors, which lie on the marginal hyper-planes. Thus, the SVM gives an analytical approach for finding the optimum separating hyper-plane [9]. Consider a training set T = {(x_i, d_i)}, i = 1, ..., N, where x_i is the input pattern of the i-th example and d_i ∈ {−1, +1} is the corresponding desired response. The SVM constructs the decision surface such that the margin of separation between the positively and negatively labeled examples is maximized. The equation of the optimum hyper-plane for linearly separable patterns is given by w_oᵀx + b_o = 0, where w_o is the optimum weight vector, x is the input pattern and b_o is the optimum bias [9].
In the case of non-linearly separable classes, the margin of separation is said to be soft if a data pattern x_i, i = 1, ..., N, violates the condition d_i(wᵀx_i + b) ≥ 1. The violations are of two types. In the first, the data point falls inside the margin of separation but on the correct side of the decision surface; here there is no misclassification. In the second, the data point falls on the wrong side of the decision surface, which leads to misclassification of the pattern. So, a new set of non-negative scalar variables known as slack variables {ξ_i}, i = 1, ..., N, is introduced, and the hyper-plane condition becomes d_i(wᵀx_i + b) = 1 − ξ_i, where ξ_i measures the deviation of the data point from the ideal condition of pattern separation [9]. SVMs can also be viewed as kernel machines that can use different Mercer kernels. These include the linear kernel xᵀx_i, the polynomial kernel (xᵀx_i + 1)^q, where q is the degree of the polynomial, and the Gaussian kernel exp(−γ‖x − x_i‖²), where the feature vector x is of dimension d. Let the kernel gram matrix be K, where the (i, j)-th element is K(x_i, x_j). The SVM makes use of the kernel trick, so that explicit computation of the transformation is not required for computing the kernel gram matrix of the training data.
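Computing the kernel gram matrix for the three kernels above can be sketched as follows (an illustration; the function name and the parameter names q and gamma are ours):

```python
import numpy as np

def gram_matrix(X, kernel="gaussian", q=2, gamma=0.5):
    """Kernel gram matrix K with K[i, j] = k(x_i, x_j) for the three
    Mercer kernels mentioned in the text."""
    if kernel == "linear":
        return X @ X.T                      # x_i^T x_j
    if kernel == "poly":
        return (X @ X.T + 1.0) ** q         # (x_i^T x_j + 1)^q
    if kernel == "gaussian":
        sq = np.sum(X ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # ||x_i - x_j||^2
        return np.exp(-gamma * d2)
    raise ValueError(f"unknown kernel: {kernel}")
```

The kernel trick is visible here: only inner products or distances between input patterns are needed, never the high dimensional feature mapping itself.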
The quadratic optimization problem for the soft margin SVM can be expressed entirely in terms of the kernel gram matrix as follows [9]: maximize

Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j K(x_i, x_j)

subject to the constraints

Σ_{i=1}^{N} α_i d_i = 0 and 0 ≤ α_i ≤ C for i = 1, 2, ..., N

where C is a user defined parameter and α_1, ..., α_N are the Lagrange multipliers corresponding to the inputs x_1, ..., x_N. Solving the above optimization problem yields the optimum α_i. The optimum weight vector w_o and optimum bias b_o can then be computed to obtain the equation of the maximum margin decision surface w_oᵀx + b_o = 0.

An arbitrary test example x is classified with the help of the decision rule

f(x) = sgn( Σ_{i=1}^{N} α_i d_i K(x_i, x) + b_o )
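Given the multipliers, labels and support vectors produced by a trainer, the decision rule can be sketched as follows (a toy illustration with a Gaussian kernel and hand-picked values, not the internal code of any SVM package):

```python
import numpy as np

def svm_decision(x, support_vectors, alphas, labels, b, gamma=0.5):
    """Decision rule f(x) = sgn(sum_i alpha_i d_i k(x_i, x) + b) with a
    Gaussian kernel k; alphas, labels and support_vectors are assumed
    to come from solving the dual problem."""
    k = np.exp(-gamma * np.sum((support_vectors - x) ** 2, axis=1))
    return 1 if float(np.dot(alphas * labels, k) + b) >= 0.0 else -1
```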
SVMTorch [8]: This is a freely available tool for SVM classification. It can also be used for regression problems. With this algorithm one can efficiently solve large scale problems containing tens of thousands of patterns, even with input dimensions higher than 100. It is also suitable for multiclass problems. SVMTorch provides several options for the user to select the type of kernel, the kernel parameters, etc. SVMTorch is the training machine and SVMTest is the testing machine. The command line used for training is:

./SVMTorch [options] example_file model_file
The command line used for testing follows the same pattern, with SVMTest taking the trained model file and a test file as arguments.

The open source utility Wndchrm [2,3] is a powerful tool for biological image analysis that can be used on a wide range of biological datasets. Wndchrm has an in-built image classifier based on a variation of nearest neighbour classification. In this classifier, the distance of a pattern x extracted from a test image to a class C is computed as follows [2,5]:

d(x, C) = (1/|T_C|) Σ_{t ∈ T_C} [ (1/|x|) Σ_j W_j (x_j − t_j)² ]^p

where T_C is the training set of class C, t is a feature vector from T_C, |x| is the dimension of the feature vector x, x_j is the value of feature j in the vector x, W_j is the Fisher score of feature j, |T_C| is the number of training examples of class C, and p is the exponent, which is set to −5; this value has been empirically determined [2,3].
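The wndchrm class distance can be sketched in numpy as follows (an illustration of the formula, not the actual wndchrm implementation; the function name is ours). Note that with p = −5, nearby training samples dominate the sum, so a larger value indicates that the class is closer to the test pattern.

```python
import numpy as np

def wnd_distance(x, T_c, fisher_w, p=-5):
    """d(x, C) = (1/|T_C|) * sum_{t in T_C} [ (1/|x|) * sum_j W_j (x_j - t_j)^2 ]^p.
    T_c: (|T_C|, d) training vectors of class C; fisher_w: Fisher score W_j
    of each feature. With p = -5, a LARGER value means a closer class."""
    per_sample = np.sum(fisher_w * (T_c - x) ** 2, axis=1) / x.size
    return float(np.mean(per_sample ** p))
```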

Experimental Studies
We conducted experiments with the IICBU-2008 benchmark dataset [7]. We used an Intel Pentium i3 core processor with 2 GB RAM and a 500 GB hard disk, with a 2.3 GHz clock speed. The features are extracted using the wndchrm tool. The command used to extract features in wndchrm is [2]: wndchrm train [options] images feature_file. The resulting image feature values are stored in feature_file, images is a path to the top folder that contains all the images of the data set, and the [options] field is for specifying optional switches. Images belonging to different classes can be kept in separate directories inside the top folder. Wndchrm supports the tiff and ppm image file formats.
We implemented and analyzed three variations of Bayesian classifiers.These variations are based on the approximations of covariance matrix of the training set.We named these classifiers as Bayesian-1, Bayesian-2 and Bayesian-3 respectively.The classification accuracies for these classifiers were experimentally determined for various sets of feature vectors.We implemented the SVM classifier using the tool SVM Torch [8].
The IICBU-2008 benchmark dataset [7] includes biological images of different subjects such as organelles, cells, tissues, and full organisms, taken at different magnifications and with different types of microscopy [5]. The details of the data sets are given in Table I. An indicative selection of images from the IICBU-2008 dataset is given in Figure 2. From Table II we observe that feature set 2 (transform features) gives fairly good classification accuracy compared with the other feature sets for the Bayesian-2 classifier. It is also observed that the SVM classifier has better classification accuracy than the Bayesian classifiers. Tables IV and V give details of the performance of the Bayesian-2 and SVM classifiers for the reduced sets of features obtained using the PCA and Fisher score based methodologies, respectively, for the Cho dataset. It can be seen that the classification accuracies of the Bayesian-2 and SVM based classifiers remain almost stable under considerable reduction of the number of features representing the images. This is in fact a pleasant surprise from the point of view of the design of a CBIR system for biological images. The images can be represented in the CBIR database in terms of compact, most informative and discriminant features, thereby reducing the computational complexity. This also helps to improve the retrieval speed of the CBIR system. It is observed that we can obtain a high degree of reduction in the number of features using the PCA and Fisher score based methods. This will surely reduce the computational overhead and increase the speed and overall performance of CBIR systems.

We analyzed the performance of the Bayesian-2 and SVM classifiers for the data sets belonging to the IICBU-2008 biological image data set. We used the Gaussian kernel for the SVM implementation. We also compared the results with the available results of the wndchrm classifier [2]. The results are shown in Figure 3. It is observed that the SVM classifier outperforms the other classifiers.

Conclusions
We compared the classification accuracies of the Bayesian, SVM and Wndchrm classifiers for biological image data sets. It was observed that the SVM classifier with a Gaussian kernel yielded the highest classification accuracy. We experimentally verified that by extracting subsets of features from the canonical set of feature vectors and applying PCA and Fisher score based feature selection, one can speed up the retrieval process of a CBIR system without much degradation in performance.

Figure 3. Performance analysis of the Bayesian-2, SVM and wndchrm classifiers for datasets belonging to the IICBU-2008 biological image dataset.

TABLE III. COMPARISON OF CLASSIFICATION ACCURACIES FOR DIFFERENT SETS OF FEATURE VECTORS FOR THE SVM CLASSIFIER.

TABLE IV. VARIATION OF CLASSIFICATION ACCURACIES FOR THE BAYESIAN-2 AND SVM CLASSIFIERS WITH REDUCTION IN NUMBER OF FEATURES USING PCA FOR THE CHO DATASET.