On Comparing Feature Reduction Techniques for Accuracy Improvement of the k-NN Classification

The aim of this paper is to perform a comparative study of feature reduction techniques that are most appropriate for classification with k-nearest neighbor, tested on medical data. Medical data are normally high-dimensional in nature, and this high dimensionality can degrade the performance of the classification process. In this work, we apply various feature reduction techniques, implemented in Matlab, to decrease the dimensionality of the data before the k-nearest neighbor classification step. The experimental results show that the best performance is obtained by using the PCA algorithm to reduce the features of the data. The comparison in terms of accuracy shows that the PCA and ROC feature reduction techniques can improve classification prediction, whereas t-test feature reduction has very limited effect on classification accuracy.


Introduction
Feature reduction is a step of data preprocessing and is important in pattern recognition and for high-dimensional data. These techniques can remove attributes without affecting the performance of the algorithm. Indeed, some attributes may degrade the performance of the algorithm because they are not related to the class label of the data. We use feature reduction techniques to remove such irrelevant attributes.
k-Nearest Neighbor is among the most popular algorithms in the fields of pattern recognition and machine learning. The algorithm is based on supervised learning: it classifies a sample by finding its k nearest training samples.
In this paper, we apply feature reduction techniques, namely PCA, t-test, and ROC, as a first step to improve the effectiveness of classification with the k-nearest neighbor algorithm. The experimental results compare the accuracy, error, and number of attributes obtained with each technique. Because medical data have a very high number of attributes, they are suitable test data for this evaluation.

Related works
Feature reduction is important in data preprocessing and is widely used because it can extract the important part of the data. In related work, Mauricio Villegas and Roberto Paredes (1) used dimension reduction with LDPP on UCI data and estimated the error of k-NN; they concluded that LDPP can optimize k-NN. Phattrawut Sangsiri et al. (2) compared the performance of dimension reduction with PCA and BSFS as input to an artificial neural network for predicting cancer data; the BSFS technique proved more appropriate as neural network input. Deqing Wang et al. (3) used a t-test algorithm for feature selection, then tested the performance of k-NN and SVM on text data. D.A. Adeniyi et al. (4) presented automatic web usage data mining with k-nearest neighbor; their classification method was trained to identify client/visitor click-stream data on-line and in real time. Yi-Hung Kung et al. (5) found that the asymptotically optimal linear combination of nearest neighbors for density estimation is just the last term of the linear combination. Thananan Prasartvit et al. (6) proposed an improved method for data dimensionality reduction, called ABC-kNN, which uses the wrapper technique for classification. Principal Component Analysis is used to reduce the features of high-dimensional data and is the most popular multivariate statistical technique. It is a linear component analysis (7): the most important components are selected, and the number of selected components is less than or equal to the number of original components. The algorithm is based on the eigenvectors and eigenvalues of the covariance matrix.

Feature Reduction and Classification
(a) Principal Component Analysis (PCA)
Fig. 1 shows a plot of the eigenvectors of the covariance matrix (blue lines) together with all data points before feature reduction, and Fig. 2 shows the data after the features are reduced with PCA.
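To illustrate the idea, the first principal component (the dominant eigenvector of the covariance matrix) can be found by power iteration, and the data projected onto it. The following is a minimal Python sketch (the paper's experiments use Matlab; the toy two-feature data here are hypothetical, and only the first component is extracted):

```python
import math

def pca_first_component(data, iters=200):
    """Return the first principal component (unit eigenvector of the
    sample covariance matrix) found by power iteration, plus the means."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # sample covariance matrix
    cov = [[sum(centered[i][a] * centered[i][b] for i in range(n)) / (n - 1)
            for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, means

def project(data, v, means):
    """Project each centered sample onto the component -> 1-D reduced data."""
    return [sum((row[j] - means[j]) * v[j] for j in range(len(v)))
            for row in data]

# hypothetical correlated 2-feature data
points = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
          [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]]
v, means = pca_first_component(points)
reduced = project(points, v, means)
```

In a real pipeline one would keep as many components as needed to preserve most of the variance, not only the first.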

(b) T-test
The t-test (8) is a statistical test that compares the difference of the mean values of two groups. We compute the t0 value of two feature groups from equation (1), compute the degrees-of-freedom (df) value, and define rho. Looking up the df value and rho in the t-table gives the critical t value. A feature is then kept or removed by checking whether its t0 value differs from the critical t value. The t-test assumes a normal data distribution.
There are three types: the one-sample t-test, the independent-sample t-test, and the paired-sample t-test. For the independent-sample case,

t0 = (X̄_A − X̄_B) / (s_p · sqrt(1/n_A + 1/n_B))    (1)

where X̄_A is the mean value of sample A, X̄_B is the mean value of sample B, s_p² is the pooled sample variance, and n_A and n_B are the sample sizes, with df = n_A + n_B − 2 degrees of freedom.
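The t0 computation of equation (1) and the resulting feature filter can be sketched in Python as follows (a hedged illustration: the critical value `t_critical` would come from the t-table for the chosen df and rho, and the toy data are hypothetical):

```python
import math

def t_statistic(a, b):
    """Pooled two-sample t statistic, matching the form of Eq. (1)."""
    n_a, n_b = len(a), len(b)
    m_a, m_b = sum(a) / n_a, sum(b) / n_b
    ss_a = sum((x - m_a) ** 2 for x in a)
    ss_b = sum((x - m_b) ** 2 for x in b)
    s_p = math.sqrt((ss_a + ss_b) / (n_a + n_b - 2))  # pooled std dev
    return (m_a - m_b) / (s_p * math.sqrt(1 / n_a + 1 / n_b))

def select_features(X, y, t_critical):
    """Keep indices of features whose |t0| between the two class groups
    exceeds the critical t value looked up in the t-table."""
    kept = []
    for j in range(len(X[0])):
        g0 = [row[j] for row, label in zip(X, y) if label == 0]
        g1 = [row[j] for row, label in zip(X, y) if label == 1]
        if abs(t_statistic(g0, g1)) > t_critical:
            kept.append(j)
    return kept

# hypothetical data: feature 0 separates the classes, feature 1 is noise
X = [[0.0, 5.0], [0.1, 4.9], [0.2, 5.1], [5.0, 5.0], [5.1, 4.9], [5.2, 5.1]]
y = [0, 0, 0, 1, 1, 1]
kept = select_features(X, y, t_critical=2.776)  # t-table value for df=4, rho=0.05
```

Features failing the test are dropped before classification.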
(c) The Receiver Operating Characteristic Curve (ROC)
The ROC measures class discrimination by plotting the relationship between the true positive rate and the false positive rate. The method measures the overlap of the data distributions of two groups.
Fig. 3 shows the ROC curves, with their corresponding AUC values, of the data before feature reduction; the AUC of some features is close to 0.5, the value of the diagonal (chance) line. After the features are reduced with the ROC algorithm, the AUC values of the retained features are well above 0.5, as shown in Fig. 4.
For example, Table 1 shows sample data consisting of 4 instances, 3 features, and a class label, where each instance is numeric. The data are brought to the preprocessing step, where the ROC algorithm reduces the features; after reduction, 2 features remain, as shown in Table 2. The features in Table 2 are the ones important to the class label of the data.
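The AUC of a single feature can be computed directly from its rank interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. A minimal Python sketch follows, with a hypothetical selection threshold that is not taken from the paper:

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney formulation: P(positive score > negative
    score), counting ties as one half."""
    pos = [s for s, c in zip(scores, labels) if c == 1]
    neg = [s for s, c in zip(scores, labels) if c == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def roc_select(X, y, threshold=0.7):
    """Keep features whose AUC is far from the 0.5 diagonal.
    (threshold=0.7 is a hypothetical cut-off for illustration.)"""
    kept = []
    for j in range(len(X[0])):
        a = auc([row[j] for row in X], y)
        if max(a, 1.0 - a) >= threshold:
            kept.append(j)
    return kept

# hypothetical data: feature 0 discriminates, feature 1 is constant
X = [[0, 5], [1, 5], [8, 5], [9, 5]]
y = [0, 0, 1, 1]
kept = roc_select(X, y)
```

A feature with AUC near 0.5 carries almost no class information, whether its AUC is high or low on either side of the diagonal.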

k-Nearest Neighbor Classification Algorithm
The algorithm is based on supervised learning (9). It finds the k nearest samples in the training data and identifies the class of a new sample by majority vote among those neighbors.
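A minimal Python sketch of the majority-vote rule (illustrative only; the paper's implementation is in Matlab, and the sample data here are hypothetical):

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, x, k=3):
    """Predict the class of x by majority vote among the k training
    samples nearest to x in Euclidean distance."""
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# hypothetical two-cluster training data
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["a", "a", "a", "b", "b", "b"]
```

With fewer, more relevant features the distance computation becomes both cheaper and less affected by irrelevant dimensions, which is the motivation for the reduction step.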

Dataset
The datasets in the experiment are from the UCI repository (10). We used 5 datasets: Breast Cancer Wisconsin, Parkinsons, Spect, Pima, and Lungcancer. The datasets are high-dimensional numeric data. The number of features of each dataset is shown in Table 3.

Experimental Results
The method was implemented in Matlab. First, we classified each dataset with k-nearest neighbor to obtain the baseline classification performance on each dataset. The results are shown in Table 4.

Classification Results over full-feature data
Table 4 shows the error and accuracy values of the classification, where #features is the number of features used. The best accuracy obtained is 0.905, on Breast Cancer Wisconsin, using 30 features.
Next, we used the dimension reduction techniques to reduce the features of each dataset in order to improve classification performance. Each method is applied in the data preparation step, before classifying with k-nearest neighbor.

Classification Results with PCA Feature Reduction Technique
Table 5 shows the error and accuracy values of the classification, where #features is the number of features after reduction with the PCA algorithm. With PCA, the number of features is smaller than in Table 4 and the accuracy values increase.

Classification Results with T-test Feature Reduction Technique
- rho = 0.0001: Table 6 shows the error and accuracy values of the classification, where #features is the number of features after reduction with the t-test algorithm with the rho parameter set to 0.0001.
- rho = 0.05: Table 7 shows the error and accuracy values of the classification, where #features is the number of features after reduction with the t-test algorithm with the rho parameter set to 0.05.
From Tables 6 and 7, the results are similar, differing only in the number of features of the Lungcancer dataset and in the accuracy values, due to the different rho values. The rho value is the significance level used in the hypothesis test; the commonly used value is 0.05.

Classification Results with ROC Feature Reduction Technique
Table 8 shows the error and accuracy values of the classification, where #features is the number of features after reduction with the ROC algorithm. Comparing the results of the feature reduction techniques, the best technique for reducing the features of the data is the PCA algorithm. PCA can remove irrelevant features and improve the accuracy values. The experiments show that PCA improves performance on most datasets: it yields the best classification on four of the five datasets (about 80%), while the remaining dataset, Breast Cancer Wisconsin (about 20%), is improved by the ROC algorithm.

Conclusions
This paper presented a performance comparison of feature reduction techniques applied prior to classification with k-nearest neighbor. We used the PCA, t-test, and ROC algorithms to reduce features in the data preparation step. The performance comparison shows that using the PCA algorithm to reduce features can increase accuracy while removing the largest number of features. PCA improved accuracy on most datasets: Breast Cancer Wisconsin from 0.925 to 0.9085, Parkinsons from 0.8319 to 0.8352, Spect from 0.6643 to 0.7990, Pima from 0.7083 to 0.7565, and Lungcancer from 0.5333 to 0.6667. We plan to improve PCA-based feature reduction to be more appropriate for medical diagnosis in terms of the model's understandability.

Fig. 3. A sample plot of the data with the ROC curve.

Fig. 1. A sample of the data in the experiment with PCA.

Fig. 2. A sample of the data reduced with PCA.

Fig. 4. A sample ROC plot with reduced features.

Figs. 7 and 8 show charts summarizing all the experimental result tables, where number 1 is the Breast Cancer Wisconsin dataset, 2 is Parkinsons, 3 is Spect, 4 is Pima, and 5 is Lungcancer. Fig. 7 compares the accuracy values of each method: first, classification with k-NN using the original number of features, then after the features are reduced with PCA, t-test (T-test_1 uses rho 0.0001 and T-test_2 uses rho 0.05), and ROC. Fig. 7 shows that using PCA improves the classification accuracy by an amount greater than or equal to the other techniques. Fig. 8 compares the number of features: k-NN uses all features, versus the reduced feature sets of each reduction technique. Fig. 8 shows that PCA reduces the number of features the most.

Table 3. The number of features of each dataset.

Table 1. A sample of the data in the experiment.

Table 2. A sample of the data after feature reduction.

Fig. 6 shows all the steps of the workflow: first, the dataset is prepared; second, a dimension reduction technique is applied to it; next, the data are classified with k-nearest neighbor; and finally the result is evaluated with cross validation.

Table 4. The result of the k-NN classifier without any feature reduction.

Table 7. The classification results with t-test feature reduction (rho = 0.05).