An Empirical Study of Distance Metrics for k-Nearest Neighbor Algorithm

This research studies the performance of k-nearest neighbor classification under different distance measurements. We comparatively study 11 distance metrics: Euclidean, Standardized Euclidean, Mahalanobis, City block, Minkowski, Chebychev, Cosine, Correlation, Hamming, Jaccard, and Spearman. A series of experiments has been performed on eight synthetic datasets with various kinds of distribution. The distance computations that provide highly accurate predictions are the City block, Chebychev, Euclidean, Mahalanobis, Minkowski, and Standardized Euclidean techniques.


Introduction
Data mining is the extraction of knowledge hidden in data, and it is often performed on large datasets. The knowledge obtained from data mining has been used in various fields, for example to predict future situations, to assist in medical diagnosis, and to forecast chronological relations.
Current data mining methodology is divided into several tasks, such as classification, clustering, and association mining. Each data mining task has a different purpose. The classification task tries to build a model that classifies future examples with high accuracy, for instance distinguishing patients with heart disease from those who are healthy. The clustering task tries to group data such that data in the same group are similar, whereas they are dissimilar to data in other groups. The association mining task tries to find rules that represent relations between data, with some support and confidence values.
The classification task of data mining can be done with many algorithms, such as k-nearest neighbor. Beyer (1) explained the significance and origin of the nearest neighbor. Cover (2) used k-nearest neighbor to classify data. Dudani (3) studied the weighting of distance values in k-nearest neighbor. Fukunaga (4) developed techniques for running k-nearest neighbor faster. Keller (5) developed a new algorithm named "Fuzzy K-Nearest Neighbor", based on k-nearest neighbor, for use with fuzzy tasks. Köhn (6) used the city-block distance metric to increase the performance of the k-nearest neighbor algorithm.
This research also studies classification, with a specific interest in the k-nearest neighbor algorithm. We aim to analyze the performance of different distance metrics in order to choose a metric that yields good classification performance. We use 8 synthetic datasets with different distributions; the dataset for each distribution has 2 classes, with a different amount of data in each class. This allows us to test the impact of the amount of data in each class on classification performance.
The rest of this paper is organized as follows: Section 2 gives details of the k-nearest neighbor algorithm and the computation of each distance metric. Section 3 gives details of our proposed method. The experimental results and analysis are presented in Section 4. Finally, the research is concluded in Section 5.

k-Nearest Neighbor
The k-nearest neighbor is a supervised learning algorithm: it requires labeled training data and a predefined value of k, and it finds the k training points nearest to an unknown datum based on a distance computation. If the k points belong to different classes, the algorithm predicts the class of the unknown datum to be the majority class among them. Fig. 1 illustrates how the algorithm, using the Euclidean distance metric, finds the class of a new datum.
Fig. 1 shows the classification of iris data. The point to be classified is (5, 1.45), marked with "X". Applying the k-nearest neighbor algorithm with k = 8 and the Euclidean distance selects the points inside the dotted circle. Two classes appear among these neighbors: virginica with two instances and versicolor with six instances. The algorithm therefore assigns "X" to the versicolor class, because versicolor is the majority class within the radius.
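The majority-vote rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments; the function name `knn_predict` and the toy points are hypothetical and are not the iris measurements of Fig. 1.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Return the majority label among the k training points closest to query."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k smallest distances; sorted() is stable, so ties keep input order.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups labeled "a" and "b".
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.0), (3.2, 2.9), (2.8, 3.1)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (1.1, 1.0), k=3))
```

Swapping `math.dist` for another distance function is the only change needed to study a different metric with the same voting rule.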

Distance Metrics
A distance metric measures the distance between a new data point and the points of the existing training dataset. In this research, we experiment with 11 distance metrics. Given an mx-by-n data matrix X, which is treated as mx (1-by-n) row vectors x1, x2, ..., xmx, and an my-by-n data matrix Y, which is treated as my (1-by-n) row vectors y1, y2, ..., ymy, the various distances between the vectors xs and yt are defined as follows.

Euclidean Distance
The Euclidean distance is the straight-line distance between two points, defined by Eq. (1):
\[ d_{st}^2 = (x_s - y_t)(x_s - y_t)' \tag{1} \]
The Euclidean distance is a special case of the Minkowski metric, where p = 2.

Standardized Euclidean Distance
The standardized Euclidean distance rescales each coordinate difference before computing the Euclidean distance, defined by Eq. (2):
\[ d_{st}^2 = (x_s - y_t) V^{-1} (x_s - y_t)' \tag{2} \]
where V is the n-by-n diagonal matrix whose jth diagonal element is S(j)^2, and S is the vector containing the inverse weights.
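As a small illustration of the scaling, the sketch below divides each coordinate difference by a per-feature scale before summing. It assumes that S(j) is the sample standard deviation of feature j, which is a common choice but is not stated in the text; the function name and the data are hypothetical.

```python
import math
import statistics

def standardized_euclidean(x, y, scale):
    """sqrt(sum_j ((x_j - y_j) / scale_j)^2), where scale plays the role of S."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, scale)))

# Hypothetical data: the second feature has a much larger spread than the first.
data = [(1.0, 100.0), (2.0, 300.0), (3.0, 500.0)]
# Assumption: S(j) is the sample standard deviation of feature j.
scale = [statistics.stdev(column) for column in zip(*data)]

print(scale)
print(standardized_euclidean(data[0], data[2], scale))
```

After scaling, both features contribute equally to the distance between the first and last rows, whereas the plain Euclidean distance would be dominated by the second feature.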

Mahalanobis Distance
The Mahalanobis distance measures the distance between a point and a distribution of data, defined by Eq. (3):
\[ d_{st}^2 = (x_s - y_t) C^{-1} (x_s - y_t)' \tag{3} \]
where C is the covariance matrix.
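The role of the covariance matrix can be seen in a two-dimensional sketch. The helper below is hypothetical and restricted to 2-by-2 covariance matrices so that the inverse can be written in closed form; a general implementation would use a linear algebra library.

```python
import math

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance for 2-D points given a 2x2 covariance matrix C."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form (x - y) C^{-1} (x - y)'.
    squared = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(squared)

identity = ((1.0, 0.0), (0.0, 1.0))   # reduces to the Euclidean distance
stretched = ((4.0, 0.0), (0.0, 1.0))  # variance 4 along the first axis
print(mahalanobis_2d((2.0, 0.0), (0.0, 0.0), identity))
print(mahalanobis_2d((2.0, 0.0), (0.0, 0.0), stretched))
```

With the identity matrix the result equals the Euclidean distance; with a larger variance along the first axis, the same displacement counts as a shorter distance.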

City Block Distance
The city block distance between two points is the sum of the absolute differences of their Cartesian coordinates, defined by Eq. (4):
\[ d_{st} = \sum_{j=1}^{n} |x_{sj} - y_{tj}| \tag{4} \]
The city block distance is a special case of the Minkowski metric, where p = 1.

Minkowski Distance
The Minkowski distance generalizes the Euclidean distance through an exponent p, defined by Eq. (5):
\[ d_{st} = \left( \sum_{j=1}^{n} |x_{sj} - y_{tj}|^p \right)^{1/p} \tag{5} \]
For the special case p = 1, the Minkowski metric gives the city block distance; for p = 2, it gives the Euclidean distance; and for p = ∞, it gives the Chebychev distance.
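The special cases named above can be checked numerically. The following is a minimal sketch with a hypothetical function name; the case p = ∞ is handled as the limit, which is the maximum absolute coordinate difference.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p; p = math.inf gives the Chebychev limit."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

x, y = (0.0, 0.0, 0.0), (3.0, 4.0, 0.0)
print(minkowski(x, y, 1))         # city block: 3 + 4 = 7.0
print(minkowski(x, y, 2))         # Euclidean: sqrt(9 + 16) = 5.0
print(minkowski(x, y, math.inf))  # Chebychev: max(3, 4) = 4.0
```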

Chebychev Distance
The Chebychev distance between two vectors or points is the maximum absolute difference over their coordinates, defined by Eq. (6):
\[ d_{st} = \max_{j} |x_{sj} - y_{tj}| \tag{6} \]
The Chebychev distance is a special case of the Minkowski metric, where p = ∞.

Cosine Distance
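The cosine distance is one minus the cosine of the included angle between points (treated as vectors), defined by Eq. (7):
\[ d_{st} = 1 - \frac{x_s y_t'}{\sqrt{(x_s x_s')(y_t y_t')}} \tag{7} \]

Correlation Distance
The correlation distance is one minus the sample correlation between two vectors, a measure of their statistical dependence, defined by Eq. (8):
\[ d_{st} = 1 - \frac{(x_s - \bar{x}_s)(y_t - \bar{y}_t)'}{\sqrt{(x_s - \bar{x}_s)(x_s - \bar{x}_s)'}\,\sqrt{(y_t - \bar{y}_t)(y_t - \bar{y}_t)'}} \tag{8} \]
where \bar{x}_s = \frac{1}{n}\sum_{j} x_{sj} and \bar{y}_t = \frac{1}{n}\sum_{j} y_{tj}.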
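The cosine and correlation distances are closely related: the correlation distance is the cosine distance applied to mean-centered vectors. The sketch below illustrates this relationship; the function names and sample vectors are hypothetical, and all-zero or constant vectors (which would cause a division by zero) are not handled.

```python
import math

def cosine_distance(x, y):
    """One minus the cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def correlation_distance(x, y):
    """Cosine distance between the mean-centered versions of x and y."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_distance([a - mean_x for a in x], [b - mean_y for b in y])

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # same direction as x, so the cosine distance is near 0
z = [3.0, 2.0, 1.0]  # perfectly anti-correlated with x, so the distance is near 2
print(cosine_distance(x, y))
print(correlation_distance(x, z))
```

Both distances ignore vector magnitude, which distinguishes them from the Minkowski family.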

Hamming Distance
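The Hamming distance is the proportion of coordinates that differ, defined by Eq. (9):
\[ d_{st} = \frac{\#(x_{sj} \neq y_{tj})}{n} \tag{9} \]

Jaccard Distance
The Jaccard distance is one minus the Jaccard coefficient, which is the proportion of nonzero coordinates that differ, defined by Eq. (10):
\[ d_{st} = \frac{\#[(x_{sj} \neq y_{tj}) \cap ((x_{sj} \neq 0) \cup (y_{tj} \neq 0))]}{\#[(x_{sj} \neq 0) \cup (y_{tj} \neq 0)]} \tag{10} \]

Spearman Distance
The Spearman distance is one minus the sample Spearman's rank correlation between observations, defined by Eq. (11):
\[ d_{st} = 1 - \frac{(r_s - \bar{r}_s)(r_t - \bar{r}_t)'}{\sqrt{(r_s - \bar{r}_s)(r_s - \bar{r}_s)'}\,\sqrt{(r_t - \bar{r}_t)(r_t - \bar{r}_t)'}} \tag{11} \]
where r_s and r_t are the rank vectors of x_s and y_t, and \bar{r}_s and \bar{r}_t are their means.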
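To make the Hamming and Jaccard distances concrete, the following is a minimal Python sketch with hypothetical function names and data. Treating two all-zero vectors as identical in the Jaccard case is an assumption made here to avoid a division by zero.

```python
def hamming_distance(x, y):
    """Fraction of coordinates that differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def jaccard_distance(x, y):
    """Fraction of differing coordinates among those where x or y is nonzero."""
    nonzero = [(a, b) for a, b in zip(x, y) if a != 0 or b != 0]
    if not nonzero:
        return 0.0  # assumption: two all-zero vectors are treated as identical
    return sum(a != b for a, b in nonzero) / len(nonzero)

binary_x = [1, 0, 1, 1, 0]
binary_y = [1, 1, 0, 1, 0]
print(hamming_distance(binary_x, binary_y))  # 2 of 5 coordinates differ
print(jaccard_distance(binary_x, binary_y))  # 2 of the 4 nonzero coordinates differ

# Hypothetical real-valued points that are close but never exactly equal.
real_x = [0.31, 1.72, 2.05]
real_y = [0.29, 1.70, 2.04]
print(hamming_distance(real_x, real_y))
```

One observation, offered as a possible reading rather than a result of this study: on real-valued features, two coordinates are almost never exactly equal, so both distances evaluate to 1 for nearly every pair of points and give little information for ranking neighbors.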

Empirical Study Methodology
In this section, we present our study framework, which uses the k-nearest neighbor algorithm with various distance metrics. The framework is shown in Fig. 2, and each step can be explained as follows:
Step 1: Generate binary datasets with different distributions and different amounts of data in each class. Then split the data into approximately 70% for the training set and 30% for the test set, which is used to test the classification performance.
Step 2: Classify the data from Step 1 by applying the k-nearest neighbor algorithm, using each distance metric in turn to find the k nearest data points.
Step 3: Analyze the results and draw conclusions about the classification performance of the various distance metrics.
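The three steps can be sketched end to end in Python. This is a small-scale illustration under stated assumptions, not the MATLAB procedure used in the study: the Gaussian data generator, the value k = 5, the sample size, and all function names are hypothetical, and only three of the eleven metrics are included.

```python
import math
import random
from collections import Counter

def make_two_class_data(n_per_class, rng):
    """Hypothetical generator: two Gaussian blobs in 2-D, labels 0 and 1."""
    data = []
    for label, center in ((0, 0.0), (1, 4.0)):
        for _ in range(n_per_class):
            data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data

def split_70_30(data, rng):
    """Step 1: shuffle, then keep 70% for training and 30% for testing."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = (7 * len(shuffled)) // 10
    return shuffled[:cut], shuffled[cut:]

def city_block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def knn_accuracy(train, test, k, metric):
    """Step 2: classify each test point by majority vote and score the result."""
    correct = 0
    for query, truth in test:
        nearest = sorted(train, key=lambda item: metric(item[0], query))[:k]
        predicted = Counter(label for _, label in nearest).most_common(1)[0][0]
        correct += predicted == truth
    return correct / len(test)

rng = random.Random(0)
train, test = split_70_30(make_two_class_data(100, rng), rng)
# Step 3: compare accuracy across metrics with everything else held fixed.
metrics = (("euclidean", math.dist), ("cityblock", city_block), ("chebychev", chebychev))
for name, metric in metrics:
    print(name, knn_accuracy(train, test, k=5, metric=metric))
```

Holding the data split and k fixed while changing only the `metric` argument mirrors the design in which the distance measurement is the only varied factor.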

Datasets
For our experiment, the proposed framework has been applied to classify binary synthetic datasets. We generate eight synthetic datasets with four different kinds of distribution; each dataset has two classes, and the amount of data in each class is varied. Each dataset has 5000 instances in total and three features. We use MATLAB to generate the synthetic datasets. Details of the synthetic datasets are given in Table 1, and Fig. 3 gives an overview of them.
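A dataset of the stated shape (5000 instances, three features, two classes with an adjustable class ratio) could be generated as in the sketch below. This is an illustration in Python rather than the MATLAB code used in the study; the Gaussian distribution, the class means, the 80/20 ratio, and the function name are all assumptions, since the actual settings are given in Table 1.

```python
import random

def make_dataset(n_total, class1_fraction, seed):
    """Hypothetical generator: 3 features, 2 classes, adjustable class ratio."""
    rng = random.Random(seed)
    n_class1 = round(n_total * class1_fraction)
    rows = []
    for i in range(n_total):
        label = 1 if i < n_class1 else 2
        # Assumption: Gaussian features with a class-dependent mean.
        mean = 0.0 if label == 1 else 2.0
        features = tuple(rng.gauss(mean, 1.0) for _ in range(3))
        rows.append((features, label))
    return rows

balanced = make_dataset(5000, 0.5, seed=1)    # equal class sizes
imbalanced = make_dataset(5000, 0.8, seed=1)  # hypothetical 80/20 ratio
print(sum(label == 1 for _, label in balanced))
print(sum(label == 1 for _, label in imbalanced))
```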

Experimental Results
The results of the proposed study framework for the eight synthetic datasets are shown in Figs. 4 and 5. The classification was performed with the same algorithm (k-nearest neighbor) and the same parameter settings; the only factor varied is the distance measurement. The Hamming and Jaccard distance metrics perform badly on 4 of the 8 datasets.

Conclusions
This research reported the accuracy of the k-nearest neighbor classification algorithm with different distance metrics. Experiments were performed on eight synthetic datasets generated with MATLAB. The synthetic datasets have four distributions and were split into 70% for the training set and 30% for the test set. On the datasets in which the amount of data in each class is equal, the Hamming and Jaccard techniques have low accuracy, while the other distance computation techniques have similar accuracy to one another. On the datasets in which the amount of data in each class differs (datasets 2, 4, 6, and 8), the Hamming and Jaccard techniques achieve higher classification accuracy. We can conclude that the Hamming and Jaccard techniques are affected by the ratio of members in each class, while the other techniques are not affected by this phenomenon.

Fig. 3. Distribution of the eight synthetic datasets; each one has four kinds of distribution.

Table 1. Details of synthetic datasets.
The highest accuracy in classifying data with k-nearest neighbor is obtained from six distance metrics: City block, Chebychev, Euclidean, Mahalanobis, Minkowski, and Standardized Euclidean.