Clustering Validity with Silhouette and Sum of Squared Errors

Data clustering with automatic algorithms such as k-means is a popular technique widely used in many applications. Two sub-activities of the clustering process are studied in this paper: selecting the number of clusters and analyzing the result of the clustering. This research studies clustering validation to find the appropriate number of clusters for the k-means method. The experimental data comprise 3 shapes, each with 4 datasets (100 items each), whose diffusion is achieved by applying a Gaussian (normal) distribution. Two techniques are used for clustering validation: Silhouette and Sum of Squared Errors (SSE). The research shows comparative results on data clustering for k from 2 to 10. The results of Silhouette and SSE are consistent, in the sense that both indicate the appropriate number of clusters at the same k value (Silhouette: maximum average value; SSE: knee point).


Introduction
Clustering groups data. Although clustering is similar to data classification in terms of data input, clustering is learning without a target class. A clustering algorithm forms groups based on object similarities (1). Clustering has been applied in many fields such as bioinformatics, genetics, image processing, speech recognition, market research, document classification, and weather classification (2). In addition, clustering has been applied to document data analysis, one form of big data learning (3)(4)(5)(6)(7).
There are various algorithms for data clustering, but the most popular is the k-means algorithm. k-means is very simple in operation, suitable for unraveling compact clusters, and a fast iterative algorithm (8). The principle of k-means is to divide the n objects of a dataset into k clusters using a center-based clustering method (2); in addition, each cluster is represented by the mean of its objects (8). Although k-means is a popular technique, it does not know the correct number of clusters a priori. Consequently, the main challenge for these clustering methods is determining the number of clusters (2). In general, the number of clusters is set by users or taken from prior research (1,(9)(10)(11)).
Fig. 1 shows the distribution of each cluster when k=3 and k=4. As figs. 1a and 1b show, determining a suitable k value is not straightforward. Given this problem, there is a variety of research on selecting an appropriate number of clusters (10)(11)(12)(13). Each proposed technique suits a particular data distribution, such as Gaussian or non-Gaussian (14). Finding the correct k value for clustering therefore remains a fundamental problem of clustering methods (15)(16).
In this research, we study clustering validity techniques, namely Silhouette and Sum of Squared Errors, to quantify the appropriate number of clusters for the k-means algorithm. The rest of this paper is organized as follows: Section 2 discusses related research, Section 3 describes the methodology, Section 4 presents the experimental results, and the last section contains conclusions.

Related Research
Rousseeuw (12) proposed the concept of monitoring clusters and introduced the Silhouette technique, which is based on comparing the tightness and separation of objects. The silhouette reflects how well the data are grouped, i.e., whether objects are organized into the groups that match them. It is a tool for assessing the validity of a clustering and can be used to select the optimal k. Kwedlo (17) proposed solving the problem of determining the number of clusters using the Sum of Squared Errors (SSE).

Proposed Methodology

A Framework of Data Clustering and Validation Approach
Clustering is a data mining task and an unsupervised learning technique (2, 17-20). The principle of data clustering is that objects in the same cluster should be very similar, while objects in other clusters should be less similar (17). There are various clustering algorithms, for example the Basic Sequential Algorithmic Scheme (BSAS), Partitioning Around Medoids (PAM), Fuzzy c-Means (FCM), and k-means (8).
The clustering process consists of 5 main steps (21), beginning with (a) setting the number of clusters for clustering

k-Means Clustering
k-means clustering is a technique that relies on the center of each cluster, often represented by the mean of the cluster. It measures group similarity by iteratively measuring the Euclidean distance between each object and each cluster center (2). The k-means algorithm is an iterative algorithm described by the following steps (2): (a) choose initial centroids {m1,…,mk} of the clusters {C1,…,Ck}; (b) calculate new cluster memberships: a feature vector xj is assigned to cluster Ci if and only if ||xj − mi|| ≤ ||xj − ml|| for all l = 1,…,k; (c) recalculate the centroids of the clusters accordingly; (d) if none of the cluster centroids has changed, finish the algorithm, otherwise go to step (b).
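The four steps above can be sketched in Python. This is a minimal illustration of the algorithm, not the authors' implementation; the data, seed, and function names are ours:

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means: (a) choose initial centroids, (b) assign each
    point to its nearest centroid (Euclidean distance), (c) recompute
    centroids as cluster means, (d) stop when centroids stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)               # (a) initial centroids
    for _ in range(max_iter):
        # (b) assign each point to the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # (c) recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:              # (d) converged
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of three points each:
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, 2, seed=1)
```

On data this cleanly separated, the algorithm recovers the two groups of three points regardless of which initial centroids are sampled.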

Silhouette Measure
The concept of Rousseeuw (12) is as follows: the Silhouette is a tool used to assess the validity of a clustering. The silhouette is constructed to select the optimal number of clusters for ratio-scale data (as in the case of Euclidean distances) and suits clearly separated clusters. It considers average proximities, in the two forms of dissimilarities and similarities, and works best in a situation with roughly spherical clusters.
Case #1 considers dissimilarities (12). As in fig. 3, take an object i in the data set, assigned to cluster A, and define:

a(i) = average dissimilarity of i to all other objects of A.
d(i, C) = average dissimilarity of i to all objects of a cluster C ≠ A.
b(i) = minimum of d(i, C) over all clusters C ≠ A.
B = the cluster for which this minimum is attained, called the neighbor of object i.

Cluster B is like the second-best choice for object i: if i could not be accommodated in cluster A, cluster B would be its closest competitor (fig. 3). The number s(i) is obtained by combining a(i) and b(i) as follows:

s(i) = (b(i) − a(i)) / max{a(i), b(i)}

Case #2 considers similarities (12). In this case, define a'(i) and d'(i, C) analogously and put b'(i) = maximum of d'(i, C) over all C ≠ A. The number s(i) is then obtained by

s(i) = (a'(i) − b'(i)) / max{a'(i), b'(i)}
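Case #1 (dissimilarities, with Euclidean distance) can be translated directly into code. The following is a sketch under our own naming; singleton clusters are given s(i) = 0 by convention:

```python
import math

def silhouette(clusters):
    """Silhouette value s(i) for every object (dissimilarity case):
       a(i) = average distance of i to the other objects of its cluster,
       b(i) = smallest average distance of i to another cluster,
       s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for pi, p in enumerate(cluster):
            if len(cluster) == 1:          # singleton cluster: s(i) = 0
                scores.append(0.0)
                continue
            a = sum(math.dist(p, q) for qi, q in enumerate(cluster)
                    if qi != pi) / (len(cluster) - 1)
            b = min(sum(math.dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return scores

# Two tight, well-separated clusters give s(i) close to 1 for every object:
scores = silhouette([[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]])
```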

Sum of Squared Errors
k-means clustering assigns each object x to a cluster Ci whose centroid mi, obtained via Euclidean distance measurement, serves as the reference point for checking the quality of the clustering. The Sum of Squared Errors (SSE) is another technique for clustering validity, defined as follows (17):

SSE = Σ (i = 1..k) Σ (x ∈ Ci) ||x − mi||²

where the inner sum runs over every object x belonging to cluster Ci.
A condition for applying SSE to clustering is k ≥ 2 (22). When SSE is plotted against the k value, the knee point (a significant "knee" in the graph) indicates the appropriate number of clusters for k-means clustering (8), as shown in fig. 5.
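The SSE definition above is a short computation; this sketch (our naming, not the paper's code) assumes the clusters and their centroids are already available:

```python
import math

def sse(clusters, centroids):
    """Sum of Squared Errors: the squared Euclidean distance of every
    object to its own cluster centroid, summed over all clusters."""
    return sum(math.dist(p, m) ** 2
               for cluster, m in zip(clusters, centroids)
               for p in cluster)

# Each point of the first cluster is 1 unit from its centroid, so SSE = 2:
total = sse([[(0.0, 0.0), (0.0, 2.0)], [(5.0, 5.0)]],
            [(0.0, 1.0), (5.0, 5.0)])
```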

Selecting an Appropriate Number of Clusters
The monitoring tools for clustering can support the selection of the correct k value for k-means clustering, as follows.
In Fig. 4, the Silhouette is used to assist in cluster monitoring. Comparing Fig. 4(a) and (b), the average silhouette of the clustering when k = 3 (0.33) is greater than when k = 2 (0.28).
In Fig. 5, the SSE is used to inspect the clusters. The knee point clearly indicates the appropriate number of clusters: k = 3 in fig. 5(a), k = 4 in fig. 5(b), and k = 5 in fig. 5(c).

Experimental Data
The research uses synthesized data with 3 shapes; each shape has 4 datasets (100 items each), generated by applying a Gaussian (normal) distribution. In fig. 6 the distribution is spherical around the center of the dataset, in fig. 7 the distribution is non-spherical and lies along the x-axis, and in fig. 8 the distribution is spherical but the groups partially overlap.
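Data of this kind can be synthesized with a few lines of Python. This is a sketch of the spherical case only; the centers, spread, and 4×25 layout are our illustrative assumptions, not the paper's exact parameters:

```python
import random

def make_gaussian_dataset(centers, n_per_cluster, sd=0.5, seed=0):
    """Synthesize a 2-D dataset by drawing n_per_cluster points from a
    Gaussian (normal) distribution around each given center."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, sd), rng.gauss(cy, sd))
            for cx, cy in centers
            for _ in range(n_per_cluster)]

# Four spherical groups, 25 points each, for a 100-item dataset:
data = make_gaussian_dataset(
    [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0), (5.0, 5.0)], 25)
```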

Results of k-Means Clustering Method
In the experiments, the researchers repeated the k-means clustering algorithm on the datasets while changing the value of k from k = 2 to k = 10; the specific clusterings for k = 2, 4, and 6 are shown in fig. 6, fig. 7, and fig. 8.
The next step is to investigate the clusters, relying on the analysis of both Silhouette and SSE described above. Fig. 4 shows the silhouette of the clustering when k=2 and k=3 (12).

Clustering Validity with Silhouette Measure
Fig. 9 illustrates the silhouette of the k-means clusterings obtained by repeating the grouping while changing the value of k from 2 to 10; it compares the density and separation of each cluster. The density and separation are found to be optimal at k = 2 and k = 4.
Assessing the quality of a clustering with the silhouette requires not only the silhouette diagrams but also the average silhouette value. The average over all silhouette values is highest when k=4, as shown in table 1.
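The selection rule is simply to take the k with the largest average silhouette. A sketch, using hypothetical average values rather than the actual numbers from Table 1:

```python
def best_k_by_silhouette(avg_silhouette):
    """Pick the k whose average silhouette, taken over all objects in
    the clustering, is the largest."""
    return max(avg_silhouette, key=avg_silhouette.get)

# Hypothetical averages per k (illustration only, not Table 1's values):
avg = {2: 0.28, 3: 0.33, 4: 0.41, 5: 0.30}
chosen_k = best_k_by_silhouette(avg)
```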

Clustering Validity with SSE
The results of using SSE to inspect the clusters are shown in Table 2; as noted above, SSE is applied only for k ≥ 2.

Conclusions
In the results above, examination with Silhouette clustering analysis determines that k = 4 gives the highest average Silhouette across all the data sets. Examination of the graph showing the relationship between SSE and the k value gives k = 4 as the knee point. This means the examinations by Silhouette and SSE are consistent: both indicate the same number of clusters, k = 4.
However, a comparison of SSE and Silhouette requires attention to whether the data overlap. When the data do not overlap, both SSE and Silhouette assess the number of clusters appropriately. When the data begin to overlap, SSE provides an assessment closer to the true value.
Reference (9) proposed clustering documents based on reducing the dimensionality of the data, combined with k-means clustering based on support vectors and the silhouette measure. The authors experimented with patent documents from UCI, analyzing each group of documents separately and clustering them for technology forecasting.

Fig. 1. Data clustering when k=3 and k=5.
For example, fig. 4 shows the silhouette results of clustering: fig. 4(a) presents the clustering for k = 2 and fig. 4(b) for k = 3. The figure compares the density and separation, the neighbors, and the average silhouette of each cluster. The silhouette supports evaluating the clustering via the maximum silhouette value.
Fig. 5. Determining the number of clusters from the relationship between SSE and the k value.
Table 2 shows the SSE value and the rate of change of the SSE for k = 2 to 10; when k = 4, the SSE has the maximum rate of change. The rate of change (%Change) is defined as follows:

%Change(k) = (SSE(k−1) − SSE(k)) / SSE(k−1) × 100
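Reading the rate of change as %Change(k) = (SSE(k−1) − SSE(k)) / SSE(k−1) × 100 (the extracted equation is garbled, so this is our reconstruction), the knee can be located programmatically. The sample SSE values below are illustrative, not Table 2's:

```python
def percent_change(sse_by_k):
    """%Change(k) = (SSE(k-1) - SSE(k)) / SSE(k-1) * 100 between
    consecutive k values; the k with the largest %Change is the knee."""
    ks = sorted(sse_by_k)
    return {k: (sse_by_k[prev] - sse_by_k[k]) / sse_by_k[prev] * 100
            for prev, k in zip(ks, ks[1:])}

# Illustrative SSE values: the big drop from k=3 to k=4 marks the knee.
changes = percent_change({2: 100.0, 3: 80.0, 4: 20.0, 5: 18.0})
knee = max(changes, key=changes.get)
```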

Table 2. SSE values and %Change from the k-means algorithm for k = 2 to 10.

Table 1. Comparison of the average Silhouette of k-means clusterings for k = 2 to 10.