A Robust K-Means for Document Clustering

We propose a robust K-means clustering algorithm for document clustering, where a document-term matrix is given as the input dataset and the documents are clustered on the basis of the frequencies of the terms occurring in each document. We introduce a robust loss function into K-means clustering to obtain a robust variant, and we also propose a feature transform method that improves the performance of document clustering. Experimental results on one of the BBC datasets, derived from BBC News, show that the proposed method improves both the robustness of K-means to outliers and the performance of document clustering.


Introduction
Document clustering is a process for finding clusters of similar documents in a large set of textual documents, and has various applications including information retrieval, text mining and automatic document organization. Numerous document clustering methods have been studied extensively in recent years. Andrews and Fox overviewed recent developments in document clustering research [1]. Shah and Mahajan reviewed semantics-driven document clustering methods [2], and also provided a detailed overview of various document clustering algorithms [3]. Balabantaray et al. compared the K-means and K-medoids clustering algorithms for document clustering, and observed that K-means yields better results than K-medoids [4]. Dzogang et al. extended the spherical K-means [5] to an ellipsoidal K-means [6]. Mei and Wang proposed a hyperspherical fuzzy C-means clustering algorithm for online document categorization [7].
The above document clustering methods based on K-means and its variants, including K-medoids, do not directly introduce robust techniques. In this paper, we propose a robust K-means clustering algorithm for document clustering. In robust statistics, various approaches have been studied to achieve robustness [8]. A reasonable approach is to design a robust potential function [9] or loss function [10]. We propose a robust loss function and apply it to K-means clustering. The proposed loss function has an intermediate form between the l1- and l2-norms. Hamza and Brady used a robust cost function similar to the proposed loss function for robust nonnegative matrix factorization for reflectance spectra reconstruction [11]. Their cost function is a special case of the proposed loss function, which is also similar to the Charbonnier loss function [12].
Recently, Barron generalized popular loss functions including the Charbonnier loss function to a two-parameter loss function [10].
We also propose a method for transforming a document-term matrix into feature vectors of documents based on the Hellinger distance [13], which is a distance between two probability distributions. Bui et al. [14] also used the Hellinger distance for multi-criteria document clustering with the latent Dirichlet allocation (LDA) proposed by Blei et al. [15]. Our proposed feature transform method is well suited to K-means clustering algorithms. Experimental results show that the proposed robust K-means clustering algorithm can alleviate the influence of outliers in the data and improve the performance of document clustering.
The rest of this paper is organized as follows: Section 2 briefly summarizes conventional K-means clustering. Section 3 proposes a robust K-means clustering algorithm based on a robust loss function. Section 4 describes a feature transform method for document clustering based on a document-term matrix. Section 5 shows experimental results of clustering of artificial and real datasets. Finally, Section 6 concludes this paper.

K-Means Clustering

Let x_i ∈ R^d be a d-dimensional real vector for i = 1, 2, . . . , n. K-means clustering aims to partition these vectors into K clusters, and is formulated as the following optimization problem:

  min E({µ_k}, {C_k}) = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} ||x_i − µ_k||²,  (1)

where C_k denotes the set of vectors in the kth cluster, µ_k ∈ R^d denotes the centroid of C_k for k = 1, 2, . . . , K, and ||·|| denotes the Euclidean norm. For a fixed set of centroids {µ_k}, we can minimize the objective function by assigning each x_i to the nearest centroid:

  C_k = { x_i : k = argmin_{k'} ||x_i − µ_{k'}||² }.  (2)

After the assignment of all the vectors, we have K temporary clusters {C_k}. Then, for the obtained {C_k}, we optimize each µ_k by solving ∂E/∂µ_k = 0 for µ_k, which gives

  µ_k = (1 / |C_k|) Σ_{x_i ∈ C_k} x_i.  (3)

The above procedure for updating {C_k} and {µ_k} is repeated until convergence. The K-means clustering algorithm is summarized as follows:

[K-means clustering algorithm]
0. Assume that a set of vectors {x_i}_{i=1}^{n} and a number of clusters K are given.
1. Initialize the centroids µ_k for k = 1, 2, . . . , K.
2. Assign each x_i to the nearest cluster determined by (2).
3. Update each µ_k by (3).
4. If every µ_k is unchanged by the above update, then halt the procedure. Otherwise, go to step 2.
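The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function name `kmeans` and the optional explicit initialization are ours.

```python
import numpy as np

def kmeans(X, K, init=None, max_iter=100, seed=0):
    """Plain K-means: alternate the assignment step (2) and the
    centroid update (3) until no centroid moves (step 4)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids (here: K distinct input vectors).
    mu = (init.astype(float) if init is not None
          else X[rng.choice(len(X), size=K, replace=False)].astype(float))
    for _ in range(max_iter):
        # Step 2: assign each x_i to its nearest centroid.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_mu = np.array([X[labels == k].mean(axis=0)
                           if np.any(labels == k) else mu[k]
                           for k in range(K)])
        # Step 4: halt when every centroid is unchanged.
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu
```

With well-separated data and sensible initial centroids, the loop converges in a handful of iterations.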

Robust K-Means Clustering
The above K-means clustering is based on the squared Euclidean distance between a given vector x_i and a centroid µ_k. The squared Euclidean distance can be expressed as

  ||x_i − µ_k||² = ρ(||x_i − µ_k||)  (4)

for ρ(x) = x². A function such as ρ(x) is referred to as a loss function [10]. In this section, we propose an alternative form of ρ(x):

  ρ_R(x) = √(x² + b²) − b  (5)

for a positive constant b. In the limiting case b = 0, the function ρ_R(x) coincides with the absolute value function |x|, which is more robust to outliers than ρ(x) = x². The derivative of ρ_R(x) is

  dρ_R/dx = x / √(x² + b²).  (6)

Among the properties of ρ_R(x), the following are used in the sequel:
Property 2. |dρ_R/dx| ≤ 1.
Property 3. ρ_R(x) ≤ ρ(x) = x² for all x if b ≥ 1/2.
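As a numerical sanity check, the stated properties can be verified on a grid. We assume the loss takes the form ρ_R(x) = √(x² + b²) − b, which is consistent with the b = 0 limit and the slope bound described above; the name `rho_r` is ours.

```python
import numpy as np

# Robust loss (assumed form): behaves like |x| as b -> 0,
# and stays below x^2 whenever b >= 1/2.
def rho_r(x, b):
    return np.sqrt(x * x + b * b) - b

x = np.linspace(-10.0, 10.0, 2001)
# Property 2: |d rho_R / dx| = |x| / sqrt(x^2 + b^2) <= 1 (here b = 1/2).
assert np.all(np.abs(x / np.sqrt(x * x + 0.25)) <= 1.0)
# Property 3: rho_R(x) <= x^2 whenever b >= 1/2 (checked at b = 1/2).
assert np.all(rho_r(x, 0.5) <= x * x + 1e-12)
# b = 0 recovers the absolute value, the most robust end of the family.
assert np.allclose(rho_r(x, 0.0), np.abs(x))
```

The bounded slope is precisely what limits the influence of far-away points on the centroid estimate.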
Based on the above loss function, we formulate robust K-means clustering as the following optimization problem:

  min E_R({µ_k}, {C_k}) = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} ρ_R(||x_i − µ_k||).  (7)

Let E_R({µ_k}, {C_k}) be the objective function in (7). Then the necessary condition for optimality is

  ∂E_R/∂µ_k = −Σ_{x_i ∈ C_k} (x_i − µ_k) / √(||x_i − µ_k||² + b²) = 0,  (8)

where 0 is a d-dimensional zero vector having all components equal to zero. From (8) we have

  µ_k = ( Σ_{x_i ∈ C_k} w_i x_i ) / ( Σ_{x_i ∈ C_k} w_i ),  w_i = 1 / √(||x_i − µ_k||² + b²),  (9)

which is used for updating µ_k instead of (3) in the proposed robust K-means clustering algorithm. Since the weights w_i depend on µ_k, this update is applied as a fixed-point iteration.
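A minimal sketch of the resulting update, iterated as a fixed point, follows. We assume the weighted-mean form µ_k = Σ w_i x_i / Σ w_i with w_i = 1/√(||x_i − µ_k||² + b²), consistent with the necessary condition above; the helper name and the default iteration count are illustrative.

```python
import numpy as np

def robust_centroid(X, mu0, b=0.5, n_iter=50):
    """Fixed-point iteration for the robust centroid update:
    mu <- sum_i w_i x_i / sum_i w_i, w_i = 1 / sqrt(||x_i - mu||^2 + b^2).
    Points far from the current centroid receive small weights, so
    outliers pull the centroid far less than the plain mean does."""
    mu = np.asarray(mu0, dtype=float).copy()
    for _ in range(n_iter):
        w = 1.0 / np.sqrt(((X - mu) ** 2).sum(axis=1) + b * b)
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
    return mu
```

On a tight cluster with a single distant outlier, the plain mean is dragged far toward the outlier while this update stays near the cluster.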

Feature Transform for Document Clustering
Assume that a document-term matrix is given as the dataset for document clustering. Let A = [a_ij], for i = 1, 2, . . . , n and j = 1, 2, . . . , d, be a document-term matrix, where a_ij, the (i, j) element of A, denotes the frequency of the jth term in the ith document. We first compute the probability of occurrence of each term in each document by

  p_i,j = a_ij / Σ_{j'=1}^{d} a_ij',

and write p_i = [p_i,1, p_i,2, . . . , p_i,d]. The Hellinger distance between p_i and p_i' is defined by [13]

  D_h(p_i, p_i') = √( (1/2) Σ_{j=1}^{d} ( √p_i,j − √p_i',j )² ).

The square of D_h(p_i, p_i') can be written as

  D_h²(p_i, p_i') = D_E²(p̃_i, p̃_i'),

where p̃_i = [√(p_i,1/2), √(p_i,2/2), . . . , √(p_i,d/2)] and D_E²(p̃_i, p̃_i') denotes the square of the Euclidean distance between p̃_i and p̃_i'. That is, the Hellinger distance between p_i and p_i' equals the Euclidean distance between p̃_i and p̃_i'. As a result, we obtain (robust) K-means clustering based on the Hellinger distance through the feature vector transform

  p̃_i = √( p_i / 2 ),

where the square root of a vector denotes the elementwise operation.
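A minimal NumPy sketch of the transform (the function name `hellinger_features` is ours). The key point is that ordinary Euclidean K-means on the transformed vectors is exactly Hellinger-distance clustering on the term distributions:

```python
import numpy as np

def hellinger_features(A):
    """Turn a document-term count matrix into feature vectors whose
    Euclidean distances equal Hellinger distances between the
    per-document term distributions."""
    P = A / A.sum(axis=1, keepdims=True)  # row-wise term probabilities p_i
    return np.sqrt(P / 2.0)               # elementwise sqrt(p_i / 2)
```

The identity is easy to check on a toy matrix: the Euclidean distance between two transformed rows matches the Hellinger distance computed directly from the definition.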

Experimental Results
In this section, we first show experimental results on artificial two-dimensional data, and then results on a real document-term dataset.

Artificial Two-Dimensional Data
Fig. 1 shows the graphs of ρ(x) = x² and ρ_R(x) in (5), where the horizontal and vertical axes denote x and ρ(x) or ρ_R(x), respectively. The blue line denotes ρ(x) = x², which is not robust to outliers: ρ(x) increases rapidly with |x|, so data lying far from a point of attention have a large impact on the result. The red lines denote ρ_R(x) in (5) for b = 1, 5 and 10, drawn with solid, broken and dash-dotted lines, respectively. These red lines have gentler slopes than the blue line; their slopes never exceed 45°, as stated in Property 2. Additionally, ρ_R(x) lies below ρ(x) if b ≥ 1/2, by Property 3.

Fig. 2 shows an artificial two-dimensional (2-D) dataset consisting of three main clusters on the left side and three outlying points on the right side. Fig. 3 shows a zoomed part of Fig. 2 containing the three clusters, where the three centroids given by the conventional K-means clustering algorithm with K = 3 are denoted by red circles; the upper-right centroid deviates toward the outliers on the right side. Fig. 4 shows the same region as Fig. 3 together with the three centroids given by the proposed robust K-means clustering algorithm with b = 1/2; the deviation of the upper-right centroid toward the outliers is alleviated compared with Fig. 3. The condition for halting the algorithms is

  max_{1≤k≤K} || µ_k^(t+1) − µ_k^(t) || < ϵ,

where the superscript t on µ_k denotes the number of iterations of the updating procedure, i.e., t = 0, 1, 2, . . ., and ϵ is a positive constant, set to ϵ = 10⁻⁶ in this example. All centroids µ_k for k = 1, 2, . . . , K are randomly initialized, and the same initial values are used in both the conventional and the proposed algorithms.

Table 1 shows the coordinates of the centroids and the numbers of iterations for the compared algorithms. The first and second rows of the table show the (x, y) coordinates of the centroid of the upper-right cluster in Figs. 3 and 4, from the left to right columns, for the outlier-free result (denoted by 'No outliers'), conventional K-means (Conv.), and the proposed robust K-means for b = 1, 1/2 and 1/10, respectively. The x-coordinate for conventional K-means differs considerably from that of the outlier-free result, whereas the proposed method, for every value of b, yields an x-coordinate closer to 'No outliers' than conventional K-means does. Better y-coordinates are obtained with b = 1/2 and 1/10.

Real Document-Term Data
Next, we show the results on the BBC Datasets [16], from which we selected the BBCSport dataset, a collection of 737 documents from the BBC Sport website corresponding to sports news articles in five topical areas, with 4613 terms; that is, the size of the document-term matrix A is n × d = 737 × 4613, and the five topics are athletics, cricket, football, rugby and tennis. We evaluate the performance of the clustering algorithms with matching matrices (or confusion matrices) [17]. Table 2 shows the matching matrix given by conventional K-means, where the row vectors of the document-term matrix are used directly as the feature vectors of the documents. The initial centroids are selected from the documents as µ_1 = a_15, µ_2 = a_168, µ_3 = a_325, µ_4 = a_513 and µ_5 = a_713, where a_i denotes the ith row vector of A; the same initialization is used in all of the following experiments. The top row (athletics) shows that the actual documents on athletics are divided into three clusters: athletics, football and tennis. The rightmost column of the table shows the total number of documents in each category. In a matching matrix, the larger the diagonal elements and the closer the off-diagonal elements are to zero, the better the obtained clusters match the true categories of the documents. Table 3 shows the matching matrix given by the proposed robust K-means with b = 1/2. The diagonal elements in Table 3 are larger than those in Table 2, except for the central element corresponding to the football category. Tables 4 and 5 show the matching matrices given by K-means and robust K-means with the proposed feature transform described in Section 4, respectively. For both methods, the performance is improved by the feature transform. Tables 6 to 9 show the averaged tables of confusion [18] computed from Tables 2 to 5, respectively.
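A matching matrix of the kind shown in Tables 2 to 5 can be tallied directly from true and predicted labels. This is a generic sketch with illustrative labels (the function name is ours); rows are true categories and columns are obtained clusters, assuming clusters have already been matched to categories.

```python
import numpy as np

def matching_matrix(true_labels, pred_labels, n_classes):
    """Rows: true categories; columns: obtained clusters.
    M[t, p] counts documents of true category t placed in cluster p."""
    M = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        M[t, p] += 1
    return M
```

A perfect clustering yields a diagonal matrix; mass off the diagonal records which categories were confused.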
In these tables, the larger the diagonal elements (True Positive (TP) and True Negative (TN)) and the smaller the off-diagonal elements (False Negative (FN) and False Positive (FP)), the better the performance.
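For a given category, the one-vs-rest entries of a table of confusion follow from a matching matrix by summing its rows and columns. A sketch (helper name ours; M is any square matching matrix with rows as true categories):

```python
import numpy as np

def class_confusion(M, k):
    """One-vs-rest TP/FN/FP/TN for category k of a matching matrix M
    (rows: true categories, columns: clusters matched to categories)."""
    tp = M[k, k]
    fn = M[k, :].sum() - tp   # class-k documents sent to other clusters
    fp = M[:, k].sum() - tp   # other documents sent to cluster k
    tn = M.sum() - tp - fn - fp
    return tp, fn, fp, tn
```

Averaging these four counts over all categories gives an averaged table of confusion of the kind shown in Tables 6 to 9.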
Finally, Table 10 summarizes the F-measures of the compared methods. The F-measure is the harmonic mean of the precision and recall,

  Precision = TP / (TP + FP),  Recall = TP / (TP + FN),

that is,

  F = 2 · Precision · Recall / (Precision + Recall).

In Table 10, the proposed robust K-means with the feature transform achieves the highest F-measure among the compared methods.
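The F-measure computation is a one-liner; note the equivalent closed form F = 2·TP / (2·TP + FP + FN), which avoids computing precision and recall separately:

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall:
    F = 2 * P * R / (P + R), with P = TP/(TP+FP), R = TP/(TP+FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)
```

For example, with TP = 8, FP = 2 and FN = 2, both precision and recall are 0.8, so the F-measure is 0.8 as well.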

Conclusion
In this paper, we proposed a robust K-means clustering algorithm for document clustering, where a document-term matrix is given as the input dataset. Initially, each document is represented by the corresponding row vector of the document-term matrix. To improve the performance of document clustering, we also proposed a feature transform method based on the Hellinger distance between probability distributions, which yields better feature vectors than the raw row vectors of the document-term matrix. Experimental results showed that the proposed method improves robustness to outliers and the performance of document clustering compared with conventional K-means. Future work includes the development of fuzzy versions of the proposed robust K-means for further improved document clustering.