Using network topology and rule-based strategy to identify community structure in social networks

Community detection aims to identify groups of nodes, or network partitions, in which edge connectivity is tight inside communities and loose between communities. The meaning of community structure varies with the type of network: for example, a community corresponds to a set of web pages on the same topic in the World Wide Web, or to a target group of customers exhibiting the same purchase behaviors in a social network. Hence, many approaches, such as evolutionary computation, data mining, modularity optimization, density-based and topology-based methods, have been proposed to address community detection in many disciplines. In this study, we propose a hierarchical arc-merging algorithm that uses network topology and a rule-based arc-merging strategy to identify community structure and to reveal network hierarchy. Five well-known social networks with ground-truth communities are used to verify the correctness of the identified community structure, and two synthetic networks are used to examine the effectiveness of avoiding the resolution-limit problem. The experimental results show that the communities identified by the proposed method are closer to the ground-truth communities and that the method also overcomes the resolution-limit problem.


Introduction
Many real-world systems can be modeled as networks, which consist of sets of nodes and edges (1)(2)(3). For example, in a social network, nodes represent persons and edges describe friendships between pairs of them; in a collaboration network, nodes are authors and edges represent co-authorship of a paper; in the World Wide Web (WWW), nodes are web pages and edges are hyperlinks from one page to another. Networks have been widely investigated and found to have many important properties (1): the "small-world effect" indicates a high clustering coefficient and a short average path length between nodes; the "long tail" describes a degree distribution that follows a power law, meaning that many nodes have a small number of connections and a few nodes have a large number of connections; and "community structure" means that nodes with similar features are tightly connected to each other within the same group and loosely connected to nodes of other groups.
The goal of community detection is to identify groups of nodes or network partitions that satisfy a specific criterion, such as compactness of edge connectivity (1,4). The meaning of a community varies with the type of network (1)(2): for example, a community corresponds to a set of web pages on the same topic in the WWW; to a social group in an acquaintance network; to a circuit, pathway or motif serving a certain synthesizing or regulating function in a metabolic network; or to a target group of customers exhibiting the same purchase behaviors in a social network. Hence, community detection is widely used to identify community structures or social groups in networks for many practical applications, e.g. marketing and disease prevention.
Accordingly, the community detection problem can be reduced to a graph partition problem. However, the reduced problem is NP-complete (NPC) (5). Therefore, a large number of approaches have been proposed in many disciplines to find approximate solutions to the community detection problem. For example, in computer science, evolutionary computation (EC) approaches such as the genetic algorithm (GA) (6)(7), ant colony optimization (ACO) (8) and particle swarm optimization (PSO) (9), and artificial intelligence (AI) approaches such as the greedy algorithm (10) and simulated annealing (SA) (11), are often used for NPC problems; in network science, traditional hierarchical clustering algorithms (e.g. GN (2) and FN (10)) were proposed and used in the early stage, and modularity optimization algorithms (e.g. CNM (12) and BGLL (13)), the k-medoids algorithm (14), density-based algorithms (e.g. DenShrink (15) and ImDS (16)) and topology-based algorithms (e.g. LPW (17)) have been proposed more recently.
Modularity (10,18) has been widely used to evaluate the quality of the community structure produced by the above methods. Modularity compares the density of edges inside communities with the density of edges between communities. A higher modularity value indicates a better community structure or network partition. Hence, modularity is used as the fitness function in evolutionary computation approaches and as the objective function in network science approaches for finding the best approximate solution to community detection. However, modularity has a serious and inherent problem known as the resolution limit (19): a community smaller than a certain scale may not be resolved and is instead merged into a bigger community. Therefore, methods that use modularity as the fitness or objective function may suffer from the resolution-limit problem, whereas density-based algorithms do not.
In this study, we propose a hierarchical arc-merging (HAM) algorithm that combines the advantages of topology-based and density-based algorithms, using network topology and a rule-based arc-merging strategy to identify community structure and to reveal network hierarchy. Similarity measures are used to evaluate how similar any two nodes are and to rank the importance of edges. The rule-based arc-merging strategy is applied to identify communities and to reveal hierarchical network structures. In the preliminary experiment, five well-known social network datasets are used to verify the correctness of finding the ground-truth communities, and two synthetic networks are used to examine whether HAM suffers from the resolution-limit problem. According to the experimental results, the community structure identified by HAM is close to, or even identical to, the ground-truth communities of the social networks. Also, HAM is capable of overcoming the resolution-limit problem.

Similarity measures
Similarity measures are used to determine the weight of the edges of a network (20). The most common approach to determining the similarity weight of nodes i and j, $w_{ij}$, is to count their common neighbors:

$$ w_{ij} = |N_i \cap N_j| \qquad (1) $$

A high similarity weight means that the two connected nodes share many common neighbors. If two nodes share exactly the same neighbors, they are considered structurally equivalent. Further, $w_{ij}$ becomes different similarity measures under different normalizations, such as cosine similarity, the Jaccard index or minimum similarity, which are defined as

$$ w_{ij}^{\mathrm{cos}} = \frac{|N_i \cap N_j|}{\sqrt{|N_i|\,|N_j|}}, \qquad w_{ij}^{\mathrm{jac}} = \frac{|N_i \cap N_j|}{|N_i \cup N_j|}, \qquad w_{ij}^{\mathrm{min}} = \frac{|N_i \cap N_j|}{\min(|N_i|, |N_j|)}, $$

where $N_i$ is the neighbor set of node i, $N_j$ the neighbor set of node j, $|N_i|$ the size of the neighbor set of node i, and $|N_j|$ the size of the neighbor set of node j.
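As an illustration, the following minimal Python sketch computes the common-neighbor count and its three normalizations from an edge list. It assumes the standard neighbor-set definitions of cosine similarity, the Jaccard index and minimum similarity; the function names `neighbor_sets` and `similarity` and the toy graph are hypothetical and not taken from the original implementation.

```python
from math import sqrt


def neighbor_sets(edges):
    """Map each node to the set of its neighbors (undirected graph)."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def similarity(nbrs, i, j, kind="common"):
    """Similarity weight of nodes i and j based on their neighbor sets."""
    ni, nj = nbrs[i], nbrs[j]
    common = len(ni & nj)  # number of common neighbors
    if kind == "common":
        return common
    if kind == "cosine":
        return common / sqrt(len(ni) * len(nj))
    if kind == "jaccard":
        return common / len(ni | nj)
    if kind == "minimum":
        return common / min(len(ni), len(nj))
    raise ValueError(f"unknown similarity measure: {kind}")


# Toy graph (assumed for illustration): triangle a-b-c plus a pendant node d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
nbrs = neighbor_sets(edges)
weights = {(u, v): similarity(nbrs, u, v, "jaccard") for u, v in edges}
```

In this toy graph the edge (c, d) receives weight zero under every measure, because the pendant node d shares no neighbor with c.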

Community quality measure
Since there are many possible partitions of a given network, how to measure the quality of a network partition is an important issue. Modularity (10,18) has been widely used as the fitness or objective function to evaluate the quality of community structure in many approaches. In terms of modularity, a meaningful network partition is one with many edges inside communities and only a few between communities. Here, "meaningful" means that the number of connections inside a community should be larger than the expected number of connections in a randomized community with the same size and degree sequence. Hence, the randomized network is used as the null model in modularity. For a given network with $k$ communities, modularity is defined as

$$ Q = \sum_{i=1}^{k} \left( e_{ii} - a_i^2 \right) = \sum_{i=1}^{k} \left[ \frac{l_i}{L} - \left( \frac{d_i}{2L} \right)^2 \right], $$

where $e_{ii}$ is the proportion of edges with two endpoints in community i, $a_i$ the proportion of edges with at least one endpoint in community i, $l_i$ the number of edges with two endpoints within community i, $d_i$ the sum of the degrees of the nodes in community i, and $L$ the total number of edges in the network.
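The modularity of a given partition can be computed directly from an edge list. The sketch below assumes the standard form $Q = \sum_i [\, l_i / L - (d_i / 2L)^2 \,]$ for an unweighted, undirected graph, where $L$ is the total number of edges; the function name `modularity` and the toy graph are illustrative assumptions.

```python
def modularity(edges, community_of):
    """Modularity of a partition of an unweighted, undirected graph.

    edges        -- list of (u, v) pairs
    community_of -- dict mapping each node to its community label
    """
    total_edges = len(edges)
    inside = {}  # l_i: number of edges with both endpoints in community i
    degree = {}  # d_i: sum of the degrees of the nodes in community i
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(
        inside.get(c, 0) / total_edges - (d / (2 * total_edges)) ** 2
        for c, d in degree.items()
    )


# Toy graph (assumed for illustration): two triangles joined by one edge.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_two = modularity(edges, two_groups)  # 2 * (3/7 - (7/14)^2) = 5/14
q_one = modularity(edges, one_group)   # 7/7 - (14/14)^2 = 0
```

Splitting the graph into its two triangles gives a positive modularity, whereas placing all nodes in a single community gives zero, which matches the intuition that a meaningful partition has more internal edges than the null model expects.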
A serious and inherent problem of modularity, called the resolution limit, has been reported (19). In modularity optimization algorithms, communities below a certain size may not be resolved and are merged into bigger communities, which means that modularity may miss some important substructures of the network. Therefore, a method that uses only modularity to identify communities should consider how to avoid the resolution-limit problem.

The proposed method
In this study, we propose a hierarchical arc-merging (HAM) algorithm that uses network topology and a rule-based arc-merging strategy to identify community structure and to reveal network hierarchy. First, edges are classified into three classes according to the similarity weight of each edge. Second, the rule-based strategy is applied to merge nodes, to find communities, and to construct hierarchical network structures.

Edge classification
The similarity measures are used to calculate the similarity weight of each edge, and the edges are then classified into three classes: weighted-edge $E_w$, bridge-edge $E_b$ and sink-edge $E_s$. After that, each edge class is sorted in decreasing order by two indexes: the weight of the edge, and the sum of the weights between its two endpoints and their neighbors. The weighted-edge class is the set of edges for which the weight of nodes i and j is larger than zero, i.e. $w_{ij} > 0$.
The bridge-edge class is the set of edges with weight equal to zero, i.e. $w_{ij} = 0$, which means that these edges connect different communities or components of the network. The sink-edge class is the set of edges for which the degree of either node i or node j equals one, i.e. $k_i = 1$ or $k_j = 1$; additionally, in a weighted graph, the weight of nodes i and j also equals zero, i.e. $w_{ij} = 0$. Hence, $E = E_w \cup E_b \cup E_s$ and $V = V_w \cup V_b \cup V_s$, where $V_w$ is the node set of the weighted-edge class, $V_b$ the node set of the bridge-edge class, and $V_s$ the node set of the sink-edge class.
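The classification can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original implementation: the sink-edge test is applied before the bridge-edge test (both classes have zero weight), and the second sorting index is interpreted as the sum of the weights of all edges incident to the two endpoints. The function name `classify_edges` and the toy graph are hypothetical.

```python
def classify_edges(edges, weight):
    """Split edges into weighted, bridge and sink classes, each sorted
    in decreasing order of (edge weight, incident-weight sum)."""
    degree = {}
    incident = {}  # assumed second index: sum of weights of incident edges
    for u, v in edges:
        for node in (u, v):
            degree[node] = degree.get(node, 0) + 1
            incident[node] = incident.get(node, 0.0) + weight[(u, v)]

    classes = {"weighted": [], "bridge": [], "sink": []}
    for u, v in edges:
        w = weight[(u, v)]
        if w == 0 and (degree[u] == 1 or degree[v] == 1):
            classes["sink"].append((u, v))      # one endpoint has degree one
        elif w > 0:
            classes["weighted"].append((u, v))  # endpoints share neighbors
        else:
            classes["bridge"].append((u, v))    # zero weight, no pendant node

    def sort_key(edge):
        u, v = edge
        return (weight[edge], incident[u] + incident[v])

    for name in classes:
        classes[name].sort(key=sort_key, reverse=True)
    return classes


# Toy graph (assumed): two triangles joined by c-d, plus pendant node g on f.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f"), ("f", "g")]
nbrs = {}
for u, v in edges:
    nbrs.setdefault(u, set()).add(v)
    nbrs.setdefault(v, set()).add(u)
# Common-neighbor count used as the similarity weight.
weight = {(u, v): len(nbrs[u] & nbrs[v]) for u, v in edges}
classes = classify_edges(edges, weight)
```

Here the edge c-d joins the two triangles and is classified as a bridge-edge, f-g ends at a degree-one node and is a sink-edge, and the six triangle edges are weighted-edges.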

Rule-based arc-merging strategies
The idea of arc-merging is to merge the two endpoints of edges into super-nodes in decreasing order of edge similarity; the super-nodes then act as seeds that attract nearby non-merged nodes to form larger communities or substructures. To achieve this, we define four arc-merging rules, which can be combined to build different strategies for different edge classes. For each edge $(i, j)$ in $E$, the arc-merging rules are defined as
Rule-4 (R4): otherwise, pass.
According to the edge classification, two kinds of strategies can be built from the above arc-merging rules: the weighted-merging strategy and the non-weighted-merging strategy, which are defined as
Weighted-merging strategy (T1): use R1, R2, R3 and R4.
Non-weighted-merging strategy (T2): use R2, R3 and R4.
First, the weighted-merging strategy consists of all four rules and is used for merging the weighted-edge class. For each edge in this class, R1 is used to merge the two endpoints of edges with high similarity into super-nodes that act as seeds, R2 and R3 are used to attract the nearby non-merged nodes to the super-nodes, and R4 handles the remaining cases. Second, the non-weighted-merging strategy consists of the three rules other than R1 and is used for merging the bridge-edge and sink-edge classes. After T1 is applied, all endpoints of the weighted-edge class have been merged into super-nodes, so most of the major communities or substructures of the network have been found. However, some endpoints of the bridge-edge and sink-edge classes have not yet been merged. Therefore, the non-weighted-merging strategy is used to process these nodes.
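The two strategies can be illustrated with a short Python sketch. The exact conditions of R1, R2 and R3 are not reproduced in this text, so the sketch assumes one plausible reading of their described roles: R1 creates a new super-node when neither endpoint is merged, and R2 and R3 attach a non-merged endpoint to the super-node of its already merged neighbor. These conditions, the function name `arc_merge` and the toy edge lists are assumptions for illustration only.

```python
def arc_merge(sorted_edges, community=None, use_r1=True):
    """Process edges in the given order and return a node -> super-node map.

    use_r1=True  sketches the weighted-merging strategy (T1: R1-R4).
    use_r1=False sketches the non-weighted-merging strategy (T2: R2-R4).
    """
    community = dict(community or {})
    next_id = max(community.values(), default=-1) + 1
    for i, j in sorted_edges:
        i_merged, j_merged = i in community, j in community
        if use_r1 and not i_merged and not j_merged:
            # Assumed R1: both endpoints are free, so they seed a super-node.
            community[i] = community[j] = next_id
            next_id += 1
        elif i_merged and not j_merged:
            # Assumed R2: the free endpoint j is attracted to i's super-node.
            community[j] = community[i]
        elif j_merged and not i_merged:
            # Assumed R3: the free endpoint i is attracted to j's super-node.
            community[i] = community[j]
        # R4: otherwise, pass.
    return community


# Toy input (assumed): two triangles joined by c-d, with pendant node g on f.
weighted_edges = [("a", "b"), ("a", "c"), ("b", "c"),
                  ("d", "e"), ("d", "f"), ("e", "f")]
other_edges = [("c", "d"), ("f", "g")]  # bridge-edge, then sink-edge

after_t1 = arc_merge(weighted_edges, use_r1=True)
after_t2 = arc_merge(other_edges, community=after_t1, use_r1=False)
```

Under these assumed rules, T1 merges each triangle into its own super-node, and T2 then attaches the pendant node g to the super-node of f while leaving the bridge c-d as a connection between the two super-nodes.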

Hierarchical arc-merging (HAM) algorithm
In HAM, the architecture of community detection consists of two phases, the original network phase and the super-node network phase, as shown in Fig. 1. In the original network phase, first, the similarity measure is used to calculate the similarity weight of the edges. Second, the T1 strategy is applied to process the weighted-edge class. After that, some nodes of the bridge-edge and sink-edge classes remain non-merged. Third, the T2 strategy is applied to process the remaining nodes. Finally, every node of the original network has been merged into one community, and the high-level network structure is constructed from the super-nodes.
In the super-node network phase, first, the similarity weight of each edge is obtained by averaging the similarity weights of the edges between the corresponding communities. Second, the edges are classified into the three classes, i.e. weighted-edge, bridge-edge and sink-edge. Third, the T1 strategy is applied to process the weighted-edge class. Fourth, the T2 strategy is applied to process the sink-edge class. Finally, the higher-level network structure is constructed. Because the edges between super-nodes are meant to connect different communities, the T2 strategy is not applied to the bridge-edge class. The procedure repeats until the weighted-edge class is empty. The details of HAM are shown in Algorithm 1.
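The first step of the super-node network phase, averaging the similarity weights of the edges that run between two communities, can be sketched as follows. The function name `super_node_weights`, the data layout and the toy values are assumptions made for illustration.

```python
def super_node_weights(edges, weight, community):
    """Average the weights of original edges between each pair of communities.

    Returns a dict mapping an ordered community pair (c1, c2), c1 < c2,
    to the mean similarity weight of the edges joining them.
    """
    total, count = {}, {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            continue  # edges inside a community vanish in the super-node network
        key = (min(cu, cv), max(cu, cv))
        total[key] = total.get(key, 0.0) + weight[(u, v)]
        count[key] = count.get(key, 0) + 1
    return {key: total[key] / count[key] for key in total}


# Toy input (assumed): two communities joined by two edges.
edges = [("a", "b"), ("c", "d"), ("a", "c"), ("b", "d")]
weight = {("a", "b"): 1.0, ("c", "d"): 1.0, ("a", "c"): 0.5, ("b", "d"): 0.25}
community = {"a": 0, "b": 0, "c": 1, "d": 1}

super_weights = super_node_weights(edges, weight, community)
```

The resulting weights can then be classified and merged in the same way as in the original network phase.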

Preliminary experimental results
In this study, we use five well-known social networks with ground-truth communities to analyze how well the identified communities match the real communities, and two synthetic networks to test whether HAM suffers from the resolution-limit problem. The giant connected component (GCC) of each network used in the experiment is shown in Table 1.

Real social networks
Zachary's karate club network (21) has 34 nodes and 78 edges, where nodes are club members and edges are friendships between members. Because of a disagreement between the administrator of the club and the club's instructor, the club split into two groups: the instructor left the club, taking half of the club members, and created a new club.

The Dolphins network (22) consists of 62 bottlenose dolphins living in Doubtful Sound, New Zealand, and 159 edges that represent pairs of dolphins interacting more often than expected by chance. The network was constructed from seven years of observations, from 1994 to 2001. It can be divided into two groups, which formed following the departure of a key individual from the population.
The United States college football network (2) contains 115 teams that played 613 games during the regular fall season of 2000. Nodes represent teams and edges represent games between two teams. The teams are divided into 12 conferences, and teams play more games within their own conference than against teams of other conferences.
The political books network (23) is constructed from users' historical purchases of political books on the Amazon.com website. There are 105 books and 441 edges in this network. Nodes are books, and an edge indicates a co-purchasing relationship, in which a user who bought one book also bought the other. The books are manually classified into three classes: conservative, neutral and liberal.
The political blogs network (24) is composed of blogs, which are personal or group web diaries, about the 2004 U.S. presidential election, together with the web links between them. There are 1222 blog websites as nodes and 16714 edges between nodes. Each blog is manually labeled by its political orientation, so the blogs can be divided into two groups: conservative and liberal.

Synthetic networks
Two synthetic networks (19) are used to examine whether the proposed method suffers from the resolution-limit problem. The first synthetic network, called RING, is a ring of $n$ cliques (with $n$ even), in which each pair of adjacent cliques is connected by a single edge. Each clique is a complete graph $K_m$, which consists of $m$ nodes and $m(m-1)/2$ edges.
The second synthetic network, called PAIR, consists of two parts. The first part is two complete graphs $K_m$ connected by a single edge. The second part is two complete graphs $K_p$, also connected by a single edge. One clique of the first part is connected to each of the two cliques of the second part by a single edge. The parameters of RING are $m = 5$ and $n = 30$, so RING consists of 150 nodes and 330 edges. The parameters of PAIR are $m = 20$ and $p = 5$, so PAIR consists of 50 nodes and 404 edges.
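The sizes of both synthetic networks, and the effect of the resolution limit on RING, can be checked with a short Python sketch. The constructions below follow the descriptions above, but the choice of which node in each clique carries an inter-clique edge is an assumption (it does not affect the node or edge counts), and the modularity function assumes the standard unweighted form. All function names are hypothetical.

```python
from itertools import combinations


def clique(nodes):
    """All edges of a complete graph on the given nodes."""
    return list(combinations(nodes, 2))


def ring_of_cliques(n, m):
    """n cliques K_m arranged in a ring, adjacent cliques joined by one edge."""
    edges = []
    for c in range(n):
        edges += clique([(c, k) for k in range(m)])
        edges.append(((c, m - 1), ((c + 1) % n, 0)))  # link to the next clique
    return edges


def pair_network(m, p):
    """Two K_m joined by an edge, two K_p joined by an edge, and one K_m
    linked to each K_p by a single edge (requires m >= 3 and p >= 2)."""
    big = [[("A", k) for k in range(m)], [("B", k) for k in range(m)]]
    small = [[("C", k) for k in range(p)], [("D", k) for k in range(p)]]
    edges = []
    for part in big + small:
        edges += clique(part)
    edges.append((big[0][0], big[1][0]))      # K_m -- K_m
    edges.append((small[0][0], small[1][0]))  # K_p -- K_p
    edges.append((big[1][1], small[0][1]))    # one K_m -- first K_p
    edges.append((big[1][2], small[1][1]))    # same K_m -- second K_p
    return edges


def modularity(edges, community_of):
    """Standard modularity of a partition of an unweighted graph."""
    total = len(edges)
    inside, degree = {}, {}
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / total - (d / (2 * total)) ** 2
               for c, d in degree.items())


ring = ring_of_cliques(n=30, m=5)
pair = pair_network(m=20, p=5)
ring_nodes = {node for edge in ring for node in edge}
pair_nodes = {node for edge in pair for node in edge}

# One community per clique versus adjacent cliques merged in pairs.
q_single = modularity(ring, {node: node[0] for node in ring_nodes})
q_merged = modularity(ring, {node: node[0] // 2 for node in ring_nodes})
```

For these parameters the partition that merges adjacent cliques in pairs has a higher modularity (about 0.888) than the partition with one community per clique (about 0.876), which is why a pure modularity maximizer fails to resolve the individual cliques of RING.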

Community validation
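Normalized mutual information (NMI) is used to measure the quality of the community structure identified by an algorithm, i.e. the degree of similarity between the true partition $A$ and the identified partition $B$. The NMI is defined as

$$ \mathrm{NMI}(A, B) = \frac{2\, I(A, B)}{H(A) + H(B)}, $$

where $H(A) = -\sum_{a \in A} p(a) \log p(a)$ is the Shannon entropy of the true partition $A$ and $p(a)$ the proportion of community $a$ in partition $A$.
$H(B) = -\sum_{b \in B} p(b) \log p(b)$ is the Shannon entropy of the identified partition $B$ and $p(b)$ the proportion of community $b$ in partition $B$.
$I(A, B) = \sum_{a \in A} \sum_{b \in B} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}$ is the mutual information of $A$ and $B$, indicating the level of correct identification achieved by an algorithm when $B$ is given, and $p(a, b)$ is the percentage of overlap between $a$ and $b$. If $\mathrm{NMI}(A, B) = 1$, the two partitions are identical; if $\mathrm{NMI}(A, B) = 0$, they are considered independent.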
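A minimal Python sketch of the NMI computation is given below. It assumes the commonly used normalization $2 I(A, B) / (H(A) + H(B))$, with partitions given as per-node label lists in the same node order; the function name `nmi` is hypothetical.

```python
from collections import Counter
from math import log


def nmi(true_labels, found_labels):
    """Normalized mutual information of two partitions of the same nodes."""
    n = len(true_labels)
    count_a = Counter(true_labels)                     # community sizes in A
    count_b = Counter(found_labels)                    # community sizes in B
    count_ab = Counter(zip(true_labels, found_labels)) # overlap sizes

    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    i_ab = sum(
        c / n * log((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in count_ab.items()
    )
    if h_a + h_b == 0:
        return 1.0  # both partitions are a single community, hence identical
    return 2 * i_ab / (h_a + h_b)


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # same split, labels renamed
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # splits carry no shared information
```

Because NMI depends only on how the nodes are grouped, renaming the community labels does not change the score.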

Experimental results
The community detection results of HAM are compared with those of conventional methods (e.g. CNM (12) and LGA (7)) and with the referred values of a topology-based algorithm (e.g. LPW (17)), as shown in Table 1. In the configuration of LGA used in the experiment, the gene representation is the locus-based representation, and the genetic operations are truncation selection, uniform crossover and bit-by-bit mutation. The parameters of the genetic operations are a selection rate of 0.1, a crossover rate of 0.8 and a mutation rate of 0.05. LGA performs 10 evolution runs, each consisting of 200 generations. The population size is 50.
In Table 1, the community detection results include the number of identified communities, the number of levels, the modularity value and the NMI value. The number of identified communities can be compared with the number of ground-truth communities to see how many true communities are found by a method. The number of levels shows the level at which the best NMI value is obtained and how many levels are constructed by HAM. For example, "2/3" for the dolphins network means that the largest NMI value is obtained at level 2 and that HAM constructs three levels. For modularity, the best values are found by CNM or LGA, because both methods try to maximize modularity.
However, HAM does not find the best modularity value because it has no modularity optimization mechanism. In the NMI results, HAM finds higher NMI values than the other methods, which means that the communities identified by HAM are closer to the ground-truth communities. The subscripts of the NMI values, "c", "j" and "m", indicate that cosine similarity, the Jaccard index and minimum similarity, respectively, are used in HAM. Further, the RING and PAIR networks are used to examine whether a method overcomes the resolution-limit problem: a method is considered free from the resolution-limit problem if all ground-truth communities of the PAIR (or RING) network can be identified. The results show that HAM overcomes the resolution-limit problem, as shown in Fig. 2.
Fig. 3 demonstrates the network hierarchy of the dolphins network. Fig. 3a shows the connection structure of the original network, i.e. level 0: solid lines indicate weighted-edges, dashed lines indicate bridge-edges and dotted lines indicate sink-edges; for clarity of the connection structure, the nodes are not shown. Fig. 3b shows the ground-truth communities of the dolphins network. Fig. 3c and 3d show the projection from a specific level of the super-node network to the level of the original network. For example, Fig. 3d shows the projection from level 2 of the super-node network to level 0 of the original network, and also shows which high-level community each node belongs to. Each level of the super-node network exhibits a different community structure, i.e. the higher-level structures are constructed from lower-level structures. Because level 3 contains only one community, its visualization is not shown.

Conclusion and future work
In this study, we propose a hierarchical arc-merging (HAM) algorithm that uses network topology and a rule-based arc-merging strategy to identify community structure and to reveal network hierarchy. Five well-known social networks and two synthetic networks with ground-truth communities are used to analyze and test the proposed method. The experimental results indicate that HAM is capable of identifying community structures with higher NMI values, i.e. the identified communities are closer to the ground-truth communities. The experimental results also show that HAM overcomes the resolution-limit problem. Although HAM shows advantages in identifying communities more accurately, we did not carry out a detailed comparison between HAM and other state-of-the-art methods, such as density-based and topology-based algorithms, in this study. Also, a modularity optimization mechanism should be taken into consideration for networks without ground-truth communities. These issues and further improvements are left as future work.

Fig. 1. The architecture of community detection in HAM.

Fig. 2. The comparison of the resolution-limit problem in the PAIR network.
Fig. 3. The network hierarchy demonstration of the dolphins network. (a) The original network (level 0). (b) Ground-truth community. (c) The projection from level 1 of the super-node network to level 0 of the original network. (d) The projection from level 2 of the super-node network to level 0 of the original network.

Table 1. The giant connected component (GCC) and community detection results of networks.