Ensemble Learning For Imbalanced Data Classification Problem

Imbalanced data arise in real life, for example in medical diagnosis, where records of seriously ill patients are outnumbered by records of healthy ones. Such imbalance degrades the learning performance of data mining algorithms. The decision boundary chosen by most standard machine learning algorithms on imbalanced data tends to be biased toward the majority class and hence misclassifies the minority class. We therefore present an approach to the imbalanced data classification problem that applies decision tree ensemble learning, using both bagging and boosting techniques, to build models that compensate for misclassification with cost-sensitive learning. In this research, we build model templates from synthetic data with different characteristics. We then choose an appropriate model template for real data with different imbalanced ratios and overlapping ratios. The results show that the chosen model template can solve the imbalanced data classification problem efficiently. However, some model templates cannot classify correctly when the imbalanced ratio increases.


Introduction
Data mining (1) is a method that has been used extensively to retrieve hidden knowledge from large information repositories. Data mining tasks fall into many categories depending on the purpose of the application. One of those categories is data classification, which aims to learn patterns in order to predict the class of unknown data. Most standard algorithms for data classification perform very well in terms of overall classification accuracy if the classes are represented in equal proportion. However, these algorithms show poor learning performance when classifying imbalanced data, in which the group of interest has fewer instances than the other groups (2).
For example, we can compare the classification of balanced and unbalanced data using 300 instances and two classes. In the balanced dataset, the two classes are the same size, with 150 instances each. In the unbalanced dataset, there are 285 instances in class a but only 15 instances in class b. Both datasets are classified by decision tree induction, and the results are shown in Fig. 1. Both datasets show good performance in terms of overall classification accuracy. When considering the accuracy of each class, however, we found that class b is classified more accurately in the balanced data than in the imbalanced data: the classification accuracy for class b drops drastically from 0.947 to 0.467. This example indicates that imbalanced data affect the learning performance of algorithms, which tend to be biased toward the majority group and produce a high misclassification rate on the minority group.
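The gap between overall accuracy and per-class accuracy in this example can be reproduced with a few lines of Python. This is a minimal sketch: the class sizes (285 and 15) come from the example, but the predictions are hypothetical. They are chosen so that 7 of the 15 class b instances are correct (7/15 ≈ 0.467) and are not the predictions behind Fig. 1.

```python
def per_class_accuracy(y_true, y_pred):
    """Return overall accuracy and a dict of per-class accuracy."""
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_class = {}
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        per_class[label] = sum(y_pred[i] == label for i in idx) / len(idx)
    return overall, per_class

# 285 instances of class "a" and 15 of class "b" (sizes from the example)
y_true = ["a"] * 285 + ["b"] * 15
# Hypothetical predictions: every "a" correct, only 7 of the 15 "b" correct
y_pred = ["a"] * 285 + ["b"] * 7 + ["a"] * 8

overall, per_class = per_class_accuracy(y_true, y_pred)
print(f"overall={overall:.3f}, class a={per_class['a']:.3f}, class b={per_class['b']:.3f}")
```

Under these assumed predictions the overall accuracy stays above 0.97 even though fewer than half of the minority instances are recognized, which is why accuracy alone is a misleading measure on imbalanced data.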
The problem of classifying imbalanced data described above has drawn the attention of many researchers, who have proposed various methods to solve it. These methods focus on classifying the minority group more accurately. Some important work is as follows.
Brown and Mues (3) proposed the undersampling technique to deal with five imbalanced credit scoring datasets.
Cateni et al. (4) proposed oversampling and undersampling techniques to deal with benchmark imbalanced datasets from the KEEL Repository.
Lopez et al. (5) studied the classification performance of the hybrid techniques SMOTE+ENN and SMOTE. They solved the problem by combining data level and algorithm level approaches, with a focus on cost-sensitive learning. They compared these techniques with their proposed hybrid techniques in experiments on 66 datasets taken from the KEEL Repository.
Krawczyk et al. (6) proposed decision tree ensemble classification with cost-sensitive learning to deal with six binary benchmark imbalanced datasets from the KEEL Repository. They proposed a technique to prune the decision tree using a novel algorithm and the optimal C_minority derived from ROC analysis. They compared the proposed method with six other methods. The results showed that their proposed method was efficient on some datasets.
Liao et al. (7) proposed ensemble learning for binary classification. They used a Support Vector Machine (SVM) to rebalance the data in the preprocessing stage and then selected features for ensemble learning with a Back-Propagation Neural Network (BPNN). The outputs of the ensemble learning were used to build new knowledge with rough set theory. They performed experiments on listed electronics companies from 2005 to 2011: 63 financial crisis corporations and 2680 non-financial crisis corporations. The results showed that their proposed method was more efficient than other methods.
We recognize the importance of solving the imbalanced data classification problem with an effective method. Therefore, we present an approach for dealing with imbalanced datasets that applies decision tree ensemble learning, using both bagging and boosting techniques, to build models, and combines it with cost-sensitive learning to compensate for misclassification.
The rest of this paper is organized as follows. Section 2 gives details of the background and relevant techniques. Section 3 presents details of our proposed method. The experimental results and analysis are presented in Section 4. Finally, the research is concluded in Section 5.

Imbalanced Data
Imbalanced data are data in which the number of instances in the group of interest is much smaller than in the other classes (2). The group with the larger number of instances is called the majority class or negative class, whereas the group with the smallest number of instances is called the minority class or positive class (8). When classifying imbalanced data, the decision boundary found by standard algorithms tends to be biased toward the majority class and to misclassify the minority class, as illustrated in Fig. 2.
Characteristics of imbalanced data that can influence classification algorithms (9) are divided into three cases as follows:
(a) Imbalanced Ratio
Imbalanced data can be characterized by the degree of imbalance, which is the ratio between the number of instances in the majority class (n_majority) and the number of instances in the minority class (n_minority) (10). The imbalanced ratio (IR) is defined by Eq. (1):
$$\mathrm{IR} = \frac{n_{majority}}{n_{minority}} \qquad (1)$$
(b) Lack of data
This problem occurs when the sample size of the minority class is too small (9). A small sample size makes it difficult to find patterns.
(c) Overlapping ratio between classes
Overlap occurs when the classes share a region of the feature space. Overlap that occurs in conjunction with imbalanced data results in a more complex situation for classification (11). The maximum Fisher's discriminant ratio (F1) is one measure of the overlapping ratio. F1 is defined by Eq. (2).
The methods for solving the imbalanced classification problem (5,12) can be divided into three approaches as follows:
(a) Data Level Approaches
This approach solves the problem in a pre-processing stage by rebalancing the class distribution using sampling techniques: oversampling, undersampling, and hybrid techniques.
(b) Algorithm Level Approaches
This approach attempts to adapt existing algorithms by adjusting their parameters.
(c) Cost Sensitive Approaches
This approach combines data level approaches, by adding a special cost to misclassification, with algorithm level approaches, by modifying the algorithms to favor the classification that leads to fewer errors.
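As a minimal sketch of the two data characteristics named above, the following Python computes the imbalanced ratio from class counts and the maximum Fisher's discriminant ratio for two classes. The F1 form used here, the maximum over features of (μ_a − μ_b)² / (σ_a² + σ_b²), is the one commonly given in the data-complexity literature and is an assumption rather than a transcription of Eq. (2). The function names and the toy data are illustrative.

```python
from statistics import mean, pvariance

def imbalanced_ratio(labels, majority, minority):
    """IR = n_majority / n_minority."""
    n_maj = sum(1 for y in labels if y == majority)
    n_min = sum(1 for y in labels if y == minority)
    return n_maj / n_min

def max_fisher_ratio(X, labels, class_a, class_b):
    """Maximum over features of (mu_a - mu_b)^2 / (var_a + var_b) (assumed form)."""
    best = 0.0
    for j in range(len(X[0])):
        a = [x[j] for x, y in zip(X, labels) if y == class_a]
        b = [x[j] for x, y in zip(X, labels) if y == class_b]
        denom = pvariance(a) + pvariance(b)
        if denom == 0:
            continue  # skip features with zero within-class variance (avoids division by zero)
        best = max(best, (mean(a) - mean(b)) ** 2 / denom)
    return best

# Toy data: 6 majority (0) and 2 minority (1) instances with two features
X = [[0.0, 1.0], [1.0, 1.2], [2.0, 0.8], [0.0, 1.1], [1.0, 0.9], [2.0, 1.0],
     [5.0, 1.0], [7.0, 1.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

ir = imbalanced_ratio(y, majority=0, minority=1)
f1 = max_fisher_ratio(X, y, class_a=0, class_b=1)
print(ir, f1)
```

A larger F1 value means that at least one feature separates the classes well, so the overlap is smaller. In the toy data the first feature separates the classes clearly while the second hardly does.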

Ensemble Learning
A single model can achieve high classification performance, but it relies on a fixed set of parameters, which causes bias. This bias can be reduced through ensemble learning.
The performance of ensemble learning depends on the precision of its classifiers. In ensemble classification learning, multiple classifiers learn the original dataset together. Their results are combined and then used to classify unknown data. The process of ensemble learning is shown in Fig. 3 (13). Ensemble learning can be divided into three approaches as follows:
(a) Boosting method
The boosting method (14) is an ensemble classification in which each classifier has a weight derived from the precision of its learning. The resulting models are used to predict unknown data by majority vote. A popular technique is AdaBoost.
(b) Bagging method
The bagging method (15) builds the models from the same learning algorithm, but each model learns from different instances. This method also uses majority vote to predict unknown data. A popular technique is Bag.
(c) Random subspace method
The random subspace method, or attribute bagging (16), learns from the same dataset but samples the features without replacement. This method also uses majority vote to predict unknown data.
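The bagging idea described above (same learning algorithm, different instances, majority vote) can be illustrated with a small self-contained Python sketch. It assumes a one-feature decision stump as the base learner in place of a full decision tree, and the toy data and function names are hypothetical. It is meant to show the mechanism only, not the Bag model used in the experiments.

```python
import random
from collections import Counter

def train_stump(data):
    """Fit a one-feature threshold classifier (a depth-1 decision tree)."""
    best = None
    for threshold in sorted({x for x, _ in data}):
        for low_label, high_label in ((0, 1), (1, 0)):
            errors = sum(
                (low_label if x <= threshold else high_label) != y
                for x, y in data
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low_label, high_label)
    _, threshold, low_label, high_label = best
    return lambda x: low_label if x <= threshold else high_label

def bagging(data, n_models, rng):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models

def majority_vote(models, x):
    """Combine the individual predictions by simple majority."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Toy one-feature data: class 0 at small values, class 1 at large values
data = [(0.1, 0), (0.4, 0), (0.5, 0), (0.9, 0), (1.1, 0),
        (2.9, 1), (3.2, 1), (3.5, 1), (3.8, 1), (4.0, 1)]
ensemble = bagging(data, n_models=11, rng=rng)
predictions = [majority_vote(ensemble, x) for x in (0.3, 3.6)]
print(predictions)
```

An odd number of models is used so that a two-class vote cannot tie. Boosting differs in that the models are trained sequentially and each carries a weight, which this sketch does not attempt to show.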

Cost Sensitive Learning
Cost-sensitive learning takes the cost of misclassification into account. The penalties for misclassification are arranged in a cost matrix, as shown in Table 1.
In Table 1, let C(i,j) be the cost of predicting a sample of class i as class j. C(0,0) and C(1,1) are the costs of correct classification and are set to 0. C(0,1) is the cost of misclassifying a majority class instance as minority and is set to 1. C(1,0) is the cost of misclassifying a minority class instance as majority. This cost is C_minority, which can be adjusted according to the specific algorithm.
The most important issue in solving the imbalanced data classification problem is to recognize the positive instances (minority class) correctly, rather than the negative instances (majority class). Therefore, the cost of misclassifying the minority class must be higher than the cost of misclassifying the majority class (C(1,0) > C(0,1)).
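To make the effect of the cost matrix concrete, the sketch below builds C as described above and applies the usual minimum-expected-cost decision rule to a class-probability estimate. The decision rule is an assumption for illustration, since the text does not state how each ensemble algorithm consumes the matrix. The probabilities and the value C_minority = 10 are hypothetical.

```python
def cost_matrix(c_minority, c_majority=1.0):
    """C[i][j] is the cost of predicting class j for a sample of class i.
    Class 0 is the majority (negative) class, class 1 the minority (positive)."""
    return [[0.0, c_majority],
            [c_minority, 0.0]]

def min_expected_cost_class(posterior, C):
    """posterior[i] is the estimated probability that the sample is class i.
    Return the predicted class j with the lowest expected cost."""
    expected = [sum(posterior[i] * C[i][j] for i in range(2)) for j in range(2)]
    return min(range(2), key=lambda j: expected[j])

C = cost_matrix(c_minority=10.0)

# A sample that the model believes is 80% majority and 20% minority
posterior = [0.8, 0.2]
plain = min_expected_cost_class(posterior, cost_matrix(c_minority=1.0))
weighted = min_expected_cost_class(posterior, C)
print(plain, weighted)
```

With equal costs the sample is assigned to the majority class, whereas with C(1,0) = 10 the expected cost of missing a minority instance (0.2 × 10) exceeds the cost of a false alarm (0.8 × 1), so the prediction flips to the minority class.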

Methodology
In this research, our main objective is to find a classification model that solves the imbalanced data classification problem with high accuracy and efficiency. Our concern is to improve imbalanced classification at different imbalanced ratios and overlapping ratios between classes. We apply decision tree ensemble learning, using both bagging and boosting techniques, to build models, and we compensate for misclassification with cost-sensitive learning by building a cost matrix and using its values to adjust the parameters of the ensemble learning. We also find the optimal number of trees by visualization. The framework and its steps are shown in Fig. 4. Each stage is explained in detail as follows:
(a) Building the Model Templates
This stage builds the model templates with the following two steps:
1. Generate numeric synthetic data: 1000 instances with a normal distribution, slightly overlapped, with an imbalance ratio of 10%.
(b) Building the System Model
This stage builds the system model with the following six steps:
1. Normalize the features to zero mean and a standard deviation of 1.
2. Sample the data using stratified sampling to draw samples from the imbalanced data at several imbalance ratios. We call the sampled data DB1.
3. Find the optimal cost value for the minority class by generating the cost matrix of misclassification costs. We set C_minority to the IR of DB1. The other values in the cost matrix are constants: C_majority equals 1, and C(0,0) and C(1,1) equal 0.
4. Match the model by analyzing the characteristics of DB1 and then choosing the appropriate model from the model templates.
5. Build the imbalanced classification model using the model template and the best value of C_minority. We initialize the number of decision trees to 200 and test the model by k-fold cross validation with k=5.
6. Find the optimal number of decision trees. We employ visualization to reduce the number of decision trees obtained from the ensemble learning to a suitable number. The visualization shows the test classification error for each decision tree, and we choose the top-10 decision trees with the minimum error.
(c) Test Performance of the Ensemble Model
The optimal ensemble model is tested to evaluate its performance using the evaluation measures.
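The first three steps of stage (b) can be sketched in plain Python as follows. This is one plausible reading of those steps, not the authors' implementation: the helper names, the per-class sampling scheme, the raw data and the target ratio of 1:25 are all assumptions made for illustration.

```python
import random
from statistics import mean, pstdev

def zscore(X):
    """Step 1: scale every feature to zero mean and unit standard deviation."""
    columns = list(zip(*X))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # guard against constant features
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in X]

def sample_to_ratio(X, y, ratio, rng, minority=1):
    """Step 2: draw from each class, without replacement, so that
    majority : minority = ratio : 1 (assumed sampling scheme)."""
    min_idx = [i for i, label in enumerate(y) if label == minority]
    maj_idx = [i for i, label in enumerate(y) if label != minority]
    n_min = min(len(min_idx), len(maj_idx) // ratio)
    chosen = rng.sample(min_idx, n_min) + rng.sample(maj_idx, n_min * ratio)
    return [X[i] for i in chosen], [y[i] for i in chosen]

def cost_matrix_from_ir(y, minority=1):
    """Step 3: set C_minority to the IR of the sample; C_majority = 1, diagonal = 0."""
    n_min = sum(1 for label in y if label == minority)
    ir = (len(y) - n_min) / n_min
    return [[0.0, 1.0], [ir, 0.0]]

rng = random.Random(42)
# Hypothetical raw data: 500 majority and 50 minority instances, two features
X = [[rng.gauss(0, 1), rng.gauss(5, 2)] for _ in range(500)] + \
    [[rng.gauss(2, 1), rng.gauss(7, 2)] for _ in range(50)]
y = [0] * 500 + [1] * 50

Xn = zscore(X)
X_db1, y_db1 = sample_to_ratio(Xn, y, ratio=25, rng=rng)  # plays the role of DB1
C = cost_matrix_from_ir(y_db1)
print(len(y_db1), C)
```

Here the sample keeps all 500 majority instances and 20 minority instances, so its IR is 25 and C(1,0) is set to 25, which mirrors the way C_minority follows the imbalanced ratio in the described procedure.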

Datasets
For our experiments, the proposed ensemble models were developed and applied to binary classification on the following datasets:
(a) Synthetic datasets
The synthetic datasets were created using Matlab. We created five datasets with a slight overlap and an initial imbalanced ratio of 10%, two classes and three features. Details of the synthetic datasets and the optimal model templates are given in Table 2, and one of the synthetic datasets is shown in Fig. 5.
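A dataset with the stated shape (1000 instances, two classes, three normally distributed features) could be generated along the following lines. The sketch is in Python rather than Matlab, and it assumes that an imbalanced ratio of 10% means that the minority class makes up 10% of the instances. The class means and the standard deviation are illustrative values, not those behind Table 2.

```python
import random

def make_synthetic(n_total=1000, minority_fraction=0.10, n_features=3,
                   majority_mean=0.0, minority_mean=2.5, std=1.0, seed=0):
    """Draw two normally distributed classes. All distribution parameters
    are illustrative assumptions, not the values used in the paper."""
    rng = random.Random(seed)
    n_min = round(n_total * minority_fraction)
    n_maj = n_total - n_min
    X, y = [], []
    for label, n, mu in ((0, n_maj, majority_mean), (1, n_min, minority_mean)):
        for _ in range(n):
            X.append([rng.gauss(mu, std) for _ in range(n_features)])
            y.append(label)
    return X, y

X, y = make_synthetic()
print(len(X), y.count(0), y.count(1))
```

Moving `minority_mean` closer to `majority_mean` increases the overlap between the classes, which is the characteristic the model templates are meant to cover.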

Evaluation Metrics
To evaluate the effectiveness of the proposed method, we used a confusion matrix to show the accuracy of the classification and the reliability of the model. Details of the confusion matrix are given in Table 4.
In Table 4, each row of the matrix shows the number of actual instances of a class and each column shows the number of predicted instances of a class. The matrix is divided into the following four cases:
TP: number of instances that are correctly classified as the positive class.
TN: number of instances that are correctly classified as the negative class.
FP: number of negative class instances that are incorrectly classified as the positive class.
FN: number of positive class instances that are incorrectly classified as the negative class.
Specificity: This measure shows the precision of the classification model in classifying the negative class. It is defined as the ratio between the number of correctly classified negative instances and the total number of actual negative instances. The specificity is defined by Eq. (4):
$$\text{Specificity} = \text{TNRate} = \frac{TN}{TN + FP} \qquad (4)$$
Accuracy: This measure evaluates the overall performance of a model over all classes. The accuracy is defined by Eq. (5):
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5)$$
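The three measures used in this section follow directly from the four confusion-matrix counts. A minimal Python sketch is given below, using the standard definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP) and accuracy = (TP + TN) / (TP + TN + FP + FN). The labels and predictions are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN with `positive` as the minority (positive) class."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, tn, fp, fn

def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical labels: 4 positive (minority) and 10 negative (majority) instances
y_true = [1, 1, 1, 1] + [0] * 10
y_pred = [1, 1, 1, 0] + [0] * 8 + [1] * 2

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
se = sensitivity(tp, fn)
sp = specificity(tn, fp)
acc = accuracy(tp, tn, fp, fn)
print(se, sp, acc)
```

Reporting sensitivity next to accuracy is what exposes poor minority-class recognition, since accuracy is dominated by the majority class.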

Results and Analyses
In this section, we present the results of evaluating our proposed model on six binary benchmark imbalanced datasets. We perform stratified sampling to draw samples from the imbalanced datasets at different imbalanced ratios and then analyze the characteristics of the data to find the suitable model among the model templates. The suitable model templates are given in Table 5.
We derive C_minority from the imbalanced ratio (IR) and then use it when building the ensemble model. The initial number of decision trees is 200, and this number is then reduced to the optimal number of decision trees using visualization.
Table 5. Optimal model templates for benchmark datasets.
Fig. 6 shows the decrease in the number of decision trees for the yeast dataset, adjusted to the smallest error rate. The optimal number of decision trees is 30.
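Reading the optimal number of trees off an error plot can also be expressed programmatically. The sketch below assumes that the test error has been recorded for the ensemble truncated to its first k trees, for every k, and returns the smallest k that reaches the minimum error. The error values are hypothetical, and this is only an analogue of the visual inspection described here.

```python
def optimal_tree_count(errors):
    """errors[k - 1] is the test error of the ensemble truncated to its first k trees.
    Return the smallest k that reaches the minimum error."""
    best = min(errors)
    return errors.index(best) + 1

# Hypothetical error curve for an ensemble of 8 trees
errors = [0.20, 0.15, 0.12, 0.10, 0.10, 0.11, 0.10, 0.12]
k = optimal_tree_count(errors)
print(k)
```

Choosing the smallest k among ties keeps the ensemble as small as possible without giving up accuracy.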
From the optimal model templates shown in Table 5, we performed experiments on the benchmark datasets with these models. We also performed experiments with the remaining model templates as a baseline for comparison. We experimented with three imbalanced ratios: 1:10, 1:25 and 1:50. C_minority is set equal to each imbalanced ratio. The performance of the models in terms of sensitivity (SE), specificity (SP) and accuracy (Acc) at these three ratios is given in Table 6, Table 7 and Table 8, respectively. The results of the experiments in Table 6, Table 7 and Table 8 show that some model templates are unable to classify efficiently when the imbalanced ratio increases. For example, on the segment dataset the Bag model classifies correctly only at an imbalanced ratio of 1:10, and the AdaBoostM1, TotalBoost and LogitBoost models classify correctly up to an imbalanced ratio of 1:25. On the shuttle dataset, as the imbalanced ratio increases, our proposed method with the chosen model template, LogitBoost, improves the classification of imbalanced data significantly.
The comparison with the results reported by Krawczyk et al. (6) is shown in Table 9. The performance of the proposed method is quite satisfactory, especially on the datasets with imbalance ratios of 1:25 and 1:50. For an imbalance ratio of 1:10, the proposed method was statistically better on three of the imbalanced datasets, while the competing method was statistically better on the other three. For an imbalance ratio of 1:25, the proposed method was statistically better on all six imbalanced datasets. For an imbalance ratio of 1:50, the proposed method was statistically better on five of the imbalanced datasets, while the competing method was statistically better on the remaining one.

Conclusions
Imbalanced data classification is a significant challenge for standard machine learning algorithms. In this research, we propose a method to deal with imbalanced data classification problems, with the main focus on improving recognition of the minority class. We combine cost-sensitive learning with ensemble decision tree classification using bagging and boosting techniques. The number of decision trees is also reduced to an optimal number by visualization. We create normally distributed synthetic data with binary classes, three features and 1000 instances. Then, we build the model templates from five algorithms: AdaBoostM1, Bag, TotalBoost, LogitBoost and RUSBoost. We analyze the standard datasets and select the best model template by considering the overlap between classes and the mean and standard deviation of the minority and majority classes, choosing the template whose values are closest. The experiments show that the overlapping ratio between classes affects the performance of the proposed model. If a dataset has overlap between classes, the model can classify correctly at imbalanced ratios not over 25. The appropriate ensemble technique is boosting, such as RUSBoost, LogitBoost, TotalBoost, and AdaBoostM1. The Bag model classifies correctly at imbalanced ratios not over 10. The best model is RUSBoost, which can classify imbalanced data that have overlap between classes and a high imbalanced ratio.
Fig. 1. Comparison of classification between the two datasets of balanced data and unbalanced data.

Fig. 2. Linear classification of imbalanced data, which is biased toward the majority class.
Table 1. Cost matrix C for binary classification.

Fig. 4. The framework of steps for building the imbalanced classification models.
(a) Sensitivity or True Positive Rate (TPRate) or Recall
This measure shows the ability of the model to classify the positive class. It is defined as the ratio between the number of correctly classified positive instances and the total number of actual positive instances. The sensitivity is defined by Eq. (3):
$$\text{Sensitivity} = \text{TPRate} = \text{Recall} = \frac{TP}{TP + FN} \qquad (3)$$
(b) Specificity or True Negative Rate (TNRate)

Fig. 6. Classification error on the yeast dataset with a varying number of decision trees.

Table 2. Details of the synthetic datasets and optimal model templates.

Table 3. Details of the datasets used in the experiments.

Table 4. Confusion matrix for binary classification.

Table 6. Classification results for benchmark datasets with an imbalanced ratio of 1:10; the best value of C_minority is 10.

Table 7. Classification results for benchmark datasets with an imbalanced ratio of 1:25; the best value of C_minority is 25.

Table 8. Classification results for benchmark datasets with an imbalanced ratio of 1:50; the best value of C_minority is 50.

Table 9. Comparison between the best results of our proposed method and those of the method proposed by Krawczyk et al. (6).

The results in Table 6, Table 7 and Table 8 show that some chosen model templates classify efficiently, such as the RUSBoost model for the yeast and pima datasets and the TotalBoost model for the page-blocks dataset. For the vehicle dataset, the LogitBoost model template that we chose is inappropriate, whereas the RUSBoost model classifies efficiently. The remaining datasets show that only some of the model templates classify efficiently, and that some model templates that were not chosen, such as the RUSBoost model for the shuttle and page-blocks datasets, are able to classify efficiently.