Automatic Advertisement System using Text Mining Technique

Advertising now takes many forms, and Internet advertising is among the fastest growing. Because visitors to a website behave differently, many websites are developed to meet the needs of different audiences, and a website that attracts many visitors gains a higher rating. However, if the advertisements shown on a website do not relate to the interests of its visitors, they may not be effective. This work therefore presents a text mining method that suggests online advertisements relevant to the content of the website. In experiments with the Random Forest and Bayesian Network classifiers, Random Forest performed better than the Bayesian Network, reaching an accuracy of 83.97%.


Introduction
Internet advertising is growing rapidly. Many websites display advertisements related to their content through services such as Google AdWords. The cost is charged for each user click, a model called pay per click (PPC). When an advertisement does not match the content of the website, users are unlikely to be interested in it.
Many studies have addressed text mining techniques. Un (1) proposed text mining with information extraction, called DISCOTEXT. This algorithm discovers patterns in extracted text using strict string matching. Jacopo et al. (2) proposed text mining in computational advertising using n-grams, topic models, and sentiment analysis. Vishal (3) presented a survey of text mining classification. Text mining is similar to data mining, but it can work with unstructured data sets such as emails, documents, and HTML. Hans (4) proposed decision trees and text mining techniques for extending taxonomies for ontologies. Xiao (5) applied text mining to music mood classification using a collection of 8,839 songs with tags and lyrics. Bjornar and Chinatsu (6) proposed fast and effective text mining using k-means linear-time document clustering.
The previous research shows that text mining is used to classify documents for different objectives and can handle unstructured data such as HTML. This research focuses on website content and extracts the characteristics of a website using text mining techniques. It also compares the accuracy rates of selected classification algorithms in order to create the automatic advertisement system.

Automatic Advertisement System using Text Mining Technique
The system consists of 4 steps: pre-processing, feature extraction, model generation, and evaluation, as shown in Figure 1.

Pre-Processing
The system first collects the content of each webpage. The training set consists of 1,600 webpages, labeled and divided into 8 classes (200 pages per class): immovable property, sport, car, computer, education, games, health, and travel.

2.1.1
HTML tag removal strips HTML tags such as <body>, <head>, and <html>. These tags affect the word frequencies of the content, but they do not represent the characteristics of any class.

DOI: 10.12792/icisip2015.044
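The tag removal step can be sketched in Java as follows. This is a minimal illustration rather than the paper's implementation: the class name `HtmlTagRemover` and its patterns are assumptions, and a regular expression is only an approximation that a production system would likely replace with a real HTML parser.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not from the paper): strips HTML markup from a page. */
final class HtmlTagRemover {
    // <script> and <style> elements are dropped together with their bodies,
    // because their contents are code, not page text.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag, such as <body>, </head>, or <a href="...">.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private HtmlTagRemover() {}

    /** Returns the page text with tags removed and whitespace collapsed. */
    static String strip(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

Each tag is replaced by a space rather than an empty string so that words separated only by markup are not glued together.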

2.1.2
Word segmentation splits the content from sentences into words. This research focuses on the Thai language. The best algorithm for Thai segmentation is the Longest Word Matching algorithm (7), as shown in Figure 2.
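A minimal sketch of dictionary-based longest word matching is given below. It assumes a greedy left-to-right scan over a caller-supplied dictionary; the class `LongestWordMatcher` and the rule that an unknown character becomes a one-character token are illustrative choices, and the algorithm in (7) may differ in details such as backtracking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of greedy longest word matching (not the paper's code). */
final class LongestWordMatcher {
    private LongestWordMatcher() {}

    /** Splits text without spaces into dictionary words, preferring the longest match. */
    static List<String> segment(String text, Set<String> dictionary) {
        int maxLen = 0;
        for (String word : dictionary) {
            maxLen = Math.max(maxLen, word.length());
        }
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxLen);
            String match = null;
            // Try the longest candidate first and shrink until a dictionary word is found.
            for (int e = end; e > pos; e--) {
                String candidate = text.substring(pos, e);
                if (dictionary.contains(candidate)) {
                    match = candidate;
                    break;
                }
            }
            if (match == null) {
                // Assumption: a character that starts no dictionary word is its own token.
                match = text.substring(pos, pos + 1);
            }
            tokens.add(match);
            pos += match.length();
        }
        return tokens;
    }
}
```

As a hypothetical illustration, with a dictionary containing รถ, รถยนต์, ราคา, and ถูก, the input รถยนต์ราคาถูก is split into รถยนต์, ราคา, and ถูก, because the longer entry รถยนต์ is preferred over รถ.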

2.1.3
Stop word removal deletes unnecessary words that do not affect the meaning of the content but occur so often that they distort the frequency-based feature extraction. Stop words are prepositions, conjunctions, pronouns, adverbs, and interjections. Frequent examples are "that" and "which".
After this process, the content is reduced to clean words, with the HTML tags and stop words removed.
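The stop word step can be sketched as a simple filter over the segmented tokens. The class `StopWordFilter` and the choice to lower-case tokens before the lookup are assumptions for illustration; the stop word list itself is supplied by the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of stop word removal (not the paper's code). */
final class StopWordFilter {
    private StopWordFilter() {}

    /** Returns the tokens that are neither blank nor in the stop word set. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Normalize so that "That" and "that" are treated as the same word.
            String normalized = token.trim().toLowerCase(Locale.ROOT);
            if (!normalized.isEmpty() && !stopWords.contains(normalized)) {
                kept.add(normalized);
            }
        }
        return kept;
    }
}
```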

Feature Extraction
This process converts the words into features. First, the frequency of every word across all classes is counted, and the top 100 words are selected to form the feature set, called the bag of words. The feature weights can be calculated in several ways, as described in the following subsections.

Fig. 2. Longest Word Matching Algorithm.
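The bag-of-words selection can be sketched as below. The sketch assumes the top words are chosen by total frequency over all documents of all classes, with ties broken alphabetically; the source does not state a tie rule, and the class `BagOfWords` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of top-N vocabulary selection (not the paper's code). */
final class BagOfWords {
    private BagOfWords() {}

    /** Returns the topN most frequent words over all documents, most frequent first. */
    static List<String> topWords(List<List<String>> documents, int topN) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> document : documents) {
            for (String word : document) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by descending count; ties are broken alphabetically (an assumption).
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

Calling `BagOfWords.topWords(documents, 100)` would correspond to the 100-word feature set described above.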

2.2.1
Boolean weighting is calculated from equation (1).

2.2.6
Entropy weighting is calculated from equation (6).

After this process, the features have been selected by word frequency and their weights have been calculated. The next step is to generate the model by classification.
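The three weights compared in the experiments (Boolean, TF, and TF*IDF) can be sketched as follows. Because the paper's equations are not reproduced here, the sketch uses common textbook definitions as an assumption: Boolean is 1 when the word occurs, TF is the raw count, and TF*IDF is the count multiplied by ln(N / df), where N is the number of documents and df is the number of documents containing the word. The paper's own formulas may differ, and the class `FeatureWeights` is a hypothetical name.

```java
import java.util.List;

/** Hypothetical sketch of Boolean, TF, and TF*IDF weights (textbook definitions). */
final class FeatureWeights {
    private FeatureWeights() {}

    /** Raw term frequency of each vocabulary word in one document. */
    static double[] termFrequency(List<String> document, List<String> vocabulary) {
        double[] tf = new double[vocabulary.size()];
        for (String word : document) {
            // Linear lookup is acceptable for a vocabulary of about 100 words.
            int index = vocabulary.indexOf(word);
            if (index >= 0) {
                tf[index] += 1.0;
            }
        }
        return tf;
    }

    /** Boolean weight: 1 if the word occurs in the document, otherwise 0. */
    static double[] booleanWeight(List<String> document, List<String> vocabulary) {
        double[] tf = termFrequency(document, vocabulary);
        double[] weights = new double[tf.length];
        for (int i = 0; i < tf.length; i++) {
            weights[i] = tf[i] > 0 ? 1.0 : 0.0;
        }
        return weights;
    }

    /** TF*IDF weight per document: tf * ln(N / df), or 0 when df is 0. */
    static double[][] tfIdf(List<List<String>> documents, List<String> vocabulary) {
        int n = documents.size();
        int[] df = new int[vocabulary.size()];
        double[][] weights = new double[n][];
        for (int d = 0; d < n; d++) {
            weights[d] = termFrequency(documents.get(d), vocabulary);
            for (int i = 0; i < df.length; i++) {
                if (weights[d][i] > 0) {
                    df[i]++;
                }
            }
        }
        for (int d = 0; d < n; d++) {
            for (int i = 0; i < df.length; i++) {
                weights[d][i] = df[i] == 0 ? 0.0 : weights[d][i] * Math.log((double) n / df[i]);
            }
        }
        return weights;
    }
}
```

Under these definitions, a word that appears in every document receives a TF*IDF weight of 0, since ln(N / N) = 0.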

Model Generating
Text mining classifiers fall into three groups: unsupervised, supervised, and semi-supervised. Examples include the Bayesian network classifier, Decision Tree, K-nearest neighbor (KNN), Support Vector Machines (SVMs), and Neural Network. This research selects and compares 2 classifiers: the Bayesian network classifier and the Random Forest classifier.

2.3.1
The Bayesian classifier (8) is based on prior probability and conditional probability. Its advantages are that it requires only a small amount of training data and that it is robust to missing data. The Bayesian classifier is calculated from equation (7):

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad (7)$$

2.3.2
The Random Forest classifier (9) works as follows:
1. N is the number of instances, and M is the number of attributes.
2. m is the number of input variables used to determine the decision at a node of the tree; m is less than M.
3. The training set for each tree is chosen by sampling N times with replacement from all N instances.
4. At each node of the tree, m variables are chosen at random, and the decision at that node is based on them.
Each tree is fully grown and not pruned.
The advantages of random forests are that they produce a high accuracy rate, handle a large number of attributes, and run fast.
The main objective is to compare an algorithm that requires only a small amount of training data with an algorithm that handles a large number of attributes.
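The two random choices in the random forest procedure, sampling N instances with replacement and picking m of the M attributes at a node, can be sketched as follows. This covers only the sampling steps, not tree construction; the class `RandomForestSampling` and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of the sampling steps of a random forest (not the paper's code). */
final class RandomForestSampling {
    private RandomForestSampling() {}

    /** Draws n instance indices with replacement from 0..n-1 (one bootstrap sample per tree). */
    static int[] bootstrapSample(int n, Random rng) {
        int[] indices = new int[n];
        for (int i = 0; i < n; i++) {
            indices[i] = rng.nextInt(n);
        }
        return indices;
    }

    /** Picks m distinct attribute indices out of totalAttributes, with m less than totalAttributes. */
    static int[] randomAttributes(int totalAttributes, int m, Random rng) {
        if (m <= 0 || m >= totalAttributes) {
            throw new IllegalArgumentException("require 0 < m < M");
        }
        int[] all = new int[totalAttributes];
        for (int i = 0; i < totalAttributes; i++) {
            all[i] = i;
        }
        // Partial Fisher-Yates shuffle: only the first m positions need to be randomized.
        for (int i = 0; i < m; i++) {
            int j = i + rng.nextInt(totalAttributes - i);
            int tmp = all[i];
            all[i] = all[j];
            all[j] = tmp;
        }
        return Arrays.copyOf(all, m);
    }
}
```

Because the bootstrap sample is drawn with replacement, some instances appear several times in a tree's training set and others not at all, which is what makes the trees differ from one another.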

Evaluation
This paper tests the data using K-fold cross validation and hold-out cross validation, and evaluates the models using the accuracy rate calculated as in Table 1.
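Assuming the usual definition of accuracy, the number of correctly classified instances divided by the total number of instances, the accuracy rate can be computed from a confusion matrix as sketched below. The class `Evaluation` is a hypothetical name, and rows are assumed to be actual classes and columns predicted classes.

```java
/** Hypothetical sketch: accuracy rate from a square confusion matrix. */
final class Evaluation {
    private Evaluation() {}

    /** Returns (sum of the diagonal) / (sum of all cells). */
    static double accuracy(int[][] confusion) {
        long correct = 0;
        long total = 0;
        for (int actual = 0; actual < confusion.length; actual++) {
            for (int predicted = 0; predicted < confusion[actual].length; predicted++) {
                total += confusion[actual][predicted];
                if (actual == predicted) {
                    // Diagonal cells count instances whose predicted class equals the actual class.
                    correct += confusion[actual][predicted];
                }
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("confusion matrix is empty");
        }
        return (double) correct / total;
    }
}
```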

Experimental Result
This research collects 1,600 webpages from a web directory, divided into 8 classes (200 webpages per class), and divides the evaluation into 5 experiments. The algorithms are developed and tested in Java.

3.1.1
Experimental time is divided into 3 steps, as shown in Table 2. This research tested three popular feature weights: 1) Term Frequency (TF), 2) Term Frequency-Inverse Document Frequency (TF*IDF), and 3) Boolean.

3.1.2
The accuracy rate is tested by hold-out cross validation. The confusion matrices of the Bayesian classifier are shown in Table 3(a-c) and Figure 3, and those of the Random Forest classifier are shown in Table 4(a-c) and Figure 4.

Table 3(a). Confusion matrix of Bayesian Network using TF.

3.1.3
The accuracy rate is tested by K-fold cross validation. This research uses k = 5, as shown in Tables 5 and 6. The comparison between the Bayesian classifier and Random Forest is shown in Figure 5.
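The fold assignment for K-fold cross validation can be sketched as follows. The sketch shuffles the instance indices with a fixed seed and deals them into k folds. Whether the paper's folds were stratified by class is not stated, so this plain, unstratified split is an assumption, as is the class name `KFoldSplitter`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of an unstratified K-fold split (not the paper's code). */
final class KFoldSplitter {
    private KFoldSplitter() {}

    /** Returns k folds of shuffled instance indices; fold sizes differ by at most one. */
    static List<List<Integer>> split(int n, int k, long seed) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("require 2 <= k <= n");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        // Deal the shuffled indices round-robin so every instance lands in exactly one fold.
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }
}
```

With 1,600 webpages and k = 5, each fold holds 320 pages; in each round one fold serves as the test set and the other four as the training set.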

Discussion and Conclusions
The experimental results can be summarized as follows. With hold-out cross validation, the highest accuracy of the Bayesian Network is 72.81%, obtained with the Term Frequency (TF) weight, and the highest accuracy of Random Forest is 80.93%, obtained with the Term Frequency-Inverse Document Frequency (TF*IDF) weight.
With K-fold cross validation, the highest accuracy of the Bayesian Network is 71.94%, obtained with the Boolean weight, and the highest accuracy of Random Forest is 89.75%, also obtained with the Boolean weight.
However, the paired t-test results show that the choice of feature weight does not significantly affect the accuracy rate, whereas the choice of algorithm does.
Random Forest performs better than the Bayesian Network classifier because of the large number of bag-of-words features. Future work will incorporate sentiment analysis of social networks such as Facebook, whose varied content is difficult to classify.
In equation (7), P(A) is the prior probability of hypothesis A, P(B) is the prior probability of training data B, P(A|B) is the probability of A given B, and P(B|A) is the probability of B given A.

Figure 3. The comparison of accuracy of various feature weights.

Table 2. Experimental time.

Table 3(c). Confusion matrix of Bayesian Network using Boolean.

Table 4(a). Confusion matrix of Random Forest using TF.

Table 4(b). Confusion matrix of Random Forest using TF*IDF.

Table 4(c). Confusion matrix of Random Forest using Boolean.

Table 5. Accuracy of Bayesian Classifier by K-Fold Cross Validation.

The test statistics for the significance of the feature weights use the accuracy rates from the K-fold cross validation experiment. The significance level (α) is 0.05. The paired t-test results are shown in Table 7 and Table 8.
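The paired t statistic over per-fold accuracies can be sketched as follows, assuming the standard definition t = mean(d) / (sd(d) / sqrt(n)), where d holds the paired differences and n is the number of pairs. The class `PairedTTest` is a hypothetical name. The sketch returns only the statistic, because the Java standard library has no t-distribution for computing a p-value.

```java
/** Hypothetical sketch of the paired t statistic (not the paper's code). */
final class PairedTTest {
    private PairedTTest() {}

    /** Returns t for paired samples a and b; the degrees of freedom are a.length - 1. */
    static double tStatistic(double[] a, double[] b) {
        if (a.length != b.length || a.length < 2) {
            throw new IllegalArgumentException("need two samples of equal length >= 2");
        }
        int n = a.length;
        double meanDiff = 0.0;
        for (int i = 0; i < n; i++) {
            meanDiff += a[i] - b[i];
        }
        meanDiff /= n;
        double sumSquares = 0.0;
        for (int i = 0; i < n; i++) {
            double deviation = (a[i] - b[i]) - meanDiff;
            sumSquares += deviation * deviation;
        }
        // Sample standard deviation of the differences (divisor n - 1).
        double sd = Math.sqrt(sumSquares / (n - 1));
        // If all differences are identical, sd is 0 and the result is infinite or NaN.
        return meanDiff / (sd / Math.sqrt(n));
    }
}
```

With k = 5 folds there are 4 degrees of freedom, and the two-sided critical value at α = 0.05 is about 2.776. An absolute t above that value corresponds to a p-value below 0.05.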

In Table 9, the p-value is less than the significance level, which means that the choice of algorithm has a significant effect on the accuracy.