Mining of Thai Politics on Facebook Status Updates

Social networking websites are now widely used for communication, and a multitude of websites are available for sharing ideas and knowledge. Facebook is the most popular social networking website in the world. Thai people widely use Facebook status updates to discuss all types of topics: political issues, religious issues, technology, education, etc. These discussions express their opinions in the form of text, and opinion is a very important part of making any decision. This paper proposes to mine the opinion of Thai people about the current government revolution. The opinions were extracted from Facebook status updates written in the Thai language. Two feature extraction methods were implemented. First, the traditional pre-processing of text mining was used to extract the features. Second, positive and negative words were collected to construct a sentiment lexicon from which features were extracted. Comparative experiments were performed among Naïve Bayes, SVM (Support Vector Machines), KNN (K-Nearest-Neighbor) and decision trees. The experimental results showed that KNN gave the highest accuracy when the lexicon was used for feature extraction.


Introduction
Social networking websites are widely used for communication with people around the world. These websites are a very powerful medium for sharing ideas and knowledge. Moreover, they are used as spaces for expressing opinions. Currently, there are many social networking websites, such as Facebook, Twitter, and LinkedIn. Facebook is the most popular of them, with an estimated 900,000,000 unique monthly visitors (1). Facebook has become a communication tool in people's daily life. People can write their perceptions on any discussion topic, such as political issues, religious issues, technology, products, movie reviews, etc. (2). This open style encourages people to express their perceptions naturally (3). As a result, the data available on Facebook has grown exponentially and become very useful, and many researchers now use Facebook as a source of knowledge instead of paper-based surveys.
In Thailand, Facebook is the most popular social networking website. Most Thai people share and express their opinions, knowledge, and experience using short status updates on Facebook. These status updates are a very useful dataset for retrieving users' knowledge or opinions on a particular topic, which makes Facebook a good source of information about Thai people. This paper aims to mine the opinion of Thai people about the current government revolution from Facebook status updates written in Thai. First, the status updates are retrieved from the Facebook website by using the Facebook Graph API. Two feature extraction methods were implemented. In the first method, the status updates are pre-processed by the traditional pre-processing steps of text mining to extract features. The second method extracts features from a lexicon, which requires a dictionary of words, each annotated with its semantic orientation. The positive words and negative words are collected in the lexicon. Then comparative experiments are performed among Naïve Bayes (4), SVM (5), KNN (6) and decision trees (7) to discover the best classifier for mining opinions on Thai politics.
The remainder of the paper is organized as follows. Related works are reviewed in section 2. The proposed methodology is explained in section 3. The experimental results are shown in section 4. Finally, the conclusion is given in section 5.

DOI: 10.12792/icisip2015.046

Related Works
Opinion mining, or sentiment analysis, was proposed to determine the opinions, attitudes and emotions of people toward a topic (8). The information retrieved by opinion mining is very useful for making decisions in many fields. For example, a government may want to know the opinion of the people before enacting policies. Opinion mining has become a popular area of study on social networking websites because users are free to express their opinions through social networking functions. Researchers have tried to extract opinions in different ways: natural language processing, text analysis and computational linguistics.
Akaichi et al. (9,10) proposed sentiment analysis on Facebook status updates by using Naïve Bayes and SVM. The purpose of the research was to identify the nature of the status updates and to link them to behaviors and sentiment characteristics. The sentiment classification is performed on 260 Facebook status updates posted by Tunisian users during the revolution. Three types of lexicon were developed to capture the language of online social networks. First, the acronyms' lexicon contains the most used acronyms, such as lol, gr8, cu. Second, the emoticons' lexicon contains the most used emoticons on Facebook, such as ;), etc. Third, the interjections' lexicon contains Wow, Haha, Oh dear, etc. Each lexicon entry was then labeled as either positive or negative. Next, feature extraction was performed to transform the original textual dataset into a structured dataset. In this step, stop word removal and stemming were done to reduce the search space. The text is represented by the occurrence of words. Various n-gram patterns and part-of-speech tags are used as features to classify sentiment. Next, the dataset is labeled with two classes, positive and negative. 60% of the pre-processed dataset is used to train a model and 40% is used to test it. The n-gram patterns were evaluated with 10-fold cross validation.
Ortigosa et al. (11) proposed a new method for analyzing sentiment on Facebook status updates written in Spanish. They implemented the method in SentBuk to retrieve status updates from Facebook and classify them. The method is a combination of lexicon-based and machine learning techniques. It tokenizes a message and assigns a score to each token using a dictionary of words, emoticons and interjections. The score is 1 if the token is positive, -1 if the token is negative, and 0 if the token is neutral. In this way, the research prepared the data with a lexicon-based step before applying machine learning techniques. The authors reported that the accuracy is strongly influenced by the context, the position of words, and figures of speech. The experiments classified the dataset with four different approaches: lexicon-based, decision trees with lexicon-based tagging, Naïve Bayes with lexicon-based tagging, and SVM with lexicon-based tagging. The experimental results showed that SVM with lexicon-based tagging gives an accuracy of 83.27% for sentiment analysis on the Facebook dataset when compared with a human judge.
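The per-token scoring described for SentBuk can be illustrated with a minimal sketch. The lexicon entries below are hypothetical English placeholders (the original system works on Spanish), the whitespace tokenizer is a simplification, and summing the token scores into a message polarity is an assumption made here for illustration, not a detail given by the source.

```python
# Hypothetical toy lexicon: entries are illustrative, not taken from SentBuk.
LEXICON = {"good": 1, "great": 1, ":)": 1, "bad": -1, "sad": -1, ":(": -1}


def score_tokens(message):
    """Assign 1 (positive), -1 (negative) or 0 (neutral) to each token.

    Tokens are split on whitespace, which is a simplification.
    """
    return [(token, LEXICON.get(token.lower(), 0)) for token in message.split()]


def message_polarity(message):
    """Sum the token scores (the aggregation rule is an assumption)."""
    return sum(score for _, score in score_tokens(message))


scores = score_tokens("Great day but sad news :)")
polarity = message_polarity("Great day but sad news :)")
```

In a hybrid setup such as the one described, token scores of this kind would serve as input features for a machine learning classifier rather than as the final decision.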
Troussas et al. (12) applied Naïve Bayes to classify people's feelings in order to assist language learning. Moreover, they compared Naïve Bayes with Rocchio and Perceptron on a unigram model. For the comparison, a dataset of around 7,000 status updates was collected from 90 users and manually labeled as positive or negative. The dataset is randomly split into 50% for training and 50% for testing. The three algorithms are evaluated in terms of precision, recall and F-score. In the evaluation, Rocchio performed better than the others, giving 75% precision, 73% recall and 74% F-measure.
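As a consistency check on these figures, the F-measure can be taken as the harmonic mean of precision P and recall R. This is the standard definition and is assumed here, since the source does not restate it.

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 0.75 \times 0.73}{0.75 + 0.73}
  = \frac{1.095}{1.48}
  \approx 0.74
```

Under that assumption, the reported 74% agrees with the reported precision and recall.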
Shrivatava and Pant (3) developed a classifier to extract three classes, GOOD, BAD, and AVERAGE, from Facebook status updates. A Facebook puller was implemented to collect 2,000 comments from Facebook. The collected dataset was then classified into the three classes by using a domain dictionary containing synonyms of the features. Next, the LIBSVM program (13) was applied to train the system and test its accuracy. The accuracy test showed an average accuracy of 70.5% on the Facebook dataset.
Jamoussi and Ameur (14) proposed determining the sentiment orientation of Facebook comments by using a linguistic approach. The paper grouped the words present in the corpus into two dictionaries, positive and negative, by exploiting emotion symbols. The dictionaries were then used to predict whether a Facebook comment is positive or negative. The dataset is collected by using the Facebook API and represented in XML. Only pages in the political category are selected to construct the corpus, which consists of 23 Tunisian pages, 5 French pages and 4 English pages. The corpus is decomposed into two disjoint corpora, a learning corpus and a test corpus. The learning corpus is divided into two subsets of comments: the first contains comments with emotion symbols, and the second contains the remaining comments, which have no emotion symbols. The emotion symbols are used to split a comment and to assign words to either the positive or the negative dictionary. The experimental results showed that the proposed method has an efficiency of 80.89% and a low error rate of 18.75%.
Irfan et al. (15) presented a survey of text mining on social networking websites. It studied not only the various text mining techniques but also the pre-processing of text mining. In addition, the paper provides several ideas for future research on social networking websites. For pre-processing, the paper concluded that there are three basic methods: feature extraction, feature selection and document representation. Feature extraction is divided into morphological analysis, syntactical analysis, and semantic analysis. Morphological analysis deals with word representation and consists of stop word removal and stemming. Syntactical analysis deals with the grammatical structure of the language, through part-of-speech tagging and parsing. Feature selection keeps only the important features in order to remove redundant information and reduce cost. The important features are mostly selected by scoring the words, where the score may be calculated from frequency, latent semantic indexing, or random mapping. The documents are then represented in a vector space model. For text classification, the paper grouped the approaches into machine learning based, ontology based, and hybrid approaches. From the survey, the authors conclude that the hybrid approach performs better than applying a single method. For future research, the paper suggests that interpreting emotions on social networks is a challenging issue because textual data on social networks is very large, noisy, dynamic and unstructured. Sentiment classification approaches are grouped into three groups. The first group is lexicon-based approaches, which classify by using the words of a lexicon. The second group is machine learning approaches, such as SVM, Naïve Bayes, Rocchio and so on. The third group is hybrid approaches, which combine lexicon-based and machine learning approaches. The hybrid approaches have better performance than the others.

The Proposed Methodology
Fig. 1 shows the overall process of the proposed methods. First, 467 status updates written in Thai were collected from Facebook between 22 May 2014 and 17 July 2014 by using the Facebook Graph API (16). The status updates concern Thai people's opinion of the current revolution in the country. Each status update was manually labeled as positive or negative by a human, giving 259 positive and 208 negative status updates. In addition, 102 positive and 101 negative words were collected from Facebook to construct a sentiment lexicon. After collecting the dataset and the lexicon, the following steps are performed.
Step 1 Cleaning: Before the pre-processing step, the dataset has to be cleaned because writing on social networking websites often contains misspellings and informal forms. Therefore, some words in the dataset are changed to their correct form to improve efficiency and effectiveness.
Step 2 Feature Extraction: Since the dataset is unstructured text, it has to be transformed into structured data in the form of a vector space model. The vectors are used to build a classifier. In this paper, the vectors are constructed by two methods as follows.
The first method: The vectors are constructed by using the traditional pre-processing steps of text mining. First, Facebook status updates are segmented by using the longest matching approach (17), which is a dictionary-based approach, to find a collection of features. Each feature is the longest meaningful Thai word. Second, stop words, such as prepositions, pronouns and numbers, are removed, because a large number of stop words always appear in status updates and may reduce the efficiency of the text classification. Third, features having similar meanings are merged into a single feature to reduce the large number of features. The matching is done by using a dictionary constructed from a collection of words and the words of similar meaning. After that, a status update is transformed into a vector. The vector consists of the set of feature weights, as shown in Table 1. The weight of a feature is calculated by using Boolean weighting as shown in equation (1), where $w_{ki}$ is the weight of feature $i$ in status update $k$ and $f_i$ is the number of occurrences of the feature. The weight is 1 if the feature occurs in the status update; otherwise, the weight is 0.

$$w_{ki} = \begin{cases} 1 & \text{if } f_i > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
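The first method can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: the dictionary, stop word list and synonym map are hypothetical, Latin-letter strings without spaces stand in for unsegmented Thai text, and emitting a single character when no dictionary word matches is a fallback chosen here.

```python
# Hypothetical resources, for illustration only.
DICTIONARY = {"the", "them", "theme", "me", "park", "is", "good", "great"}
STOP_WORDS = {"is", "the"}
SYNONYMS = {"great": "good"}  # map similar-meaning words to one feature


def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation: at each position, take the
    longest dictionary word that starts there."""
    max_len = max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
        else:
            # No dictionary word starts here: emit one character (assumed fallback).
            tokens.append(text[i])
            i += 1
    return tokens


def extract_features(text):
    """Segment, drop stop words, then merge similar-meaning words."""
    tokens = longest_matching(text, DICTIONARY)
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [SYNONYMS.get(t, t) for t in kept]


def boolean_vector(features, vocabulary):
    """Boolean weighting: 1 if the feature occurs at least once, else 0."""
    present = set(features)
    return [1 if word in present else 0 for word in vocabulary]


tokens = longest_matching("themeparkisgood", DICTIONARY)
features = extract_features("themeparkisgreat")
vector = boolean_vector(features, ["bad", "good", "park", "theme"])
```

Note that the greedy rule prefers "theme" over the shorter prefixes "the" and "them", which is the behavior the longest matching approach is named for.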
The second method: The vectors are constructed from the lexicon, which is created from a collection of positive and negative words. This collection of words is used as the set of features. For each status update, if a positive or negative word is found in the status update, the weight of that word is set to 1; otherwise, the weight is set to 0, as shown in equation (1).
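A minimal sketch of the lexicon-based method is given below. The word lists are hypothetical placeholders for the 102 positive and 101 negative Thai words, and detecting a word by substring search in the raw status text is an assumption made here; the source does not say whether matching is done before or after segmentation.

```python
# Hypothetical lexicon entries standing in for the collected Thai words.
POSITIVE_WORDS = ["good", "hope"]
NEGATIVE_WORDS = ["bad", "fear"]
LEXICON_FEATURES = POSITIVE_WORDS + NEGATIVE_WORDS


def lexicon_vector(status, features=LEXICON_FEATURES):
    """One Boolean weight per lexicon word: 1 if the word is found in the
    status update, 0 otherwise (substring matching is an assumption)."""
    return [1 if word in status else 0 for word in features]


vector = lexicon_vector("goodnewsbutfearremains")
```

With this representation, the vector length equals the lexicon size, so the feature space is much smaller than when every meaningful word becomes a feature.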
Step 3 Feature Selection: After extracting features, feature selection is performed to eliminate insignificant features. A feature is removed if its number of occurrences is less than a threshold. The threshold in this paper is measured in number of status updates. Three thresholds are set in the experiments: 4, 8 and 12.
Step 4 Building Classifiers: To evaluate the extraction methods, the pre-processed datasets are divided into training sets and testing sets by using 10-fold cross validation. The training set is used to create classifiers based on four methods: Naïve Bayes, SVM, KNN and decision trees.
Step 5 Evaluation: The classifiers are evaluated on the testing set. The accuracy of the predictions is used to evaluate the effectiveness of the classifiers. The accuracy is computed by equation (2), where tp is the number of correct predictions that an instance is positive, tn is the number of correct predictions that an instance is negative, fp is the number of incorrect predictions that an instance is positive and fn is the number of incorrect predictions that an instance is negative.

$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \qquad (2)$$
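Steps 3 to 5 can be sketched in a few lines of Python. This is an illustrative sketch only: the original experiments were run in Matlab, and the source does not specify the value of k, the distance measure, or how the folds are formed, so k = 3, Hamming distance and an index-modulo fold assignment are all assumptions. The toy vectors are synthetic and do not reproduce the paper's data.

```python
from collections import Counter


def select_features(vectors, threshold):
    """Step 3: keep features that occur in at least `threshold` status updates."""
    n_features = len(vectors[0])
    doc_freq = [sum(v[j] for v in vectors) for j in range(n_features)]
    kept = [j for j in range(n_features) if doc_freq[j] >= threshold]
    return [[v[j] for j in kept] for v in vectors], kept


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k training vectors nearest in Hamming distance."""
    ranked = sorted(range(len(train_x)),
                    key=lambda i: sum(a != b for a, b in zip(train_x[i], x)))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(tp, tn, fp, fn):
    """Step 5: accuracy from the four prediction counts."""
    return (tp + tn) / (tp + tn + fp + fn)


def cross_validate(vectors, labels, folds=10, k=3):
    """Step 4: k-fold cross validation with counts pooled over all folds.
    Sample i goes to fold i % folds (an assumed, deterministic split)."""
    tp = tn = fp = fn = 0
    for fold in range(folds):
        train = [i for i in range(len(vectors)) if i % folds != fold]
        test = [i for i in range(len(vectors)) if i % folds == fold]
        train_x = [vectors[i] for i in train]
        train_y = [labels[i] for i in train]
        for i in test:
            predicted = knn_predict(train_x, train_y, vectors[i], k)
            if predicted == 1:
                if labels[i] == 1:
                    tp += 1
                else:
                    fp += 1
            else:
                if labels[i] == 0:
                    tn += 1
                else:
                    fn += 1
    return accuracy(tp, tn, fp, fn)


# Synthetic toy data: label 1 = positive, 0 = negative; the last feature never occurs.
toy_vectors = [[1, 1, 0, 0, 0] if i % 2 == 0 else [0, 0, 1, 1, 0] for i in range(20)]
toy_labels = [1 if i % 2 == 0 else 0 for i in range(20)]

selected, kept = select_features(toy_vectors, threshold=4)
toy_accuracy = cross_validate(selected, toy_labels)
```

As a point of reference for reading accuracy figures, `accuracy(259, 0, 208, 0)` evaluates to roughly 0.555, which is what a classifier that always predicts "positive" would score on a dataset with 259 positive and 208 negative status updates.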

Experimental Results
In this paper, all experiments are performed to analyze the accuracy of opinion mining. All experiments are run on a 1.90 GHz Core™ i3 PC with 4 GB of main memory, running Microsoft Windows 7. All algorithms are implemented in Matlab.
Datasets are generated in different ways. The first method extracts features from all the meaningful words that appear in the messages collected from Facebook status updates, which are pre-processed by using the pre-processing steps of text mining. The resulting dataset is called Data1. The second method extracts features from the lexicon created from a collection of positive and negative words. Data2 is the dataset pre-processed by the second method. Data2-4, Data2-8 and Data2-12 are obtained from Data2 by the feature selection step with threshold values 4, 8 and 12, respectively. The characteristics of all datasets are shown in Table 2.
Table 3 shows the accuracies of Naïve Bayes, SVM, KNN and decision trees on Data1. Naïve Bayes gives the best accuracy compared with SVM, KNN and decision trees. The accuracy of decision trees is below 50%, and the accuracies of SVM, Naïve Bayes and KNN are also low, because the dataset contains many distinct features and a large number of zero values, which reduces the effectiveness of the classifiers.
Table 4 shows the accuracies of Naïve Bayes, SVM, KNN and decision trees on Data2. The table shows that the accuracies of KNN and SVM increase, with KNN rising to 62.37%, while the accuracies of decision trees and Naïve Bayes decrease. Therefore, KNN is the best approach for analyzing Thai people's opinion when the lexicon is used for feature extraction.
Table 5 shows the accuracies of KNN on four datasets: Data2, Data2-4, Data2-8 and Data2-12. The table shows that the accuracies do not differ much. The best accuracy is 63.58%, on Data2-8.
From Table 3, Table 4 and Table 5, the best method for mining Thai people's opinion about the revolution is KNN, compared with Naïve Bayes, SVM and decision trees. Extracting features from the lexicon is better than using all the meaningful words that appear in the Facebook status updates.

Conclusion
This research proposed to mine opinions on Thai politics from Facebook status updates. Features were extracted by two different methods. The first method extracts features from the status updates by using the traditional pre-processing of text mining. The second method extracts features from a lexicon, which requires a collection of positive and negative words. Then Naïve Bayes, SVM, KNN and decision trees are evaluated on the pre-processed datasets in terms of accuracy. Experimental results show that KNN gives the highest accuracy when positive and negative words are used for feature extraction. It achieves 63.58% accuracy for mining opinions on Thai politics. For future work on improving the accuracy, sarcasm detection in opinions about politics is a challenging issue. Moreover, adding an acronyms' lexicon, an emoticons' lexicon, and an interjections' lexicon may improve the accuracy.
In the experiments of Akaichi et al. (9,10), the n-gram patterns evaluated were unigram, bigram, trigram, a combination of unigram and bigram, a combination of unigram and trigram, a combination of bigram and trigram, and a combination of unigram, bigram and trigram. The experimental results showed that Naïve Bayes had high accuracy when the bigram was used for feature extraction, while SVM outperformed Naïve Bayes when the unigram was used for feature extraction.

Table 1. Example vector

Table 2. The characteristics of datasets

Table 3. The accuracy on Data1