Hidden Markov Model for Sentiment Analysis using Viterbi Algorithm

Abstract: Data mining is the activity of extracting knowledge, as valuable information, from large amounts of data. Much of the data in the Industry 4.0 era is text, and a large share of it comes from social media. Text has recently become very important in applications such as summarizing personal reviews and analyzing political opinion, which is a sensitive topic in almost all countries, including Indonesia. Online text data circulating on social media has several shortcomings that can hinder analysis. One drawback is that people can post their own content freely, so the quality of their opinions cannot be guaranteed; posts may include spam and irrelevant opinions. Another drawback is that ground truth for online text data is not always available. Ground truth here is a label for a particular opinion that indicates whether it is positive, negative, or neutral. The main objective of this study is therefore to improve the forecasting accuracy of online text data analysis from social media. The method is a Hidden Markov Model (HMM) with the Viterbi algorithm, applied to extract sentiment from a dataset on the 2015 election in Surabaya collected from the popular microblogging site Twitter. The Viterbi algorithm predicts the best route in which the candidate Tri Rismaharini receives a neutral sentiment prediction, whereas the candidate Rasiyo receives a negative sentiment prediction. The proposed model is accurate in predicting candidate features. It also helps political parties to introduce candidates based on reviews, so that they can improve candidate performance or manage broad publicity to promote candidates.


Introduction
Data mining is the activity of extracting knowledge, as valuable information, from large amounts of data. In general, data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in a database. Predictive mining tasks perform inference on the existing data in order to make predictions [1].
A Hidden Markov Model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. In an ordinary Markov model, the state is directly observable and each state depends only on the previous state, so the model specifies the probability of every possible transition between states. The state transition probabilities are therefore the only parameters. Markov models are often used for pattern recognition and for making predictions. An HMM can also be used to find effects on any candidate. Thus, the sequence of steps produced by an HMM provides information about the order of the states [2].
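To make the idea of transition probabilities concrete, the following is a minimal sketch of how they could be estimated by counting in an ordinary (fully observed) Markov chain. The function name `estimate_transitions` and the label sequence are hypothetical and serve only as an illustration; they are not taken from the study's data.

```python
from collections import Counter, defaultdict

def estimate_transitions(sequence):
    """Estimate first-order transition probabilities by counting how
    often each state is immediately followed by each other state."""
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }

# Hypothetical sequence of observed sentiment labels (illustrative only).
labels = ["positive", "positive", "neutral", "negative", "neutral", "positive"]
transitions = estimate_transitions(labels)
print(transitions["positive"])
```

In an HMM the states themselves are not observed, so these probabilities cannot be read off directly in this way; they are combined with emission probabilities that link hidden states to observations.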
Much of the data in the Industry 4.0 era is text, and a large share of it comes from social media. In recent years, natural language processing studies have become more oriented toward opinion mining in social media [3]. Sentiment analysis plays an important role in classifying text data into positive, negative, and neutral categories according to the opinions expressed in reviews. The process is also studied and applied to users who do not explicitly express their sentiment orientation in a particular context [4]. Sentiment analysis poses several difficulties, among them determining whether an assessment is expressed in a whole opinion or only part of it, which subject or object it is addressed to, and whether the expressed opinion is positive, negative, or neutral.
Text has recently become very important in applications such as summarizing personal reviews and analyzing political opinion, which is a sensitive topic in almost all countries, including Indonesia. Online text data circulating on social media has several shortcomings that can hinder analysis. One drawback is that people can post their own content freely, so the quality of their opinions cannot be guaranteed; posts may include spam and irrelevant opinions. Another drawback is that ground truth for online text data is not always available. Ground truth here is a label for a particular opinion that indicates whether it is positive, negative, or neutral [5].
Unstructured data is data that has no specific format or model. Text, image, and video data are examples of unstructured data. This type of data is estimated to represent 80 percent of the valuable information in most organizations [6]. Social media is not only a popular place to discuss a problem but also a place to gather community sentiment, in the form of text opinions, images, or videos, about whatever is going viral [7]. On Twitter, people can post their own content freely, so the quality of their opinions cannot be guaranteed. Therefore, the main objective of this study is to improve the forecasting accuracy of online text data analysis from Twitter. The method used is a Hidden Markov Model (HMM) with the Viterbi algorithm, applied to text data extracted from Twitter. The Viterbi algorithm, proposed by Andrew J. Viterbi in 1967, is a dynamic programming algorithm that finds the most probable sequence of hidden states, called the "Viterbi path", from a given sequence of observed events in the context of an HMM [8]. Applied to a time-frequency (TF) distribution, the Viterbi algorithm (VA) is also a high-performing instantaneous frequency (IF) estimator [9].
Regional head elections, or pemilihan kepala daerah (Pilkada), are held periodically in a country that adheres to democracy. A political figure who wants to run as a candidate for head of a certain region will consider their popularity based on public opinion. The 2015 General Election for Mayor of Surabaya was held on December 9, 2015 to elect the Mayor of Surabaya for the 2016-2021 period. It coincided with the simultaneous regional head elections held throughout Indonesia on the same day. Two pairs of candidates competed in this election: the incumbent pair Tri Rismaharini/Whisnu Sakti Buana, nominated by the Partai Demokrasi Indonesia Perjuangan (PDI-P), and Rasiyo/Lucy Kurniasari, nominated by the Partai Demokrat and the Partai Amanat Nasional (PAN). The election was won by the Tri Rismaharini/Whisnu Sakti Buana pair, nominated by the PDI-P, with a total of 893,087 votes (86.34%), in accordance with the decision of the Surabaya City KPU on December 22, 2015.
This study was therefore conducted to analyze the sentiment toward the two candidates using HMM modeling with the Viterbi algorithm. From this, it can be concluded whether the results of the sentiment analysis agree with the results of the 2015 Surabaya Pilkada.

Materials and Methods
This study uses a Hidden Markov Model to make predictions while accounting for hidden factors in data on the 2015 election (reviews of the candidates in that election), gathered from the popular microblogging site Twitter as the dataset. The stages of this study are shown in Figure 1.
As shown in Figure 1, the first step is crawling the data. In this step, data is gathered from Twitter. The number of tweets is too large to select tweets manually, so the Python package "Twitterscraper" is used as the interface to extract tweets directly from Twitter. The extracted data is saved to a file in XLSX (Excel) format because it is human-readable and also easy for a machine to process.
The second step is data preparation, which arranges the data so that it is tidy and contains the variables needed for analysis. The data obtained has 21 variables, but only two are analyzed: the text variable and a newly added candidate variable. The text variable is a comment about a candidate, while the candidate variable is the name of the candidate the comment refers to.
EKSAKTA journal.uii.ac.id/eksakta February 2021, Volume 2, Issue 1, 18-23
The next step is data preprocessing, which is an important step in data mining. The data used in the mining process is not always in ideal condition for processing. It may contain a variety of issues that can interfere with the results of the mining process, such as missing values, redundant data, outliers, or data formats incompatible with the system. A preprocessing stage is therefore required to address these issues. Preprocessing is the step that eliminates problems that could interfere with the results before the data is processed.
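The data preparation and basic cleaning described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `prepare`, the field names, and the sample rows are hypothetical and do not come from the study's dataset.

```python
def prepare(rows, candidate):
    """Keep only the tweet text, attach the candidate name, and drop
    rows whose text is missing, blank, or a duplicate of an earlier row."""
    seen = set()
    prepared = []
    for row in rows:
        text = (row.get("text") or "").strip()
        if not text or text in seen:
            continue  # skip missing values and redundant data
        seen.add(text)
        prepared.append({"text": text, "candidate": candidate})
    return prepared

# Hypothetical scraped rows; a real export carries many more variables.
raw = [
    {"text": "contoh tweet pertama", "likes": 3},
    {"text": "contoh tweet pertama", "likes": 0},  # duplicate text
    {"text": None, "likes": 1},                    # missing text
    {"text": "   ", "likes": 1},                   # blank text
    {"text": "contoh tweet kedua", "likes": 2},
]
clean = prepare(raw, "Candidate A")
print(clean)
```

Reducing each row to two fields mirrors the choice of analyzing only the text and candidate variables; handling of outliers or incompatible formats would need additional, dataset-specific rules.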
In document classification with text data, preprocessing generally includes case folding, filtering (removing punctuation), stop word removal, and stemming. The preprocessing stages are as follows:
1. Stop word removal: a stage for removing unnecessary words such as "yang", "di", "ke", and so on. This improves the effectiveness of the system, because only text considered important is processed. The stop word list used in Python comes from Sastrawi.
2. Cleaning: a process that clears the document of unnecessary words to reduce noise in the analysis.
3. Stemming: a method for mapping a token to its base form (Rizqon et al., 2017). This changes an affixed word into its base word, such as "melepas" to "lepas" and "berjumpa" to "jumpa".
The fourth step is sentiment identification. Identifying the expressions in a tweet is essential for sentiment analysis. Using the Python programming language, the extracted tweets are sorted into positive, negative, and neutral categories by comparing each extracted word against extensive lists of common positive and negative words. After this classification stage, a score is computed for each complete tweet, which determines whether the tweet is positive, negative, or neutral. There are several methods for calculating the sentiment value of a sentence; here we use one of the popular ones.
In this score, P is the number of positive words, N is the number of negative words, and O is the total number of words. The score determines the sentiment: if the value is greater than 0, the sentiment is positive; if it is less than 0, the sentiment is negative; and if it equals 0, the sentiment is neutral. The next step is model formation using the HMM, and the last step is prediction and analysis of the results.
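The preprocessing and scoring steps above can be sketched together as follows. The sketch assumes the score takes the common form (P − N) / O, which is consistent with the definitions and sign rules in the text but is an assumption rather than a confirmed formula. The stop word list, stem table, and sentiment word lists are tiny hypothetical stand-ins (the study uses Sastrawi and much larger lexicons), and the order of the steps is one reasonable choice.

```python
import re

# Hypothetical, tiny resources for illustration only. They stand in for
# the Sastrawi stop word list and stemmer and for full sentiment lexicons.
STOP_WORDS = {"yang", "di", "ke"}
STEMS = {"melepas": "lepas", "berjumpa": "jumpa"}
POSITIVE = {"bagus", "baik"}
NEGATIVE = {"buruk", "jelek"}

def preprocess(text):
    """Case folding, cleaning, stop word removal, then toy stemming."""
    text = text.lower()                      # case folding
    text = re.sub(r"[^a-z\s]", " ", text)    # cleaning: drop non-letters
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [STEMS.get(t, t) for t in tokens]  # dictionary lookup as stemmer

def sentiment_score(tokens):
    """Assumed score (P - N) / O; returns 0.0 for an empty tweet."""
    p = sum(t in POSITIVE for t in tokens)   # P: positive words
    n = sum(t in NEGATIVE for t in tokens)   # N: negative words
    o = len(tokens)                          # O: total words
    return (p - n) / o if o else 0.0

def label(score):
    """Map the score to a sentiment class using the sign rules."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

tokens = preprocess("Berjumpa di taman yang BAGUS!")
print(tokens, label(sentiment_score(tokens)))
```

Dividing by O does not change the sign of the score, so the resulting label depends only on whether positive or negative words are in the majority.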

Hidden Markov Model (HMM)
According to [10], a Markov chain is useful when we need to calculate the probability of a sequence of observable events. In an HMM, however, the events of interest are hidden (we do not observe them directly). An HMM is built from several components: S is the number of states in the Markov model, A is the state transition probability, B is the emission probability in a state, and π is the initial state probability of the Markov model. An HMM can be defined as follows:
$$\lambda = (A, B, \pi)$$
A is the probability of a transition from state i to state j:
$$a_{ij} = P(q_{t+1} = j \mid q_t = i)$$
From Figure 2 it can be concluded that, in the data obtained from Twitter, the candidate observations for positive sentiment are 74% for the candidate Tri Rismaharini and 26% for the candidate Rasiyo, the candidate observations for negative sentiment are 71% for Tri Rismaharini and 26% for Rasiyo, and the candidate observations for neutral sentiment are 74% for Tri Rismaharini and 26% for Rasiyo. Thus, the candidate with the highest percentage for every sentiment is Tri Rismaharini. Using the chart above and the Markov assumption, researchers can easily predict whether the next tweet will be positive, negative, or neutral. Using the Viterbi algorithm, researchers can find the best route through the hidden states for the observation sequence in which the first day is Tri Rismaharini and the second day is Rasiyo. The probabilities for the observations of Tri Rismaharini and Rasiyo in each hidden state are as follows:
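As a separate illustration of the decoding step, the following is a minimal sketch of the Viterbi algorithm for an HMM whose hidden states are sentiments and whose observations are candidates. All probabilities below are illustrative placeholders chosen for the example; they are not the study's estimates, and the function and variable names are hypothetical.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path and its probability."""
    # best[t][s]: probability of the best path that ends in state s at step t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, prob = max(
                ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = prob * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

states = ["positive", "negative", "neutral"]
# Placeholder parameters (pi, A, B); NOT taken from the paper.
start_p = {"positive": 0.3, "negative": 0.3, "neutral": 0.4}
trans_p = {
    "positive": {"positive": 0.5, "negative": 0.2, "neutral": 0.3},
    "negative": {"positive": 0.2, "negative": 0.5, "neutral": 0.3},
    "neutral": {"positive": 0.3, "negative": 0.3, "neutral": 0.4},
}
emit_p = {
    "positive": {"Tri Rismaharini": 0.7, "Rasiyo": 0.3},
    "negative": {"Tri Rismaharini": 0.3, "Rasiyo": 0.7},
    "neutral": {"Tri Rismaharini": 0.6, "Rasiyo": 0.4},
}

path, prob = viterbi(["Tri Rismaharini", "Rasiyo"], states, start_p, trans_p, emit_p)
print(path, prob)
```

Plain products of probabilities are adequate for a two-step sequence like this one; for long sequences, summing log probabilities is the usual way to avoid numerical underflow.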

Conclusions
Using an HMM for prediction yields initial state, transition, and emission probabilities, so the model takes into account hidden states that can affect forecasting accuracy. The proposed model is accurate in predicting sentiment toward the candidates of the 2015 Pilkada. It can also help parties in future Pilkada to assess candidates based on sentiment, so that they can improve each candidate's performance to earn favorable sentiment or manage wide publicity informed by the sentiment toward each candidate. In this case, the Viterbi algorithm predicts the best route in which the candidate Tri Rismaharini receives a neutral sentiment prediction, whereas the candidate Rasiyo receives a negative sentiment prediction.