Predicting Customer Satisfaction through Sentiment Analysis on Online Review

User-generated content, such as user reviews, posts, tags, ratings, and opinions on the internet, can be used as a business indicator if collected and appropriately analyzed. One example is predicting customer satisfaction by applying big data analytics to online reviews. To predict customer satisfaction from user-generated content, the author implements a machine learning approach using the Sentiment Analysis method. Five-fold cross-validation was performed to train the classification model. The training was performed with a combination of tokenization methods: term frequency-inverse document frequency (tf-idf) and bag-of-words; n-gram types: unigram, bigram, trigram, and a combination of unigram, bigram, and trigram; and machine learning algorithms: linear support vector classification (LinearSVC) and multinomial naïve Bayes (MultinomialNB). The result was then evaluated using classification performance metrics: precision, recall, F1 measure, and AUC score. The result shows that the tf-idf vectorizer performs similarly to the bag-of-words method. Algorithm selection showed the same pattern: MultinomialNB and LinearSVC produced the same performance. Low-order n-grams (such as unigrams and bigrams) tended to have higher precision, recall, F1 measure, and AUC score than high-order n-grams (such as trigrams). The best results were achieved by combining unigrams, bigrams, and trigrams, resulting in an average performance score of 0.94 for all measurements. From the result and analysis, the author finds that predicting customer satisfaction using text and sentiment analysis methods on user-generated content is possible. The model performs well in this experiment, with high precision, recall, F1, and AUC scores.


INTRODUCTION
Recent developments in the digital environment drive the generation of a vast amount of user-generated content. User-generated content, such as user posts, user reviews, tags, ratings, and opinions on the internet, can be used as a business indicator if collected and appropriately analyzed. Converting user-generated content into information can also provide organizations with detailed and credible information about their customers' opinions and perceptions of their services [1]. Hence, a firm's or manager's ability to convert user-generated data into valuable information could drive business success [2].
One example of user-generated content is a customer review of a service or a product. In a customer review, customers are able to quantitatively evaluate the products or services they have purchased and explain the reason in writing [3]. The current trend shows that before purchasing goods or services, consumers look for pertinent information to lessen their uncertainty of choice. Because ratings and reviews are the data sources consumers trust most during the information search process [3], people tend to rely on this information before making a purchase decision. Thus, firms may benefit from mining and analyzing user-generated content such as comments and sentiments [2].
Analyzing user-generated content to drive business decisions can be seen as an implementation of big data analytics for business. It is critical in big data environments to process and act quickly on available data. Although mining information from big data is a challenging task, big data has the potential to revolutionize all areas of science [4]. Big data analytics can be applied to understand customer needs better, and a company's customer relationship management performance can improve as a result of that understanding [5].
An example of the application of big data analytics in business is the measurement of a customer's perception of or experience with a product or service using unstructured data such as user reviews or social media posts [6]. The need to monitor customer experience arises because customers interact with businesses through multiple touch points across a variety of channels and media, resulting in more complex customer journeys [7]. Due to the immense diversity and size of social media data, it is difficult for humans or businesses to gather the most recent trends and summarize the current situation of their products; this necessitates automated opinion mining [8]. Sentiment Analysis can tackle this challenge as it can extract opinions from enormous datasets promptly.
Measuring customer satisfaction is critical for a company because it is strongly linked to financial performance [9]. As the customer journey has become more complex, measuring customer satisfaction through big data analytics, particularly natural language processing, is essential [10].
The use of user-generated content as a source for determining customer sentiment has been demonstrated by [11]. Using Twitter data, [11] performed sentiment analysis to gather insight from public opinion by classifying tweets according to their positive or negative sentiment value. Sentiment analysis or opinion mining is the study of opinions, attitudes, and emotions toward an entity [12]. The entity can be a topic, an individual, or an event. In the scientific community, the two terms are used interchangeably to refer to the same thing [4].
Another work by [13] developed a framework for measuring customer satisfaction with mobile application products by analyzing online review data using sentiment analysis combined with the VIKOR method (VIseKriterijumska Optimizacija I Kompromisno Resenje). Opinion mining and sentiment analysis of online reviews written by customers can also be used to determine the key attributes affecting customer satisfaction in hospitality services [14].

A. Data Collection and Pre-processing
The general overview of the design methodology can be seen in Fig. 1. The data gathered are users' reviews of a point-of-sales company's application from the Google Play Store platform. The data were then pre-processed to remove unnecessary characters that could adversely affect the analysis result, such as punctuation, emoticons, numeric values, and white spaces. Another pre-processing step performed on the data is stemming. Stemming transforms a word into its stem (base) form. The stemming process in this research was performed using Sastrawi, a Python library specialized in stemming Indonesian words. After the stemming process, the author continues the data pre-processing by removing stop words from the document. Stop words are a list of commonly used words such as pronouns or particles. These words frequently appear in a document but do not carry sentiment value. Removing the stop words is necessary because their high number of appearances in the documents could otherwise distort the analysis result. Text data transformation during data pre-processing is illustrated in Table 1.
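The pre-processing steps described above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the lowercasing step, the ASCII-only character filter, the tiny `STOP_WORDS` list, and the function names `clean_text` and `preprocess` are all assumptions made for the example. Sastrawi is used only if it is installed; otherwise the stemming step is skipped.

```python
import re

# Hypothetical mini stop-word list for illustration only;
# the paper does not publish the list it used.
STOP_WORDS = frozenset({"yang", "dan", "di", "ini", "itu", "saya"})


def clean_text(text):
    """Lowercase, then drop punctuation, emoticons, digits, and extra whitespace."""
    text = text.lower()  # assumption: the paper does not mention case folding
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only ASCII letters and whitespace
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text, stem=None, stop_words=STOP_WORDS):
    """Clean, optionally stem, then remove stop words (same order as in the text)."""
    cleaned = clean_text(text)
    if stem is not None:
        cleaned = stem(cleaned)
    return " ".join(tok for tok in cleaned.split() if tok not in stop_words)


# Use the Sastrawi stemmer when available; fall back to no stemming otherwise.
try:
    from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
    stem_fn = StemmerFactory().create_stemmer().stem
except ImportError:
    stem_fn = None

print(preprocess("Aplikasi ini sangat membantu usaha saya!!! :) 100%", stem=stem_fn))
```

Passing the stemmer in as a plain function keeps the cleaning logic testable without the external library.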

B. Feature Extraction
When building a classification model, selecting the feature vector is a critical task. Selecting a suitable feature vector can strongly affect how well the classifier succeeds. The feature vector is used to construct a model from which the classifier learns and can then classify previously unseen data [11]. In this research, tokenization is performed on the review data to convert the text data

C. Model Development and Evaluation
The top 100 tokens from the vectorizations are then used as features in building the machine learning model using the Linear Support Vector Classification (LinearSVC) and Multinomial Naïve Bayes (MultinomialNB) algorithms. Each review was labeled as positive or negative using its rating score. The author grouped the reviews with one- and two-star ratings as negative (dissatisfied) and the reviews with four- and five-star ratings as positive (satisfied). The classifier was trained and assessed using five-fold cross-validation. The model's performance was evaluated using classification performance metrics: precision (1), recall (2), F1 measure (3), and area under the receiver operating characteristic curve (AUC).
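$$\text{Precision} = \frac{TP}{TP + FP} \quad (1)$$
$$\text{Recall} = \frac{TP}{TP + FN} \quad (2)$$
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (3)$$
where $TP$ is the number of true positives, $FP$ is the number of false positives, and $FN$ is the number of false negatives.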
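The training and evaluation setup can be sketched as a grid over vectorizer, n-gram type, and algorithm. The class names LinearSVC and MultinomialNB match scikit-learn, so the sketch assumes that library. The toy reviews, the decision to drop three-star reviews, the default hyperparameters, and the helper names `label_from_rating` and `evaluate_grid` are assumptions for illustration and are not taken from the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

NGRAM_RANGES = {
    "unigram": (1, 1),
    "bigram": (2, 2),
    "trigram": (3, 3),
    "uni-bi": (1, 2),
    "uni-bi-tri": (1, 3),
}
VECTORIZERS = {"bow": CountVectorizer, "tfidf": TfidfVectorizer}
MODELS = {"LinearSVC": LinearSVC, "MultinomialNB": MultinomialNB}
SCORING = ["precision_weighted", "recall_weighted", "f1_weighted", "roc_auc"]


def label_from_rating(stars):
    """1-2 stars -> 0 (dissatisfied), 4-5 stars -> 1 (satisfied), 3 stars -> None."""
    if stars <= 2:
        return 0
    if stars >= 4:
        return 1
    return None  # assumption: the paper does not say how 3-star reviews are handled


def evaluate_grid(texts, labels, folds=5):
    """Cross-validate every vectorizer / n-gram / model combination."""
    results = {}
    for vec_name, vec_cls in VECTORIZERS.items():
        for ngram_name, ngram_range in NGRAM_RANGES.items():
            for model_name, model_cls in MODELS.items():
                pipeline = make_pipeline(
                    # max_features=100 keeps the 100 most frequent tokens
                    vec_cls(ngram_range=ngram_range, max_features=100),
                    model_cls(),
                )
                scores = cross_validate(
                    pipeline, texts, labels, cv=folds, scoring=SCORING
                )
                results[(vec_name, ngram_name, model_name)] = {
                    metric: scores[f"test_{metric}"].mean() for metric in SCORING
                }
    return results


# Hypothetical, already pre-processed toy reviews with star ratings.
positive = ["aplikasi sangat bantu usaha", "bagus mudah guna sekali",
            "sangat bantu usaha kecil", "aplikasi bagus mudah guna",
            "mudah guna sangat bantu"] * 2
negative = ["aplikasi sering error terus", "tidak bisa login akun",
            "gak bisa masuk aplikasi", "aplikasi error tidak bisa",
            "sering error gak bisa"] * 2
reviews = ([(t, 5) for t in positive] + [(t, 1) for t in negative]
           + [("lumayan tapi kadang lambat", 3)])

labeled = [(t, label_from_rating(s)) for t, s in reviews]
texts = [t for t, y in labeled if y is not None]
labels = [y for t, y in labeled if y is not None]

results = evaluate_grid(texts, labels)
best = max(results, key=lambda key: results[key]["f1_weighted"])
print(best, results[best])
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary is refit inside each training fold, so no token statistics leak from the held-out fold.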

D. Opinion Mining on Satisfied or Dissatisfied Review
The author uses bag-of-words vectorization combined with trigram tokens to assess opinions about the services. Bag-of-words vectors are chosen because raw counts are more intuitive and therefore easier for humans to interpret. Trigrams were chosen to follow the general structure of an Indonesian sentence, which commonly consists of three elements: subject, verb, and object or adjective. The opinion mining process was then performed separately on positive (satisfied) and negative (dissatisfied) reviews to determine the main concern in each review category. The ten most frequent tokens are then analyzed.
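A minimal sketch of this opinion mining step, counting trigram frequencies per review category with the standard library. The helper names `trigrams` and `top_trigrams` and the toy reviews are hypothetical; the paper's actual token counts come from its own vectorizer.

```python
from collections import Counter


def trigrams(text):
    """Return all word trigrams of a whitespace-tokenized text."""
    words = text.split()
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]


def top_trigrams(reviews, k=10):
    """Count trigrams over a list of reviews and return the k most frequent."""
    counts = Counter()
    for text in reviews:
        counts.update(trigrams(text))
    return counts.most_common(k)


# Hypothetical toy reviews, one list per sentiment class.
negative_reviews = [
    "aplikasi sering error tidak bisa login",
    "aplikasi sering error terus",
    "tidak bisa login sama sekali",
]
positive_reviews = [
    "aplikasi sangat membantu usaha",
    "sangat membantu usaha saya",
]

print("negative:", top_trigrams(negative_reviews))
print("positive:", top_trigrams(positive_reviews))
```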

RESULTS
The results are based on 6,994 reviews written in Indonesian that the company received between 2019 and 2022. The sentiment distribution of the reviews is imbalanced between the positive and negative classes. Reviews with positive sentiment dominate the distribution with 6,150 data points, or approximately 87% of the total reviews. The remaining 13% are reviews with negative sentiment.

Fig.3 Precision performance of each combination.
The precision score in this experiment ranges from 0.89 to 0.95 (Fig. 3). The lowest score, 0.89, occurs for every trigram token combination, while the highest score is 0.95. A similar pattern was observed for the recall score, where the trigram combinations gave the lowest performance across all vectorizers and algorithms. Overall, the recall score for each combination is high, ranging from 0.90 to 0.95 (Fig. 4).
The result of opinion mining from user reviews is shown in Table 3. The main concerns in the negative (dissatisfied) reviews are that the application often encounters errors (aplikasi sering error) and that users cannot log in to their accounts (tidak bisa login, ga bisa buka, gak bisa masuk). Meanwhile, the positive reviews describe how well the application helps the user's business (aplikasi sangat membantu, sangat membantu usaha) and how easy the application is to operate (easy to use, bagus mudah gunakan).

DISCUSSIONS
In measuring the model performance, weighted average scoring was applied to the precision, recall, and F1 metrics. The weighted average is used to take the class imbalance in the dataset into account. Without considering the class imbalance, the reported performance might be biased, since a model could perform excellently on one class but poorly on the other.
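As an illustration of weighted average scoring, the sketch below computes per-class precision, recall, and F1, then averages them with each class weighted by its share of the true labels. This is the usual support-weighted definition and is assumed here; the function names and the toy labels are hypothetical.

```python
def per_class_scores(y_true, y_pred, cls):
    """Precision, recall, and F1 when `cls` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def weighted_average(y_true, y_pred):
    """Average per-class scores, weighting each class by its support."""
    total = len(y_true)
    weighted = [0.0, 0.0, 0.0]
    for cls in sorted(set(y_true)):
        support = y_true.count(cls)
        for i, score in enumerate(per_class_scores(y_true, y_pred, cls)):
            weighted[i] += score * support / total
    return tuple(weighted)


# Hypothetical imbalanced toy labels: 8 satisfied (1), 2 dissatisfied (0).
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 8 + [1, 0]  # one dissatisfied review is misclassified

precision, recall, f1 = weighted_average(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case the minority class has a recall of only 0.5, while the weighted recall is 0.9 because the majority class carries 80% of the weight.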
The poor performance of high-order n-grams (bigrams and trigrams) in several measurements is consistent with previous research by [15]. According to their findings, sentiment analysis of documents at the sentence level using unigrams outperforms higher-order n-grams [15]. This phenomenon could be explained by the fact that, at the sentence level, bigrams and trigrams occur even less frequently per sentence than unigrams [16].

Fig.1 Research design implemented in this paper.
The tokenization or vectorization builds a bag of words based on the frequency of each token in the document. The author implements two vectorization methods: bag-of-words and term frequency-inverse document frequency (tf-idf) vectorization. Five n-gram variations are used during the vectorization process: unigram, bigram, trigram, unigram-bigram (a mix of unigrams and bigrams), and unigram-bigram-trigram (a mix of unigrams, bigrams, and trigrams).
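The five n-gram variations and the two vectorization methods can be illustrated with a small standard-library sketch. The function names are hypothetical, and the tf-idf weighting shown is the textbook form (count times log of N over document frequency), which is an assumption; libraries such as scikit-learn use smoothed variants, and the paper does not state which formula it used.

```python
import math
from collections import Counter

NGRAM_RANGES = {
    "unigram": (1, 1),
    "bigram": (2, 2),
    "trigram": (3, 3),
    "unigram-bigram": (1, 2),
    "unigram-bigram-trigram": (1, 3),
}


def ngram_tokens(text, n_min, n_max):
    """All word n-grams of `text` for n in [n_min, n_max]."""
    words = text.split()
    return [
        " ".join(words[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(words) - n + 1)
    ]


def bag_of_words(documents, n_min, n_max):
    """One token-count mapping per document."""
    return [Counter(ngram_tokens(doc, n_min, n_max)) for doc in documents]


def tf_idf(documents, n_min, n_max):
    """Reweight counts by log(N / df); textbook variant, assumed for illustration."""
    bags = bag_of_words(documents, n_min, n_max)
    doc_freq = Counter(token for bag in bags for token in bag)
    n_docs = len(documents)
    return [
        {tok: count * math.log(n_docs / doc_freq[tok]) for tok, count in bag.items()}
        for bag in bags
    ]


doc = "aplikasi sangat bantu usaha"
for name, (n_min, n_max) in NGRAM_RANGES.items():
    print(name, ngram_tokens(doc, n_min, n_max))
```

A token that appears in every document receives a tf-idf weight of zero under this formula, which is how tf-idf discounts terms that do not help distinguish documents.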
The highest precision score, 0.95, is achieved by unigram, mixed unigram-bigram (uni-bi), and mixed unigram-bigram-trigram (uni-bi-tri) tokens in combination with the tf-idf vectorizer, for both the LinearSVC and MultinomialNB algorithms. The highest score was also achieved by unigram, unigram-bigram, and unigram-bigram-trigram tokens in combination with the bag-of-words vectorizer and the MultinomialNB algorithm.

Table 1. Text data transformation process.

Table 2. Tokenization based on n-gram types.

Table 3. Top ten most frequent terms for positive and negative reviews.