Implementation of Data Mining Using C4.5 Algorithm on Customer Satisfaction in Tirta Lihou PDAM

This study applies the C4.5 algorithm, a predictive algorithm used to classify, segment, or group data, to determine customer satisfaction. The research is a data mining classification task involving 150 customers of PDAM Tirta Lihou in Totap Majawa, Kab. Simalungun, who are categorized as "Satisfied" or "Dissatisfied". Data mining is an interdisciplinary subfield of computer science and statistics whose overall objective is to extract information from data sets (with intelligent methods) and convert it into understandable structures for further use. Five criteria can affect customer satisfaction: Service Facilities (x1), Price Rates (x2), Smooth Water (x3), Corporate Image (x4), and Location (x5). The C4.5 method was run in RapidMiner Studio 5.3, a software solution for data mining, text mining, and predictive analysis that uses various descriptive and predictive techniques to give users insight so that they can make the best decisions. Measured by class recall and class precision, the "Satisfied" category has a class recall of 97.80% and a class precision of 98.89%, and the "Not Satisfied" category has a class recall of 98.31% and a class precision of 96.67%. The overall accuracy of the C4.5 algorithm is 98.00%.


INTRODUCTION
Customer satisfaction is one of the main goals of any company, whether it sells a product or offers a service. To attract and retain customers, a company must understand what its customers expect, and therefore must be able to measure each customer's level of satisfaction. A PDAM, or Regional Drinking Water Company, is a regionally owned business unit engaged in distributing clean water to the general public. PDAMs exist in every province, district, and municipality throughout Indonesia. PDAM Tirta Lihou Totap Majawa Production Unit is one of the regional drinking water companies located and operating in Pagar Jawa, Regency of Simalungun, North Sumatra. According to some customers, the service at PDAM Tirta Lihou Totap Majawa Production Unit is very good, while others complain that the water entering their houses flows slowly or does not flow at all. The PDAM must therefore find a solution so that customers are satisfied with the services provided. To address this problem, the authors examine the level of customer satisfaction at PDAM Tirta Lihou by implementing data mining with the C4.5 algorithm. Data mining is the process of looking for patterns or interesting information in selected data using certain techniques or methods; it employs one or more machine learning techniques to analyze and extract knowledge automatically.
The improvements introduced by C4.5 include handling missing values, handling continuous data, and pruning. The C4.5 algorithm takes training samples and samples as input: the training samples are sample data used to build a tree that has been tested for correctness, while the samples are the data fields later used as parameters in classifying data (Rismayanti, 2018). There are four steps in building a decision tree with the C4.5 algorithm (Arifin & Fitrianah, 2018):
1. Choose an attribute as the root, namely the attribute with the highest gain value among the existing attributes.
2. Create a branch for each value, that is, one branch per value of the attribute with the highest gain.
3. Divide the cases among the branches, then recalculate the gain within each branch without including the attribute already chosen.
4. Repeat the process in each branch until all cases in the branch have the same class or no further calculation is possible.
The decision tree is one of the most popular classification methods and is widely used in practice because it is easy for humans to interpret. It uses a tree data structure as the model for determining the class of data. There are three types of nodes in a decision tree (Putri, 2019):
1. Root node, which has no input edge and zero or more output edges.
2. Internal node, which has exactly one input edge and two or more output edges.
3. Leaf, or terminal node, which has exactly one input edge and no output edge.
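The node types described above can be sketched as a small data structure. This is only an illustrative sketch: the `Node` class, the `classify` helper, and the two-level example tree are assumptions made for this example and are not the tree produced in this study.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """A decision-tree node: a leaf when `label` is set, otherwise a test on `attribute`."""
    attribute: Optional[str] = None   # attribute tested at a root or internal node
    label: Optional[str] = None       # class label stored at a leaf
    children: Dict[str, "Node"] = field(default_factory=dict)  # one output edge per value

    def is_leaf(self) -> bool:
        return self.label is not None


def classify(node: Node, record: Dict[str, str]) -> str:
    """Follow the edge matching the record's value at each node until a leaf is reached."""
    while not node.is_leaf():
        node = node.children[record[node.attribute]]
    return node.label


# Hypothetical two-level tree, for illustration only.
tree = Node(attribute="Corporate Image", children={
    "Good": Node(label="Satisfied"),
    "Not Good": Node(label="Dissatisfied"),
    "Enough": Node(attribute="Smooth Water", children={
        "Current": Node(label="Satisfied"),
        "Non-current": Node(label="Dissatisfied"),
    }),
})

print(classify(tree, {"Corporate Image": "Enough", "Smooth Water": "Non-current"}))
```

Here the root and the inner "Smooth Water" node carry an attribute test, and every leaf carries only a class label, which mirrors the root, internal, and leaf roles listed above.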

METHOD
There are four steps in the decision-tree-making process of the C4.5 algorithm:
a. Choose an attribute as the root, based on the highest gain value among the existing attributes.
b. Create a branch for each value, that is, one branch per value of the attribute with the highest gain.
c. Divide the cases among the branches. After the initial highest gain has been found, the highest gain is calculated again within each branch without including the attribute chosen initially.
d. Repeat the process in each branch until all cases in the branch have the same class, recalculating the highest gain for each branch until no further calculation is possible.
The C4.5 algorithm uses a gain ratio parameter to select which variables form the branches of the decision tree. Entropy means difference or diversity. In data mining, entropy is a parameter for measuring the heterogeneity (diversity) of a data set: the more heterogeneous the data set, the greater the entropy value. Entropy is an information-theoretic measure that characterizes the impurity and homogeneity of a data set. From the entropy value, the information gain (IG) of each attribute is calculated. Mathematically, entropy is formulated as follows:
$$\mathrm{Entropy}(S) = \sum_{i=1}^{n} -p_i \log_2 p_i \qquad (1)$$
where $S$ is the set of cases, $n$ is the number of classes in $S$, and $p_i$ is the proportion of cases in $S$ that belong to class $i$. Formula (1) is used in the entropy calculations that determine which attributes are informative. In general, the C4.5 algorithm builds a decision tree as follows:
a. Select an attribute as the root.
b. Create a branch for each value.
c. Divide the cases among the branches.
d. Repeat the process for each branch until all cases on the branch have the same class.
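The entropy measure described above can be sketched in a few lines of Python. The function name `entropy` and the two label lists are assumptions chosen for illustration; they are not the PDAM customer data.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Sum of -p_i * log2(p_i) over the classes that actually occur.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical label sets, not the PDAM data.
pure = ["Satisfied"] * 10                              # homogeneous set
mixed = ["Satisfied"] * 5 + ["Dissatisfied"] * 5       # evenly split set

print(entropy(pure))
print(entropy(mixed))
```

A homogeneous set has the lowest possible entropy and an evenly split two-class set the highest, which matches the statement that more heterogeneous data gives a larger entropy value.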
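After obtaining the entropy value for a data set, we can measure the effectiveness of an attribute in classifying the data. This measure of effectiveness is called information gain and is given by formula (2):
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \qquad (2)$$
where $S$ is the set of cases, $A$ is an attribute, $S_v$ is the subset of cases in which $A$ takes the value $v$, and $|S_v|$ and $|S|$ are the numbers of cases in $S_v$ and $S$.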
Formula (2) is used to calculate the gain of each attribute from its entropy values. The attribute with the highest gain value is used as the root. Form a node containing that attribute, then repeat the information gain calculation until all data in each branch belong to the same class. Attributes that have already been selected are not included in subsequent calculations.
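The gain calculation and root selection described above might look like the following sketch. The helper names (`information_gain`, `best_attribute`), the attribute values, and the four records are hypothetical and serve only to show the mechanics.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence

Record = Dict[str, str]


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(records: List[Record], attribute: str, target: str) -> float:
    """Entropy of the whole set minus the weighted entropy of each attribute-value subset."""
    partitions = defaultdict(list)
    for r in records:
        partitions[r[attribute]].append(r[target])
    remainder = sum(len(p) / len(records) * entropy(p) for p in partitions.values())
    return entropy([r[target] for r in records]) - remainder


def best_attribute(records: List[Record], attributes: List[str], target: str) -> str:
    """Return the attribute with the highest gain, i.e. the candidate root."""
    return max(attributes, key=lambda a: information_gain(records, a, target))


# Hypothetical records, not the PDAM data.
records = [
    {"Corporate Image": "Good", "Location": "Near", "Satisfaction": "Satisfied"},
    {"Corporate Image": "Good", "Location": "Far", "Satisfaction": "Satisfied"},
    {"Corporate Image": "Not Good", "Location": "Near", "Satisfaction": "Dissatisfied"},
    {"Corporate Image": "Not Good", "Location": "Far", "Satisfaction": "Dissatisfied"},
]

print(best_attribute(records, ["Corporate Image", "Location"], "Satisfaction"))
```

In this toy set, Corporate Image separates the classes completely while Location carries no information, so Corporate Image is selected as the root.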
[Flowchart: generating the rules from the decision tree, with the outcomes "Potential (+)" and "Not Potential (-)", followed by "Done".]

RESULT
The first process of the C4.5 algorithm is to determine the entropy value, starting with the total entropy of all cases. The formula for finding entropy from the customer data can be seen in equation (1).

Gain Calculation
After all entropy values have been obtained, the next step is to calculate the gain of each attribute; the formula for finding gain can be seen in equation (2).
In the table above, the Corporate Image attribute has the highest gain, namely 0.6436988, so Corporate Image becomes the root node. Corporate Image has three values: Good, Enough, and Not Good. The values Good and Not Good each classify their cases into a single class: Good leads to the decision "Satisfied" and Not Good to the decision "Dissatisfied". The value Enough still contains both "Satisfied" and "Not Satisfied" cases, so further calculation is needed to determine the next node.
In the entropy and gain calculations in the table above, the attribute with the highest gain for Corporate Image = Enough is Smooth Water, with a gain value of 0.612581382. Smooth Water has three values: Current, Enough, and Non-current. The value Non-current classifies its cases into a single class with the decision "Not Satisfied", while the values Enough and Current still contain both "Satisfied" and "Not Satisfied" cases and must be recalculated, each becoming the root of a further subtree. The calculations then continue in the same way as in the tables above until every branch has a decision, which yields the decision tree shown in Figure 1.
In the table above, the factor that determines the first node is Corporate Image, the second node is Smooth Water, the third node is Service Facilities, the fourth node is Price Rates, the fifth node is Location, and the sixth node is Service Facilities. Figure 3.
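The repeated "pick the highest gain, split, recurse" procedure walked through above can be sketched as a short recursive function. This is a simplified illustration using plain information gain, as in the manual calculation; it omits the gain ratio, pruning, and continuous-attribute handling of full C4.5. The function names and the six toy rows are assumptions, not the study's data.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain(records, attribute, target):
    """Information gain of splitting `records` on `attribute`."""
    groups = defaultdict(list)
    for r in records:
        groups[r[attribute]].append(r[target])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in records]) - remainder


def build_tree(records, attributes, target):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    labels = [r[target] for r in records]
    if len(set(labels)) == 1:          # all cases share one class: stop
        return labels[0]
    if not attributes:                 # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(records, a, target))
    remaining = [a for a in attributes if a != best]   # a chosen attribute is not reused
    branches = {}
    for value in sorted({r[best] for r in records}):
        subset = [r for r in records if r[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}


# Hypothetical rows: (Corporate Image, Smooth Water, Satisfaction).
rows = [
    ("Good", "Current", "Satisfied"),
    ("Good", "Non-current", "Satisfied"),
    ("Not Good", "Current", "Dissatisfied"),
    ("Not Good", "Non-current", "Dissatisfied"),
    ("Enough", "Current", "Satisfied"),
    ("Enough", "Non-current", "Dissatisfied"),
]
records = [dict(zip(("Corporate Image", "Smooth Water", "Satisfaction"), r)) for r in rows]

tree = build_tree(records, ["Corporate Image", "Smooth Water"], "Satisfaction")
print(tree)
```

In this toy data the Good and Not Good branches become leaves immediately, while the Enough branch is split again on Smooth Water, which is the same pattern as the manual calculation described above.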
Display of the Read Excel, Apply Model, and Performance links. Figure 4 above shows how the validation and accuracy of the C4.5 algorithm (decision tree) are tested. Decision tree validation is used to see how accurately the C4.5 rule model predicts customer satisfaction with PDAM Tirta Lihou Totap Majawa Unit, using the RapidMiner software. The accuracy value is obtained by dragging and dropping the Apply Model and Performance operators from the operators menu into the process panel. Apply Model applies the model trained on an ExampleSet to produce predictions, while Performance performs a statistical evaluation of classification performance and provides a list of performance criteria values. Testing the data with Apply Model and Performance gave an accuracy value of 98%, so this C4.5 decision tree model can be categorized as excellent.

Figure 5. Decision Tree View
Whether the data for each attribute are calculated and tested manually or with the RapidMiner Studio 5.3 application using the C4.5 algorithm, the final decision tree pattern is the same. The accuracy level and AUC can be viewed by clicking the PerformanceVector (Performance) tab, which shows the accuracy results as displayed above. The model that has been formed is tested for accuracy by running the training data through the RapidMiner Studio 5.3 application, which produces accuracy, class recall, and class precision values. The "Satisfied" category has a class recall of 97.80% and a class precision of 98.89%, and the "Not Satisfied" category has a class recall of 98.31% and a class precision of 96.67%. The accuracy of the C4.5 algorithm is 98.00%, and the AUC (Area Under the ROC Curve) is 0.996.
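The relationship between these metrics can be checked with a small sketch. The four counts below are an assumption: they are reconstructed from the reported percentages and the 150 customers, not read from the RapidMiner output. With these counts the recalls (97.80% and 98.31%), the "Not Satisfied" precision (96.67%), and the accuracy (98.00%) are reproduced, and the "Satisfied" precision comes out at 98.89%, the figure that appears in the abstract.

```python
# Assumed confusion-matrix counts, reconstructed from the reported percentages.
sat_as_sat = 89       # actual Satisfied, predicted Satisfied
sat_as_not = 2        # actual Satisfied, predicted Not Satisfied
not_as_sat = 1        # actual Not Satisfied, predicted Satisfied
not_as_not = 58       # actual Not Satisfied, predicted Not Satisfied

total = sat_as_sat + sat_as_not + not_as_sat + not_as_not


def pct(numerator: int, denominator: int) -> float:
    """Ratio expressed as a percentage rounded to two decimals."""
    return round(100 * numerator / denominator, 2)


metrics = {
    "accuracy": pct(sat_as_sat + not_as_not, total),
    "recall_satisfied": pct(sat_as_sat, sat_as_sat + sat_as_not),
    "precision_satisfied": pct(sat_as_sat, sat_as_sat + not_as_sat),
    "recall_not_satisfied": pct(not_as_not, not_as_not + not_as_sat),
    "precision_not_satisfied": pct(not_as_not, not_as_not + sat_as_not),
}

for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```

Class recall divides the correct predictions for a class by the actual number of cases in that class, while class precision divides them by the number of cases predicted as that class.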

DISCUSSIONS
The results above show that the model, tested for accuracy by running the training data through the RapidMiner Studio 5.3 application, reaches values close to 100%: an accuracy of 98.00% and a class recall of up to 98.31%. It can therefore be concluded that the data mining method with the C4.5 algorithm can determine the level of customer satisfaction.

CONCLUSION
This study presents customer satisfaction with PDAM Tirta Lihou by identifying the attributes with the highest values according to the criteria provided. The first attribute chosen is the one with the highest gain, namely Corporate Image, with a value of 0.6436988. The test carried out in the RapidMiner software using Apply Model and Performance obtained an accuracy of 98.00%, which means that the resulting rules have a level of truth close to 100%. The manual calculations are accurate because they are confirmed by the system, which produces the same entropy and gain values and the same highest value at each step. To obtain more varied results, this research could be developed with other data mining techniques, such as a Genetic Algorithm or the K-Nearest Neighbor algorithm, to find the rules or model approach to be achieved. Based on the rule model obtained, PDAM Tirta Lihou Unit Totap Majawa should pay special attention to which variables need to be considered in its decisions to support customer satisfaction, so that customers feel satisfied and have no complaints.