The Application of C4.5 Algorithm for Selecting Scholarship Recipients

The scholarship program is one of the promotional techniques used by many universities, and a well-targeted scholarship award is an attraction for many people. STMIK Pelita Nusantara is one of the universities that organizes a scholarship program. In the current difficult economic conditions, the scholarship program is sought by many prospective students who want to continue to higher education. However, the absence of tools to process large amounts of data makes determining scholarship recipients less effective and time-consuming. This situation is reflected in the fact that some students are still unable to maintain the scholarships they receive. In the research, a classification model was proposed using the C4.5 algorithm, utilizing past data to facilitate decision making in the scholarship program. The classification process produced a decision tree that could be used as a decision-making tool. Scholarships were awarded based on several criteria: academic potential, vocational potential, parents' income, number of dependents, and employment status. Based on the processing of data on students who applied for scholarships in 2020 with the predetermined criteria, the attributes with the highest gain were obtained as roots: node 1 for academic potential, node 1.1 for vocational potential, and node 1.2 for parents' income. The resulting decision tree model is expected to help make decisions quickly and on target.


I. INTRODUCTION
Current technological developments have been used as a tool in various fields, such as education, to enable teachers to process student data quickly, precisely, and accurately. Hence, the goals of a task can be achieved effectively and efficiently (Hidayad, Defit, & Sumijan, 2020). Data processing with such technology is easier when it is paired with the right data processing model. One of the ways to process large amounts of data is through data mining techniques.
Data mining is an implementation model applied to look for patterns in previous data in order to extract knowledge from large amounts of data (Guntur, Santony, & Yuhandri, 2018). The purpose of data mining is usually predictive (Dardzinska & Zdrodowska, 2020). According to Daryl Pregibon, data mining is a mixture of statistics, artificial intelligence, and database research which is still developing (Sulastri & Gufroni, 2017). Various techniques are available in data mining for knowledge extraction, including prediction, description, classification, estimation, association, and grouping (Ariawan, 2019; Afrianto, Suseno, & Warsito, 2020).
According to Azmi and Dahria (2013), data mining is an iterative process that requires human interaction to find a new pattern or model that can be generalized for the future and useful to carry out an action. Data mining has the concept of capturing and storing data, converting raw data into information and information into knowledge (Condrobimo, Sano, & Nindito, 2016).
Data mining is also called Knowledge Discovery in Database (KDD) or pattern recognition (Hidayad et al., 2020). Data mining includes collecting and using historical sources and data to find regularities, patterns, or relationships in large datasets (Santoso, Hariyadi, & Prayitno, 2016). The process includes understanding the application field, building the target data from the raw data contained in the database, and preprocessing and cleaning the data (Virgo, Defit, & Yunus, 2020). The main goal of KDD is to extract high-level knowledge from low-level information (Putra & Defit, 2019). The KDD process generally consists of the following steps: data selection, data transformation, exploration (the extraction of knowledge from data), and interpretation of results (Dardzinska & Zdrodowska, 2020). Data mining has also been implemented to predict students' study periods. The test results show that the error rate in predicting students' study period is only 5% (Haryati, Sudarsono, & Suryana, 2015).
Next, classification is finding a model or function that differentiates concepts or data classes. It aims to predict the class of objects whose class labels are unknown based on training data analysis (data objects with known class) (Afrianto et al., 2020). It is also the most commonly applied data extraction technique to predict categorical attribute values (discrete or nominal). It uses a set of previously classified examples to develop a model to classify entire population records with a decision tree or neural network-based classification algorithm. The process involves two stages, learning and classification. At the learning stage, the classification algorithm analyzes the training data. At the classification stage, test data is used to estimate the accuracy of the classification rules. The rules can be applied to the new data tuples if the accuracy is acceptable. The classifier training algorithm uses pre-classified examples to determine the set of parameters required for true discrimination and encodes these parameters into the model called classifier (Bedregal-Alpaca, Cornejo-Aparicio, Zárate-Valderrama, & Yanque-Churo, 2020).
Moreover, a decision tree is a data mining method used for classification. Decision tree classification is a simple classification technique that is widely used. Previous researchers have developed various decision tree algorithms over time, improving their performance and their ability to handle various data types. Examples are the Chi-squared Automatic Interaction Detector (CHAID), Classification and Regression Tree (CART), Iterative Dichotomiser 3 (ID3), C4.5 algorithm, C5.0 algorithm, Hunt's algorithm, and Ordinal Class Classifier (OCC) (Effendy & Purbandini, 2018).
In the research, the C4.5 algorithm is used. The C4.5 algorithm is a classification technique that uses entropy and information gain as the splitting criterion in a decision tree (Florence & Savithri, 2013). It constructs a decision tree from training data in the form of cases or records (tuples) in the database (Riandari & Simangunsong, 2019). A study comparing the C4.5 algorithm and the CART algorithm in student grade classification reports that the C4.5 algorithm has a higher accuracy of 85.61%, while the CART algorithm has 84.95% (Rahmayuni, 2014). Moreover, the decision tree generated by the C4.5 algorithm takes classified samples as input, and the algorithm has been applied to achieve accurate classification on datasets containing large amounts of data (Fiandra, Defit, & Yuhandri, 2017).
Scholarships are one of the leading programs offered by many universities. In the current difficult economic conditions, scholarship programs are sought by many prospective students who want to pursue higher education. However, there are still difficulties in determining the eligible prospective students because of the many applicants and the many variables assessed. Besides that, there are no tools that support the selection, so it takes a long time, and the possibility of inaccurate selection results is quite high. Based on the previous explanation, data mining can be used to extract student data based on the characteristics of the selection results for scholarship recipients. The classification algorithm used is a decision tree with the C4.5 algorithm. The classification results take the form of a decision tree that can be used as a tool for making decisions about scholarship recipients quickly and on target.

II. METHODS
In tree formation with the C4.5 algorithm, there are several stages. First, the training data are prepared. Training data are usually taken from historical data that have been grouped into certain classes. Second, the root of the tree is determined. The root is taken from the selected attribute by calculating the gain value of each attribute. Then, the attribute with the highest value becomes the first root (Dhika & Destiawati, 2015).
The analysis with the C4.5 algorithm is carried out after the problem to be analyzed has been identified. The existing data are then processed, and the C4.5 algorithm is applied once all required data are complete. Data processing is carried out in accordance with the KDD stages (Rahmayuni, 2014).
First, there is selection. The object of the research is students who applied for scholarships in 2020. The research is carried out at STMIK Pelita Nusantara Medan. The data are collected through observation and interviews with the implementers. The data obtained are qualitative, containing information on each variable determined by the college for awarding scholarships, namely the academic potential score, the vocational potential test, parents' income, number of dependents, and employment status. The number of new students who applied for the scholarship that year is 150 people.
Second, there is preprocessing or cleaning.
After the selection data are obtained (the data of 150 prospective scholarship recipients in 2020), they proceed to the data cleaning stage to remove inconsistent or noisy data and data with identical values. In other words, this stage discards the data of prospective recipients whose scores are the same as those of another prospective recipient on every criterion. Distinct patterns are searched for at this stage. If the same pattern is found more than once, only one representative pattern is kept, and the rest are removed. The cleaning stage therefore yields 16 distinct patterns (16 people) from the data of 150 participants in the scholarship acceptance process.
Third, there is transformation. The preprocessed qualitative data are grouped and transformed into an assessment form suitable for data mining. In the research, the data are converted into quantitative form, which makes them easier to define during testing.
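The duplicate-pattern cleaning described above can be sketched as follows. This is only a minimal illustration: the field names, the example records, and the choice to compare the five criteria without the decision label are assumptions, since the paper does not list its dataset columns.

```python
# Hypothetical field names; the paper does not give the column names of its dataset.
CRITERIA = ("academic", "vocational", "income", "dependents", "employment")


def unique_patterns(records):
    """Keep the first record seen for each distinct combination of criterion values."""
    seen = set()
    kept = []
    for record in records:
        pattern = tuple(record[name] for name in CRITERIA)
        if pattern not in seen:
            seen.add(pattern)
            kept.append(record)
    return kept


# Made-up applicants for illustration only (not the STMIK Pelita Nusantara data).
applicants = [
    {"academic": "Good", "vocational": "Good", "income": "Low",
     "dependents": "Many", "employment": "Unemployed"},
    {"academic": "Good", "vocational": "Good", "income": "Low",
     "dependents": "Many", "employment": "Unemployed"},   # same pattern as the first
    {"academic": "Sufficient", "vocational": "Low", "income": "High",
     "dependents": "Few", "employment": "Employed"},
    {"academic": "Low", "vocational": "Sufficient", "income": "High",
     "dependents": "Few", "employment": "Employed"},
]

cleaned = unique_patterns(applicants)
print(len(applicants), "records reduced to", len(cleaned), "distinct patterns")
```

In the paper, the same idea reduces 150 applicant records to 16 distinct patterns.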
Fourth, in the data mining stage, the data of the 16 students are classified with the C4.5 algorithm. A decision tree is built to identify objectively the conditions for awarding scholarships from the value of each attribute of the new applicants (academic potential, vocational potential test, parents' income, number of dependents, and employment status). The attribute with the highest gain value among the existing attributes is chosen as the root. Equation (1) is used to calculate the gain, where S is the case set, A is the attribute, n is the number of partitions of attribute A, |S_i| is the number of cases in the i-th partition, and |S| is the number of cases in S.

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|}\,\mathrm{Entropy}(S_i) \tag{1}$$

Meanwhile, Equation (2) calculates the entropy value, where S is the case set, n is the number of partitions of S, and p_i is the proportion of S_i to S.

$$\mathrm{Entropy}(S) = \sum_{i=1}^{n} -p_i \log_2 p_i \tag{2}$$
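The entropy and gain calculations of Equations (1) and (2) can be sketched in Python as below. The function names are illustrative, not taken from the paper, and the sketch follows the plain information gain used here rather than the gain ratio that Quinlan's C4.5 also defines.

```python
import math


def entropy(class_counts):
    """Entropy of a case set given the number of cases in each class."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:  # an empty class contributes nothing (0 * log 0 is taken as 0)
            p = count / total
            result -= p * math.log2(p)
    return result


def gain(parent_counts, partition_counts):
    """Information gain of splitting the parent set into the given partitions.

    parent_counts    -- class counts of the whole case set S
    partition_counts -- one list of class counts per partition S_i
    """
    total = sum(parent_counts)
    remainder = sum(sum(part) / total * entropy(part) for part in partition_counts)
    return entropy(parent_counts) - remainder


# A set with 8 "Accepted" and 8 "Rejected" cases has the maximum entropy of 1.
print(entropy([8, 8]))
# A split that separates the classes perfectly recovers all of that entropy.
print(gain([2, 2], [[2, 0], [0, 2]]))
```

Passing class counts instead of raw records keeps the sketch close to the way the paper tabulates "Accepted" and "Rejected" cases per attribute value.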
In making a decision tree, the following are counted: the total number of cases, the number of cases for the decision of "Accepted" (S1), the number of cases for the decision of "Rejected" (S2), and the cases divided by the attributes of academic potential, vocational potential, parents' income, number of dependents, and employment status. Then, the gain is calculated for each attribute, which requires the entropy values. The attribute with the highest gain value is selected as the root.
Fifth, the purpose of interpretation or evaluation is to obtain objectively the results of the decision analysis of students who receive scholarships. It is based on the attributes of academic potential, vocational potential, parents' income, number of dependents, and employment status. The data are analyzed, and the method is implemented to get the desired results.

III. RESULTS AND DISCUSSIONS
Based on the test data in Table 1, the attribute to be used as the root is determined by calculating the information gain of each attribute. The attribute with the highest gain value becomes the root. The entropy value is needed to determine the gain. To find the entropy of each case, the number of cases for each sub-criterion value is counted. The sub-criteria (see Table 2) are transformed into the following form for finding entropy. The entropy value for each attribute is calculated using Equation (3). Entropy (total) is calculated over all decisions: "Accepted" (S1) has 8 cases, and "Rejected" (S2) has 8 cases. Hence, the total number of cases is 16.

$$\mathrm{Entropy}(\mathrm{total}) = \sum_{i=1}^{n} -p_i \log_2 p_i \tag{3}$$

As seen in Table 1, for the attribute of academic potential, the value "Good" has 2 "Accepted" cases and 0 "Rejected" cases, for a total of 2 cases. Equations (4), (5), and (6) calculate the entropy of each value. The same is done for the other attributes. When all the entropy and gain values for each attribute have been calculated, the results are recorded in Table 3. The calculations on the Table 1 data show that the attribute with the highest gain value is academic potential, with a gain value of 0.39316. So, this attribute is used as the root node. Cases with a low value are classified as "Rejected", whereas cases with an "enough" value are still mixed and have to be recalculated. Node 1 of the decision tree can be seen in Figure 1.
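As an illustration of how the root gain can be checked by hand, the worked calculation below applies the entropy and gain formulas to academic potential. Only the 8/8 class split and the "Good" counts (2 accepted, 0 rejected) are stated in the text. The counts for "Enough" (6 accepted, 4 rejected) and "Low" (0 accepted, 4 rejected) are assumptions, inferred so that the result agrees with the reported gain of 0.39316, because Table 1 is not reproduced here.

```latex
\begin{align*}
\mathrm{Entropy}(\mathrm{total})
  &= -\tfrac{8}{16}\log_2\tfrac{8}{16} - \tfrac{8}{16}\log_2\tfrac{8}{16} = 1 \\
\mathrm{Entropy}(\mathrm{Good})
  &= -\tfrac{2}{2}\log_2\tfrac{2}{2} = 0 \\
\mathrm{Entropy}(\mathrm{Enough})
  &= -\tfrac{6}{10}\log_2\tfrac{6}{10} - \tfrac{4}{10}\log_2\tfrac{4}{10}
   \approx 0.97095 && \text{(assumed counts)} \\
\mathrm{Entropy}(\mathrm{Low})
  &= -\tfrac{4}{4}\log_2\tfrac{4}{4} = 0 && \text{(assumed counts)} \\
\mathrm{Gain}(\mathrm{total}, \text{academic potential})
  &= 1 - \left(\tfrac{2}{16}\cdot 0 + \tfrac{10}{16}\cdot 0.97095 + \tfrac{4}{16}\cdot 0\right)
   \approx 0.39316
\end{align*}
```

The two pure branches ("Good" and "Low") have zero entropy and become leaves, which is why only the mixed branch needs further splitting.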
Furthermore, Node 1.1 is calculated as the root of the next sub-tree. It is done in the same way, by calculating the entropy value of the remaining attributes: vocational potential, parents' income, number of dependents, and employment status. After the entropy is calculated, the gain for each attribute is measured. Entropy (academic potential, enough) is calculated with the following equations.
When all entropy and gain values have been calculated, the results are put in Table 4. It can be seen that the attribute with the highest gain is vocational potential, with a value of 0.57095. Thus, vocational potential becomes the next root node, and the decision tree in Figure 2 is formed.
Next, the research calculates Node 1.2 as the root. The entropy and gain values are calculated in the same way for the remaining attributes of parents' income, number of dependents, and employment status. After the entropy is calculated, the gain is measured for each attribute. Entropy (vocational potential, C), where C denotes the sufficient value, has the following equations.
After the entropy and gain values are calculated, the results are put in Table 5. It can be seen that the attribute with the highest gain is parents' income, with a value of 1. So, it becomes the next root node. For this attribute, the value T denotes high income, and the value R denotes low income. The decision tree formed can be seen in Figure 3.
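The two sub-tree gains can be checked by hand in the same way. The counts used below are assumptions inferred from the reported gain values and the final rules, since Tables 4 and 5 are not reproduced here: the "enough" academic branch is taken to hold 10 cases (6 accepted, 4 rejected), split by vocational potential into Good (4 accepted), Sufficient (2 accepted, 2 rejected), and Low (2 rejected). The Sufficient group is then split by parents' income into Low (2 accepted) and High (2 rejected).

```latex
\begin{align*}
\mathrm{Entropy}(\text{academic} = \mathrm{Enough})
  &= -\tfrac{6}{10}\log_2\tfrac{6}{10} - \tfrac{4}{10}\log_2\tfrac{4}{10} \approx 0.97095 \\
\mathrm{Gain}(\text{vocational potential})
  &= 0.97095 - \left(\tfrac{4}{10}\cdot 0 + \tfrac{4}{10}\cdot 1 + \tfrac{2}{10}\cdot 0\right)
   = 0.57095 \\
\mathrm{Entropy}(\text{vocational} = \mathrm{Sufficient})
  &= -\tfrac{2}{4}\log_2\tfrac{2}{4} - \tfrac{2}{4}\log_2\tfrac{2}{4} = 1 \\
\mathrm{Gain}(\text{parents' income})
  &= 1 - \left(\tfrac{2}{4}\cdot 0 + \tfrac{2}{4}\cdot 0\right) = 1
\end{align*}
```

A gain of 1 at the last node means that parents' income separates the remaining cases perfectly, so no further splitting is needed.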
The rules obtained based on the decision tree formed are as follows: IF Academic Potential = Good THEN Decision = Accepted; IF Academic Potential = Sufficient AND Vocational Potential = Good THEN Decision = Accepted; IF Academic Potential = Sufficient AND Vocational Potential = Sufficient AND Parents' Income = Low THEN Decision = Accepted; IF Academic Potential = Sufficient AND Vocational Potential = Sufficient AND Parents' Income = High THEN Decision = Rejected; IF Academic Potential = Sufficient AND Vocational Potential = Low THEN Decision = Rejected; and IF Academic Potential = Low THEN Decision = Rejected.
After the rules are obtained from the C4.5 algorithm classification process, further testing is carried out using one of the data mining applications, Rapid Miner. A decision tree and rules are obtained from the test results with the Rapid Miner application, which can be seen in Figure 4. The tree in the graph is built from the same dataset as Table 1. The roots formed in the application test have the same shape as the manual calculation performed with the C4.5 algorithm in Figure 3. Then, in Figure 5, the rules are presented as a text description.
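The six rules listed above can be expressed as a small function, which makes it easy to see that the tree is evaluated from the root downwards and that later criteria are only consulted when earlier ones are inconclusive. The function name, the argument names, and the choice of "Sufficient" as the label for the middle level are illustrative assumptions.

```python
def decide(academic, vocational=None, income=None):
    """Apply the six IF-THEN rules of the decision tree to one applicant.

    Levels are given as strings: "Good", "Sufficient", or "Low" for the two
    potential scores, and "Low" or "High" for parents' income.
    """
    # Node 1: academic potential
    if academic == "Good":
        return "Accepted"
    if academic == "Low":
        return "Rejected"
    if academic != "Sufficient":
        raise ValueError(f"unknown academic potential: {academic!r}")

    # Node 1.1: vocational potential (only reached when academic is Sufficient)
    if vocational == "Good":
        return "Accepted"
    if vocational == "Low":
        return "Rejected"
    if vocational != "Sufficient":
        raise ValueError(f"unknown vocational potential: {vocational!r}")

    # Node 1.2: parents' income (only reached when both scores are Sufficient)
    if income == "Low":
        return "Accepted"
    if income == "High":
        return "Rejected"
    raise ValueError(f"unknown parents' income: {income!r}")


print(decide("Good"))
print(decide("Sufficient", "Sufficient", "High"))
```

Number of dependents and employment status do not appear as arguments because no rule in the tree refers to them.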

IV. CONCLUSIONS
From the discussion results, it is concluded that the decision tree with the C4.5 algorithm can be used to classify the attributes used in analyzing prospective scholarship recipients. It can be a tool for making decisions about scholarship recipients and can shorten the decision-making time. It identifies prospective students who are entitled to a scholarship through the most influential attributes, namely academic potential, vocational potential, and parents' income. Thus, three of the five variables used in selecting prospective scholarship recipients are influential. The use of these three variables follows the branches in Figure 3, which are translated into rules. Because the rules of the assessment are known, the time needed for the selection can be shortened. The assessment can start with the criterion that has the highest priority, namely the highest root, and be carried out only on the influential criteria. Then, the results obtained can be more efficient and on target. It is hoped that future studies add variables related to socioeconomic status, such as the parents' occupation, electricity bills, and homeownership status, to expand the research results. Hence, the scholarship recipient can be the right person in terms of both academic and socioeconomic status. At the same time, increasing the number of variables will allow the algorithm to work with larger data sets. In addition, future research can use other approaches in classifying scholarship patterns to compare the performance of each algorithm, so universities can use the decision-making tools that best suit their needs.