Data-Driven Approach for Credit Risk Analysis Using C4.5 Algorithm

Credit risk is the risk of bad credit, which causes bank losses because disbursed funds are not repaid and expected interest income is not received. However, credit services must still be provided to earn a profit. Without an approach that supports policymaking to reduce credit risk, exposure to that risk becomes even greater. Data processing techniques are therefore needed that turn data into information on which credit risk policies can be based. The research presents an application of data mining to credit risk analysis, using the C4.5 algorithm to extract useful information from data. The research uses a sample of 30 banks with six factors (credit growth, net interest margin, type of bank, capital ratio, company size, and bank compliance level). Credit risk is evaluated by building a decision tree and testing it with the RapidMiner application. The results show that credit growth is the main factor causing credit risk, followed by bank compliance level, net interest margin, and capital ratio. Based on these results, the C4.5 algorithm can be used to analyze credit risk, producing results that are easy to understand and useful to banks.


I. INTRODUCTION
Banks play a very important role in moving the national economy (Lihani, Ngadiman, & Hamidi, 2013). However, in carrying out their business, banks often experience upheavals. These are caused not only by the world economy but also by the banks' own borrowing or credit activities (Hakim & Oktaria, 2018; Saputro, Sarumpaet, & Prasetyo, 2019).
Moreover, the banking world provides many services, one of which is lending. Credit is a form of service that is routinely carried out in banking activities (Hanif, 2015). Existing research usually analyzes only one bank, or banks of the same type, that provide credit services, based on the pattern of debtors who have been given credit before (Aji & Manda, 2021; Cristina & Artini, 2018; Wijaya & Tiyas, 2019).
Many factors can cause credit risk, such as credit growth, net interest margin, bank type, capital ratio, company size, and level of compliance (Hakim & Oktaria, 2018). So, it is essential to pay attention to the credit risk that banks may experience. Furthermore, technological developments make it possible to produce information quickly, and data mining techniques can be used to analyze the data on these factors correctly and turn them into usable information. Doing so can help avoid credit-related upheaval in the banking world and assist in making policies to avoid credit risks that may arise.
Data mining refers to the process of extracting knowledge from big data (Wang, Zhou, & Xu, 2019). Data mining is also mentioned as a series of processes to explore added value in the form of knowledge that has not been known manually from a data set (Susanto & Indriyani, 2019). Data mining is also the mining or discovery of new information by looking for certain patterns or rules from a very large amount of data (Fauzi, Marpaung, & Pardede, 2018). Data mining is also the process of extracting data into information that has not been conveyed before. With the proper techniques, the data mining process will provide optimal results (Riandari & Sihotang, 2020). Data mining is also defined as a process that employs one or more computer learning techniques (machine learning) to analyze and extract knowledge automatically (Zulfami, 2017). According to Le (2022), data mining is a technology that is an organic combination of data warehouse technology and comprehensive database integration technology. It supports various algorithms to meet different mining needs. Data that have gone through the Knowledge Discovery in Database (KDD) process is also known as control data (data-driven). With the rapid development of data mining, many studies are used to predict a case, especially banking (Subarkah, Pambudi, & Hidayah, 2020).
Various techniques are available in data mining for knowledge extraction, including predicting, estimating, associating, clustering, and classifying (Ariawan, 2019). Similarly, according to Harlina (2018), data mining is divided into several groups based on the tasks that can be done: description, estimation, prediction, classification, clustering, and association. First, descriptions of patterns and trends often provide possible explanations for a pattern or trend. Second, estimation is almost the same as classification, except that the target variable is more numerical than categorical. The model is built using a complete record that provides the value of the target variable as the predicted value.
Furthermore, the estimated value of the target variable is made based on the value of the predictive variable. Third, prediction is almost the same as classification and estimation, except that the value of the results will be in the future. Fourth, in the classification, there is a categorical variable target. Fifth, clustering is grouping records and observing and forming classes of objects with similarities with one another and dissimilarities with records in other clusters. Last, the task of association in data mining is to find attributes that appear at one time.
Classification is also one of the main tasks of data mining. Classification means analyzing data patterns in the training set to find an accurate description model of each category and generalizing the known structures to apply them to new data. The classification procedure includes data acquisition, feature selection, model selection, training, and evaluation. Data to be used for training and testing must be collected beforehand. Then, feature selection is influenced by previous feature descriptions from the data set. Classifiers must be trained to define system parameters. Usually, there are several repetitions of the previous procedure based on the results of the previous evaluation to create better results (Liu, Jin, & Liu, 2011). Classification is the process of finding a model or function that distinguishes concepts or data classes. It predicts the object class whose class label is unknown based on the analysis of training data (data objects whose class is known) (Afrianto, Suseno, & Warsito, 2020). It is the most commonly applied data extraction technique to predict categorical attribute values (discrete or nominal) (Bedregal-Alpaca, Cornejo-Aparicio, Zarate-Valderrama, & Yanque-Churo, 2020).
One of the extraction techniques often used in prediction and classification is the C4.5 algorithm. C4.5 is a well-known decision tree classification algorithm that is preferred because of its advantages. For example, it can process numeric (continuous) and discrete data, handle missing attribute values, and generate rules that are easy to understand, and it is among the fastest of such algorithms.
In addition, the algorithm is a collection of commands written systematically to solve mathematical logic problems. Understanding the C4.5 algorithm can be used to control a device. Meanwhile, the decision tree can be interpreted as a powerful way of predicting or clarifying. Decision trees can divide large data sets into sets. There are many nodes in the decision tree and a number of nodes representing tests on a particular attribute, which have spread to the sample size in the lowest category of leaf nodes. There are many types of decision trees. The most famous algorithm in the industry is developed by the Rosquin Institute, which is mainly used to generate decision trees (An & Zhou, 2022).
The C4.5 algorithm is also one of the algorithms for converting large collections of facts into a decision tree that represents rules. The purpose of forming a decision tree in the C4.5 algorithm is to make existing problems easier to solve. The tree produced by the C4.5 algorithm is turned into rules in stages (Kurniawan, Anggrawan, & Hairani, 2020). The decision tree is thus a classification method that uses a tree structure in which each node represents an attribute, each branch represents a value of the attribute, and each leaf represents a class. The top node of the decision tree is called the root (Hozeng & Aisa, 2016).
The decision tree approach can potentially improve prediction accuracy as it plays a promising role in decision-making (Ramos, Faria, Morais, & Vale, 2022). Another previous research predicts lung cancer risk using the support vector machines classification technique, C4.5, and Naive Bayes algorithms for health services analysis. It concludes that the C4.5 algorithm predicts better (Pradeep & Naveen, 2018). Another previous research applies data mining using the decision tree method for predicting credit risk determination and focusing on building a BRI credit scoring model with a decision tree technique. It concludes that the decision tree technique can build a model with an objective and produce an easy-to-understand model and a high level of accuracy (Hozeng & Aisa, 2016).
Based on the previous explanation, a credit risk evaluation is carried out for lending/credit service providers in the banking sector. The research is based on the factors that cause credit risk and uses a data mining approach with the C4.5 algorithm, with the data pre-processed according to the needs of that approach so that the research runs in accordance with the expected goals. The research aims to see how data mining with the C4.5 algorithm can be applied to analyze credit risk. The results take the form of a decision tree that describes the pattern of causes of bad loans, based on variables that are relevant to banks. These results are also expected to be a tool that helps banks providing credit services adopt appropriate policies so that credit risk does not occur, including for non-bank service providers. Service providers can then not only survive in running their business but also increase productivity and achieve the company's main goals.

II. METHODS
In the research, the data mining technique used for classification is the C4.5 algorithm. The C4.5 algorithm constructs a decision tree from training data given as cases or records (tuples) in a database (Riandari & Simangunsong, 2019). The C4.5 decision tree algorithm was proposed by J. R. Quinlan in 1993 (Wang & Gao, 2021). It is a decision tree algorithm based on the ID3 algorithm, and it overcomes shortcomings of ID3 in application. In the C4.5 algorithm, the information gain rate (gain ratio) is used as the basis for selecting test attributes (Wang & Gao, 2021). The C4.5 algorithm is chosen in the research because it is expected to predict credit risk with a good level of accuracy.
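For reference, the information gain rate is usually written as below. This is the textbook definition, not a formula taken from the research; the symbols $S$ (the set of cases), $A$ (an attribute with $n$ values), and $S_i$ (the cases taking the $i$-th value of $A$) are notation assumed here.

```latex
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)},
\qquad
\mathrm{SplitInfo}(S, A) = -\sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}
```

Dividing by the split information penalizes attributes that fragment the cases into many small groups, a known weakness of selecting attributes by plain information gain as ID3 does.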
The research has several stages, as seen in Figure 1. The initial stage is to formulate the problem in accordance with the problems that occur and the goals to be achieved. Here, the problem is how to apply data mining with the C4.5 algorithm in analyzing credit risk. After the problem is formulated, the research objective is set to answer it, namely to apply data mining with the C4.5 algorithm in credit risk analysis. Data and information on data mining with the C4.5 algorithm are then collected through a literature study of books, previous research, and other media related to the research.
The implementation of the C4.5 algorithm analysis has several stages. After the problem to be analyzed is identified, the data related to the research objectives are processed. The C4.5 algorithm is applied only after the data are complete and fit the needs of the analysis. Before that, the data are prepared according to the KDD stages.
First, in data selection, the research object is a sample of 30 banks providing loan/credit services. The qualitative data contain information about every factor that can lead to credit risk: credit growth, net interest margin, bank type, capital ratio, company size, and bank compliance level.
Second, in pre-processing/data cleaning, the 30 selected bank records, each with information on every factor, are cleaned to eliminate records that are inappropriate/noisy or duplicated. This stage discards bank records with no value and records that have the same value in all factors used. For example, if alternatives A and B are different alternatives but have the same value in all the variables used, one of them is discarded. In other words, distinct patterns are searched for; if a similar pattern is found, only one representative pattern is kept, and the rest are removed. Because none of the samples is noisy or duplicated, this stage retains all 30 bank records.
Third, in transformation, the previously processed qualitative data are grouped and transformed into an assessment form suitable for data mining. In the research, the data are converted into quantitative form to make them easier to define during testing.
Based on the 30 data samples, the classification stage is then carried out using the C4.5 algorithm, which produces a decision tree. The tree identifies patterns that cause credit risk in banks that provide loan/credit services based on the existing factors. To make a decision tree, the first step is to count the total number of cases, the number of cases with low credit risk (S1), and the number of cases with high credit risk (S2). These cases are then divided by credit risk factor (credit growth, net interest margin, bank type, capital ratio, company size, and bank compliance level), and the counts are calculated for each value of each factor.
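The cleaning and counting steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the helper functions `to_record`, `clean`, and `count_cases`, and the five sample rows are hypothetical and are not the 30 bank records of Table 1.

```python
from collections import Counter

# Hypothetical field names for the six factors used in the research.
FACTORS = ["credit_growth", "net_interest_margin", "bank_type",
           "capital_ratio", "company_size", "compliance_level"]

# Hypothetical rows: six factor values followed by the credit risk label.
RAW_ROWS = [
    ("high", "low", "public", "big", "big", "high", "low"),
    ("high", "low", "public", "big", "big", "high", "low"),       # duplicate pattern
    ("low", "high", "country", "small", "small", "low", "high"),
    ("medium", None, "public", "small", "big", "low", "high"),     # missing value
    ("medium", "high", "country", "big", "small", "high", "low"),
]

def to_record(row):
    """Turn a raw tuple into a dict keyed by factor name plus 'credit_risk'."""
    record = dict(zip(FACTORS, row[:-1]))
    record["credit_risk"] = row[-1]
    return record

def clean(records):
    """Drop records with a missing factor value and repeated factor patterns."""
    seen, kept = set(), []
    for record in records:
        pattern = tuple(record[f] for f in FACTORS)
        if None in pattern or pattern in seen:
            continue
        seen.add(pattern)
        kept.append(record)
    return kept

def count_cases(records, factor):
    """Count (factor value, credit risk) pairs: the S1/S2 split per value."""
    return Counter((r[factor], r["credit_risk"]) for r in records)

cleaned = clean([to_record(row) for row in RAW_ROWS])
growth_counts = count_cases(cleaned, "credit_growth")
print(len(cleaned), dict(growth_counts))
```

Keeping only the first record of each repeated pattern mirrors the "one representative pattern" rule in the text; a real pipeline would also need a policy for identical patterns with conflicting risk labels, which the sketch does not address.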
There are several steps involved in making a decision tree. The first is to select the factor that becomes the root. The gain value of every factor is calculated, and the factor with the highest gain value is chosen as the root. The entropy value is needed to calculate the gain.
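The entropy and gain referred to here are conventionally defined as below. These are the standard decision tree definitions, given as an assumption about the form the research's own equations take. In this notation, $S$ is a set of cases, $p_j$ is the proportion of cases in class $j$ (here S1 or S2, so $k = 2$), and $S_i$ is the subset of cases taking the $i$-th of the $n$ values of attribute $A$.

```latex
\mathrm{Entropy}(S) = -\sum_{j=1}^{k} p_j \log_2 p_j,
\qquad
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|}\,\mathrm{Entropy}(S_i)
```

By convention, a term with $p_j = 0$ contributes 0, so a group whose cases all belong to one class has an entropy of 0.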
According to Ginting, Kusrini, and Taufiq (2020), the purpose of the evaluation stage is to obtain the results of the analysis regarding banks providing loan/credit services. In addition, it is to see the potential to accurately assess credit risk based on credit growth factors, net interest margin, bank type, capital ratio, company size, and bank compliance level. Finally, the results are tested with one of the data mining testing applications, RapidMiner.

III. RESULTS AND DISCUSSIONS
The KDD process runs through the data selection, pre-processing, transformation, data mining, and evaluation stages. The testing data obtained are used in the discussion, as seen in Table 1. The data presented in Table 1 consist of 30 bank records with an assessment of each attribute. The assessments fall into five categories, namely high, medium, low, big, and small. The data have gone through the stages of selection, pre-processing, and transformation. They have distinct patterns and are ready to be analyzed using the C4.5 algorithm.
Based on Table 1, the initial step is to determine the factor that becomes the root by calculating the gain value of each factor. The factor with the highest gain value becomes the root. The entropy value is needed to calculate the gain, so the number of cases for each factor value is counted first. These counts are transformed into the form shown in Table 2, which presents the five categories of attribute ratings in a form that makes the value of each attribute easier to track during the assessment.
Entropy (total) is calculated from the number of cases at the low-risk level (S1), which is 19, and the number of cases at the high-risk level (S2), which is 11, out of a total of 30 cases. Entropy (total) is calculated using Equation (2) as follows.
$$\mathrm{Entropy(total)} = -\frac{19}{30}\log_2\frac{19}{30} - \frac{11}{30}\log_2\frac{11}{30} \approx 0.94808$$
The next step is to calculate the entropy value for each factor used in analyzing credit risk, starting with the credit growth factor and using the counts in Table 1. Credit growth has a high value in 8 cases, of which 8 have a low credit risk level and 0 have a high credit risk level. It has a medium value in 13 cases, of which 11 have a low credit risk level and 2 have a high credit risk level. It has a low value in 9 cases, of which 0 have a low credit risk level and 9 have a high credit risk level. Here is the entropy of credit growth, taking $0 \log_2 0 = 0$.
$$\mathrm{Entropy(high)} = -\frac{8}{8}\log_2\frac{8}{8} - \frac{0}{8}\log_2\frac{0}{8} = 0$$
$$\mathrm{Entropy(medium)} = -\frac{11}{13}\log_2\frac{11}{13} - \frac{2}{13}\log_2\frac{2}{13} \approx 0.61938$$
$$\mathrm{Entropy(low)} = -\frac{0}{9}\log_2\frac{0}{9} - \frac{9}{9}\log_2\frac{9}{9} = 0$$
From these results, two categories of the credit growth attribute, namely high and low, have an entropy value of 0, so the search on those branches is complete. It can be concluded that when credit growth is high, the credit risk is low, and when credit growth is low, the credit risk is high. Meanwhile, in the medium category, the entropy value obtained is 0.61938. Because this value is not 0, the assessment of that branch is not complete, and further analysis is needed.
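The entropy figures for the total and for credit growth can be checked with a short Python sketch. The case counts below are an inference rather than values read from Table 1 (which is not reproduced here): they follow from the 30 total cases, the 11 high-risk cases, the 8/13/9 split of credit growth, and the reported entropy of 0.61938 for the middle category. The function name `entropy` and the dictionary layout are choices made for this illustration.

```python
import math

def entropy(class_counts):
    """Shannon entropy (base 2) of a list of class counts; empty classes add 0."""
    total = sum(class_counts)
    # p * log2(1/p) is the same as -p * log2(p) and avoids printing "-0.0".
    return sum((c / total) * math.log2(total / c) for c in class_counts if c > 0)

# (low-risk, high-risk) case counts; inferred as described in the lead-in.
total_counts = (19, 11)
credit_growth = {"high": (8, 0), "medium": (11, 2), "low": (0, 9)}

print(f"Entropy(total) = {entropy(total_counts):.5f}")
for value, counts in credit_growth.items():
    print(f"Entropy(credit growth = {value}) = {entropy(counts):.5f}")
```

Skipping zero counts in the sum implements the usual convention that an empty class contributes nothing, which is why the high and low categories come out as exactly 0.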
The next calculation is for the net interest margin, and the same procedure is followed. The net interest margin has two categories, namely high and low. The entropy values obtained are not equal to 0, so the calculation is not yet complete, and further analysis is carried out when the next node is searched. Here is the calculation of entropy of the net interest margin.
The next calculation is for bank type. The bank type attribute has two categories (country and public), and the entropy values obtained are not equal to 0. So, the calculation is not yet complete, and further analysis is carried out when the next node is searched. Here is the calculation of entropy of the bank type.
The next calculation is for the capital ratio. The capital ratio attribute has two categories (big and small), and the entropy values obtained are not equal to 0. Hence, the calculation is not yet complete, and further analysis is carried out when the next node is searched. Here is the calculation of entropy of the capital ratio.
The next calculation is for company size. The company size attribute has two categories, namely big and small. The entropy values obtained are not equal to 0, so the calculation is not yet complete, and further analysis is carried out when the next node is searched. Here is the calculation of entropy of company size.
The next calculation is for the bank compliance level. The compliance level attribute has two categories (high and low), and the entropy values obtained are not equal to 0. So, the calculation is not yet complete, and further analysis is carried out when the next node is searched. Here is the calculation of entropy of the bank compliance level.
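Then, the researchers look for the gain value for each attribute, which determines the root. The first gain value calculated is Gain (total, credit growth). It is obtained by weighting the entropy value of each credit growth category by its share of the cases, adding up the weighted values, and subtracting the sum from Entropy (total). The result can be seen as follows.
$$\mathrm{Gain(total,\ credit\ growth)} = 0.94808 - \left(\frac{8}{30}(0) + \frac{13}{30}(0.61938) + \frac{9}{30}(0)\right) \approx 0.67968$$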
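As a check on the reported gain of 0.67968 for credit growth, the calculation can be sketched in Python. The (low-risk, high-risk) counts per category are inferred from the totals given in the text, not copied from Table 1, and the helper names `entropy` and `gain` are assumptions of this illustration. The sketch computes plain information gain, which is what the manual calculation in this section uses.

```python
import math

def entropy(class_counts):
    """Shannon entropy (base 2) of a list of class counts; empty classes add 0."""
    total = sum(class_counts)
    return sum((c / total) * math.log2(total / c) for c in class_counts if c > 0)

def gain(partition):
    """Information gain of a split.

    `partition` maps each attribute value to its (low-risk, high-risk) counts.
    """
    # Column sums give the class counts before the split.
    totals = [sum(column) for column in zip(*partition.values())]
    n = sum(totals)
    weighted = sum(sum(counts) / n * entropy(counts)
                   for counts in partition.values())
    return entropy(totals) - weighted

# Inferred (low-risk, high-risk) counts for each credit growth category.
credit_growth = {"high": (8, 0), "medium": (11, 2), "low": (0, 9)}
gain_credit_growth = gain(credit_growth)
print(f"Gain(total, credit growth) = {gain_credit_growth:.5f}")
```

The same `gain` function could be applied to the other five factors once their per-category counts are available.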
Next, the gain value for the net interest margin, Gain (total, net interest margin), is calculated. The same steps are taken as for credit growth. The entropy values for each category of the net interest margin, analyzed in the previous stage, are weighted by their share of the cases, added up, and subtracted from Entropy (total). The result can be seen as follows.
Then, the gain value for the bank type, Gain (total, bank type), is calculated. The same steps are taken, using the entropy values for each category of the bank type analyzed in the previous stage. The result can be seen as follows.
Next is the calculation of the gain value for the capital ratio, Gain (total, capital ratio). It likewise uses the entropy values for each category of the capital ratio analyzed in the previous stage. The result can be seen as follows.
Then, the gain value for the company size, Gain (total, company size), is calculated. The same steps are taken, using the entropy values for each category of the company size analyzed in the previous stage.
Then, the gain value for the bank compliance level, Gain (total, bank compliance level), is calculated. It likewise uses the entropy values for each category of the bank compliance level analyzed in the previous stage. The result can be seen as follows.
After the entropy value search has been completed and all entropy values and gain values are known, the entropy value of each factor category and the gain value of each factor are tabulated so that the gain values can be compared easily. The factor with the highest gain value becomes the first root. The data are presented in Table 3.
In Table 3, all the credit risk factors have their calculated entropy and gain values. The factor with the highest gain value is credit growth, with a gain value of 0.67968. Hence, credit growth becomes the root. Based on the entropy value of each factor value, when credit growth has a high value, the credit risk has a low value. Meanwhile, when credit growth has a low value, the credit risk has a high value. The first node formed can be seen in Figure 2.
The next node is found in the same way as the first. The difference is that the cases with high and low credit growth are no longer counted, because their branches are complete (an entropy value of 0) and the information has already been obtained. So, for the following calculation, only the 13 cases with medium credit growth are counted. The same procedure is repeated until every branch is complete, that is, has an entropy value of 0. In the research, the credit risk rules are obtained with credit growth as node 1, bank compliance level as node 2, net interest margin as node 3, and capital ratio as node 4.
Next, the research tests the result with the RapidMiner data mining application, using the 30 samples shown in Table 1. The data are tested in RapidMiner by establishing a relationship between the database and the operators, as shown in Figure 3, which describes the process before classification. In this step, the bank database in Table 1 is imported into the worksheet so that it can be connected to the classification operator, namely the decision tree. Two operator roles need to be set, and the decision tree operator is pulled into the worksheet. The tested dataset is connected to the decision tree operator by dragging the port wires, as seen in the figure, from the dataset to the set role operator and from there to the decision tree operator.
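The manual node-by-node procedure (choose the factor with the highest gain, close branches whose entropy is 0, and repeat on the remaining cases) can be sketched as a small recursive function. This is an illustrative sketch, not RapidMiner's implementation and not full C4.5: it selects attributes by plain information gain, as the manual calculation does, and omits gain ratio, pruning, and continuous attributes. The six toy rows and the names `build_tree`, `gain`, and `entropy` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(c / n * math.log2(n / c) for c in Counter(labels).values())

def gain(rows, attribute, target):
    """Information gain of splitting `rows` on `attribute`."""
    n = len(rows)
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

def build_tree(rows, attributes, target):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    labels = [r[target] for r in rows]
    # A branch is complete when all cases share one class (entropy 0)
    # or when no attribute is left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_tree([r for r in rows if r[best] == value],
                                     remaining, target)
                   for value in sorted({r[best] for r in rows})}}

# Hypothetical toy data, not the research's Table 1.
rows = [
    {"credit_growth": "high", "compliance": "low", "risk": "low"},
    {"credit_growth": "high", "compliance": "high", "risk": "low"},
    {"credit_growth": "medium", "compliance": "high", "risk": "low"},
    {"credit_growth": "medium", "compliance": "low", "risk": "high"},
    {"credit_growth": "low", "compliance": "high", "risk": "high"},
    {"credit_growth": "low", "compliance": "low", "risk": "high"},
]
tree = build_tree(rows, ["credit_growth", "compliance"], "risk")
print(tree)
```

On the toy data, credit growth is chosen as the root, its high and low branches close immediately, and only the medium branch is split further, which is the same shape of reasoning as the manual search described above.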
The following rules are formed based on the graph formed from the RapidMiner test results in Figure 4. First, if credit growth is high, the credit risk is low. Second, the credit risk is low if credit growth is medium and the bank compliance level is high. Third, if credit growth is medium, the bank compliance level is low, and the net interest margin is low, credit risk is high. Fourth, if the credit growth is medium, the bank compliance level is low, the net interest margin is high, and the capital ratio is large, the credit risk is low. Fifth, if credit growth is medium, bank compliance level is low, net interest margin is high, capital ratio is small, and company size is large, credit risk is low. Sixth, if credit growth is medium, bank compliance level is low, net interest margin is high, capital ratio is small, and company size is small, credit risk is high. Last, if credit growth is low, the credit risk is high.
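The seven rules above can be written as a single decision function. The sketch below is a direct transcription of the rules as stated in the text; the function name, the parameter names, and the string values ("high", "medium", "low", "large", "small") are assumptions made for this illustration, and each attribute is assumed to take one of the two values named in the rules.

```python
def classify_credit_risk(credit_growth, compliance_level=None,
                         net_interest_margin=None, capital_ratio=None,
                         company_size=None):
    """Return "low" or "high" credit risk following the seven stated rules."""
    if credit_growth == "high":          # rule 1
        return "low"
    if credit_growth == "low":           # rule 7
        return "high"
    if credit_growth != "medium":
        raise ValueError(f"unknown credit growth value: {credit_growth!r}")
    # From here on, credit growth is medium.
    if compliance_level == "high":       # rule 2
        return "low"
    if net_interest_margin == "low":     # rule 3
        return "high"
    if capital_ratio == "large":         # rule 4
        return "low"
    # Rules 5 and 6: compliance low, margin high, capital ratio small.
    return "low" if company_size == "large" else "high"

print(classify_credit_risk("medium", "low", "high", "small", "large"))
```

Attributes further down the tree are only consulted when the earlier ones do not settle the outcome, so a bank with high or low credit growth can be classified from that single value.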
The research provides different results from previous studies, although some of the variables used in analyzing credit risk are similar. For example, previous research mentions that bank size, leverage, bank age, and competing banks affect credit risk (Syamlan & Jannah, 2019). Another study states that financial inclusion affects bank credit risk, in that an increase in the financial inclusion index increases credit risk (Ghasarma, Muthia, Umrie, Sulastri, & Arianto, 2019). In the research, the variables that affect the credit risk of a bank are credit growth, bank compliance level, net interest margin, and capital ratio.

IV. CONCLUSIONS
Based on the results, it can be concluded that the C4.5 algorithm can be applied to analyze credit risk with predetermined variables. The research results can therefore inform banks in minimizing the occurrence of credit risk and serve as a tool for making policies to overcome and reduce credit risk. Of the six variables used in analyzing credit risk, only four have an effect on the occurrence of credit risk, namely credit growth, bank compliance level, net interest margin, and capital ratio. Based on the results, it will be easier for banks to analyze the possibility of credit risk because the analysis only needs to be done on these four variables. Hence, banks can make policies that help overcome credit risk problems.
The limitation of the research is that a wider set of variables that can affect credit risk is needed. It is hoped that further research can analyze factors that may cause credit risk other than those used here, one of which is world economic factors. In addition, further research can use other approaches to classify credit risk to gain knowledge about each algorithm used. Hence, credit service providers can choose an algorithm that suits their needs.