Diseases Classification for Tea Plant Using Concatenated Convolution Neural Network

Abstract-Plant diseases can cause a significant decrease in tea crop production. Early disease detection can help to minimize the loss. For tea plants, experts can identify diseases by visual inspection of the leaves. However, providing experts for disease identification may be very costly. Machine learning can be used to provide automatic plant disease detection. Currently, deep learning is the state of the art for object identification in computer vision. In this study, the researchers propose Convolutional Neural Networks (CNN) for tea disease detection. The researchers focus on the implementation of concatenated CNNs, namely GoogLeNet, Xception, and Inception-ResNet-v2, for this task. A total of 4727 images of tea leaves are collected, comprising three types of diseases that commonly occur in Indonesia and a healthy class. The experimental results confirm the effectiveness of concatenated CNN for tea disease detection. An accuracy of 89.64% is achieved.

I. INTRODUCTION
TEA (Camellia sinensis) is one of the major agricultural commodities in Indonesia. However, some tea clones are susceptible to pests and diseases. In Indonesia, the main threats to tea plants are blister blight, a disease caused by Exobasidium vexans Massee, and two pests: leafhoppers (Empoasca sp.) and the looper caterpillar (Hyposidra talaca). These pests and diseases have different characteristics that can be distinguished from the leaves. However, experts are still required to identify them manually since some pest and disease symptoms look similar.
Yet providing enough experts to cover vast plantation areas is very costly and often impractical.
Machine learning can be used to build tools for the automatic detection of plant diseases. Such tools can help in early disease detection to minimize the risk of crop failure. Their output can also be used as input for sorting the harvest to determine tea quality. Plant disease detection can be framed as a classification task in machine learning. Classification assigns each data item to one of the target classes. Classification algorithms are usually trained in a supervised manner. In supervised learning, the relation between the features of the data and their class labels is assumed to follow the chosen classification model. Then, during training, the model parameters that minimize a loss function, such as the mean squared error, are selected. Thus, the challenge in building a well-performing classifier is to find the best features or the best classification model.
For object recognition tasks, various studies have proposed good features for object detection or classification. In traditional machine learning, hand-crafted features are usually used as input to the algorithm. Examples of these techniques are Scale Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), and Histograms of Oriented Gradients (HOG). SIFT is very effective in object recognition applications, but it has a high computational complexity [1]. SURF is faster than SIFT and detects points without reducing quality. HOG is a feature descriptor for object detection [2]. It counts occurrences of gradient orientations in localized parts of an image [3].
These three traditional techniques require complex calculations, which makes them more challenging to apply in online applications.
Cite this article as: D. Krisnandi, H. F. Pardede, R. S. Yuwana, V. Zilvan, A. Heryana, F. Fauziah, and V. P. Rahadi, "Diseases Classification for Tea Plant Using Concatenated Convolution Neural Network", CommIT (Communication & Information Technology) Journal 13(2), 67-77, 2019.
For classifiers, Support Vector Machine (SVM) was arguably one of the most popular choices for object recognition before the era of deep learning. Reference [4] proposed SVM for the recognition and detection of tea leaf diseases. Then, Ref. [5] combined SVM with K-Nearest Neighbor (KNN) and geometric moment invariants to improve image recognition results. Moreover, combining SVM with Kernel Principal Component Analysis (KPCA) reduces the dimension of the feature vector [6]. However, SVM usually has to be tuned through its kernel function to perform well, and a suitable kernel is not always easy to find.
Currently, deep learning is a popular technique for many object recognition tasks. Reference [7] reviewed object detection frameworks that use deep learning. They focused on typical generic object detection architectures with modifications to improve performance. Then, Ref. [8] used the CaffeNet architecture to recognize plant diseases. They altered the last layer and the output of the softmax layer to support 15 classes. Deep learning models the nonlinear relations between data and their class labels using stacked multi-layer perceptrons. Hence, it can theoretically fit any function given enough depth and neurons. Therefore, studies have found that deep learning can achieve good performance even when it is given simple, raw features.
Machine learning methods have been used in several studies for plant disease detection. Reference [9] classified plant diseases from images of cucumber leaves using K-means clustering and Sparse Representation (SR). Reference [10] used AlexNet and VGG16 to classify images of tomato leaves into six disease classes and a healthy class with a dataset from PlantVillage. Reference [11] used a deep Convolution Neural Network (CNN) on a dataset of 54306 leaf images from PlantVillage to identify 14 crop species and 26 diseases. Reference [12] identified symptoms of disease in cassava leaves using a CNN model.
Reference [13] identified apple leaf disease using AlexNet. The dataset consisted of 13689 images of apple leaves, and the model achieved an accuracy of 97.62%.
According to Ref. [20], diseases in plants can be caused by bacteria, viruses, and fungi. Viral diseases are the most difficult to diagnose and to control. The characteristics of plants affected by a virus can be observed from the leaves, which become tangled and curly, and the plant shows stunted growth. Leaves attacked by bacteria are usually characterized by small pale spots. Fungal infections can be easily identified through their morphological characteristics.
Tea disease identification has been proposed in some studies. Reference [21] used 26 tea plant samples with typical discoloration symptoms from different tea gardens. They conducted a metagenomic analysis based on next-generation sequencing. Reference [22] tried to identify diseases in tea plants. The diseases included algal leaf spot, brown blight, gray blight, blister blight, horsehair blight, twig dieback, and stem canker.
Similarly, Ref. [23] used CNNs to automatically recognize plant diseases from images of tea leaves. A CNN model called LeafNet was proposed in the study. It was a sequential model in which each layer was stacked on top of the preceding layers, and the information flowed in only one direction.
Reference [24] conducted a study on the recognition of disease in tea leaves using a Neural Network Ensemble (NNE) for pattern recognition. The study also implemented sequential networks with a multi-layer perceptron. However, the sequential model may have limitations, especially when networks with many layers are implemented. First, the networks may lose some information due to the pooling layers. This issue applies in particular to CNN. Second, the networks may need more training data due to the high number of parameters to be optimized. Lastly, the parameters of the networks are more susceptible to local minima.
For these reasons, the researchers choose concatenated CNN for this task. The concatenated network allows information to flow in more than one direction. By doing so, one flow can carry information from previous layers that may otherwise have been lost. In addition, previous researchers find that using a concatenated network makes the parameters less susceptible to local minima. Previous studies have strongly indicated that concatenated CNN achieves better performance than the sequential model [25][26][27][28][29]. In addition to concatenation, deep residual learning is also used to improve performance [28,29]. The researchers apply three concatenated CNN architectures in this study, namely GoogLeNet, Xception, and Inception-ResNet-v2.

II. RESEARCH METHOD
The flow diagram of this research method can be seen in Fig. 1. The researchers label the data with four classes and group them into three classification tasks. The first task combines healthy tea plants and blister blight. The second task consists of healthy tea plants, blister blight, and Empoasca sp. The last task combines all data: healthy tea plants, blister blight, Empoasca sp., and looper caterpillar.
The researchers use 80% of the data for training, 10% for validation, and 10% for testing. Three concatenated CNN architectures are used in this study: GoogLeNet, Xception, and Inception-ResNet-v2. All architectures use the same training and validation sets and the Rectified Linear Unit (ReLU) activation function. The batch size is 10.
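The 80/10/10 split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `split_dataset`, the fixed seed, and the placeholder file names are assumptions, and the per-subset counts in the paper may differ depending on how rounding and class balance were handled.

```python
import random


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle items and split them into training, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed only for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for the 4727 collected images.
image_paths = [f"leaf_{i:04d}.jpg" for i in range(4727)]
train, val, test = split_dataset(image_paths)
print(len(train), len(val), len(test))
```

Because the three subsets are slices of one shuffled list, no image can appear in more than one subset.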

A. Convolution Neural Network
CNN is a commonly used architecture for many object recognition tasks. It is a variant of the Multi-Layer Perceptron (MLP). In an MLP, every node of a layer is connected to all nodes of the preceding layer, but this is not the case in a CNN, where each node is only connected to some neighboring nodes of the preceding layer. This benefits object recognition tasks, in which nearby pixels of the image are learned together to determine the object class. Another advantage of CNN for image data is that it is designed to process data with two or more dimensions, so the neurons in a CNN are arranged in two or more dimensions. Data propagating through a CNN undergo linear operations with different weight parameters. The linear operation in a CNN is a convolution with four-dimensional weights, which are a collection of convolution kernels. In the CNN algorithm, the input from the previous layer is not a one-dimensional array but a two-dimensional array. The advantage of using more than one dimension is that the overall scale of an object is retained. The scale of the object matters, and the input does not lose its spatial information when the features are extracted and classified. Thus, the CNN algorithm can improve accuracy.
CNN forms its neurons in three dimensions (length, width, and height) in a layer. Figure 2 shows CNN in three dimensions in one of the layers.
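To make the local connectivity concrete, the following sketch computes one feature map with a small kernel in plain Python. It is an illustration only: the input values and the 2 × 2 kernel are made up, and, as in most deep learning libraries, the kernel is applied without flipping (cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Each output node only sees a kh x kw neighbourhood of the input.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Made-up 4 x 4 single-channel input and 2 x 2 kernel.
image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [0, 1, 2, 3],
]
kernel = [
    [1, 0],
    [0, -1],
]
feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

Each output value depends on only four neighboring inputs, and the same four weights are reused at every position, whereas a fully connected layer between the same 16 inputs and 9 outputs would need 144 weights.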

B. Concatenated Convolution Neural Network
Many studies that implement CNN focus on depth. Examples of deep CNNs (DCNN) include LeNet, AlexNet, ZFNet, VGGNet, and ResNet. However, very deep networks are prone to overfitting. Moreover, stacking large convolution operations is computationally expensive. For these reasons, the concatenated CNN module was developed. The module focuses not only on depth but also on width. Examples of concatenated CNN are GoogLeNet and Xception.
The concatenated CNN concept is developed from the inception module. The inception module is used for more efficient computation and deeper networks. It reduces dimensionality with stacked 1 × 1 convolutions. Figure 3 shows the concept of the inception module. The input is convolved with three different filter sizes: 1 × 1, 3 × 3, and 5 × 5. Additionally, max pooling is performed. The outputs are concatenated and sent to the next inception module.
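The concatenation step can be illustrated with a short shape calculation. The sketch below assumes that every branch uses stride 1 and padding that preserves the spatial size, so the branch outputs can be joined along the channel axis; the function name and the filter counts are illustrative assumptions, not values reported in this study.

```python
def inception_output_shape(height, width, branch_filters):
    """Shape after concatenating the branch outputs along the channel axis.

    Assumes all branches keep the spatial size (stride 1, 'same' padding),
    so only the number of channels changes.
    """
    return (height, width, sum(branch_filters.values()))


# Illustrative numbers of filters per parallel branch.
branch_filters = {"1x1": 64, "3x3": 128, "5x5": 32, "pool": 32}
output_shape = inception_output_shape(28, 28, branch_filters)
print(output_shape)
```

The module therefore becomes wider, in the sense of producing more channels, without adding another sequential layer.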
One of the problems in training networks, especially deep neural networks, is vanishing and exploding gradients. This problem has long been a major barrier to training deep neural networks. When a very deep network is trained, the derivatives or slopes can become very large or exponentially small, which makes training difficult. The concatenated concept can reduce these problems.
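A simplified scalar model shows why depth causes this effect. By the chain rule, the gradient that reaches an early layer is a product of per-layer derivatives; if each factor is slightly below or above 1, the product shrinks or grows exponentially with depth. The per-layer factors 0.5 and 1.5 and the depth of 50 are arbitrary assumptions chosen for illustration.

```python
def gradient_scale(per_layer_factor, depth):
    """Product of identical per-layer derivatives across `depth` layers."""
    return per_layer_factor ** depth


vanishing = gradient_scale(0.5, 50)  # each layer halves the gradient
exploding = gradient_scale(1.5, 50)  # each layer grows the gradient by 50%
print(f"vanishing: {vanishing:.3e}")
print(f"exploding: {exploding:.3e}")
```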
In this research, the researchers use three kinds of concatenated CNN architecture. Those are GoogLeNet, Xception, and Inception-ResNet-v2.

C. GoogLeNet
The GoogLeNet architecture is proposed by Ref. [26]. GoogLeNet optimizes both the depth and the width of the network to raise accuracy. The performance of deep neural networks is usually improved by increasing the size of the network. However, a larger network has more parameters to be trained. GoogLeNet solves this problem by using the inception module. The inception module deploys a parallel combination of convolutions. Then, 1 × 1 convolutions are used for dimension reduction before the 3 × 3 and 5 × 5 convolutions. GoogLeNet has nine inception modules stacked linearly. It is 22 layers deep and has a comparatively low 5 million parameters. The schematic of the inception module with dimension reduction in GoogLeNet is depicted in Fig. 4.
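The saving from the 1 × 1 dimension reduction can be checked with simple arithmetic. The sketch below counts only the convolution weights (biases are ignored), and the channel sizes are illustrative assumptions rather than figures taken from this study.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a square convolution layer, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels


in_channels = 192   # assumed input depth
out_channels = 32   # assumed number of 5 x 5 filters
reduced = 16        # assumed depth after the 1 x 1 reduction

# 5 x 5 convolution applied directly to the input.
direct = conv_weights(5, in_channels, out_channels)

# 1 x 1 reduction first, then the 5 x 5 convolution on the reduced depth.
with_reduction = conv_weights(1, in_channels, reduced) + conv_weights(5, reduced, out_channels)

print(direct, with_reduction)
```

Under these assumptions the branch with reduction needs roughly one tenth of the weights of the direct 5 × 5 convolution.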

D. Xception
Xception, which stands for extreme inception, is first proposed by Ref. [27]. Xception is a CNN architecture based entirely on depthwise separable convolution layers. A depthwise separable convolution is almost identical to the extreme form of the inception module, as depicted in Fig. 5. The difference between the two lies in the order of operations: the depthwise separable convolution performs the channel-wise spatial convolution first and then the 1 × 1 convolution, whereas the inception module performs the 1 × 1 convolution first. The Xception architecture can be seen in Fig. 6.
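The effect of separating the spatial and the 1 × 1 convolutions can be seen by comparing weight counts. This is a sketch under stated assumptions: biases are ignored, and the kernel size and channel numbers are arbitrary examples, not settings from this study.

```python
def standard_conv_weights(kernel_size, in_channels, out_channels):
    """Weights of a regular convolution: every filter spans all input channels."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_weights(kernel_size, in_channels, out_channels):
    """Weights of a depthwise separable convolution."""
    depthwise = kernel_size * kernel_size * in_channels  # one spatial filter per channel
    pointwise = in_channels * out_channels               # 1 x 1 convolution mixing channels
    return depthwise + pointwise


standard = standard_conv_weights(3, 128, 256)
separable = separable_conv_weights(3, 128, 256)
print(standard, separable)
```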

E. Inception-ResNet-v2
The Inception-ResNet-v2 is first introduced by Ref. [29]. The Inception-ResNet-v2 architecture combines the principles of the Inception structure and residual connections. Convolution filters are combined with residual connections, which avoids the degradation problem caused by deep structures. Residual connections can also reduce training time, so training is more efficient than without them. Figure 7 presents the schema and stem for the Inception-ResNet-v2 network. Figure 8 shows the general scheme for scaling the combined Inception-ResNet module. The same scheme applies when an arbitrary subnetwork is used in place of the inception block: the scaling block only scales the last linear activation by an appropriate constant. Scaling down the residuals before adding them to the previous layer activation can stabilize training. In general, residual scaling factors between 0.1 and 0.3 are used before the residuals are added to the accumulated layer activations.
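The residual scaling described above amounts to multiplying the block output by a small constant before the addition. A minimal sketch on plain lists follows; the function name and the example values are assumptions for illustration.

```python
def scaled_residual_add(shortcut, residual, scale=0.1):
    """Add a scaled-down residual to the shortcut activation, element by element."""
    return [s + scale * r for s, r in zip(shortcut, residual)]


# Made-up activations: the residual branch output is much larger than the shortcut.
shortcut = [1.0, 2.0, -1.0]
residual = [10.0, -10.0, 5.0]
output = scaled_residual_add(shortcut, residual, scale=0.2)
print(output)
```

With a scale between 0.1 and 0.3, a large residual branch output changes the accumulated activation only moderately, which is the stabilizing effect mentioned in the text.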

III. EXPERIMENTAL SETUP
In this study, the researchers collect 4727 images of tea leaves. The data are collected from plantations (see Table I). Meanwhile, samples of the collected data are portrayed in Fig. 9. For the experiment, the researchers resize all images to a fixed 64 × 64 pixels and extract the RGB values from the images as features. The data are divided into three subsets: training, validation, and testing sets. The researchers select 80% of the data for training, 10% for validation, and 10% for testing. A tabular representation of the training images in this research is shown in Table II. The researchers have a total of 108 experimental configurations, which vary in the following parameters:
1) Architecture: GoogLeNet, Xception, and Inception-ResNet-v2.
2) Activation function: ReLU. It is one of the activation functions used in deep learning. The function returns 0 if it receives any negative input, and for any positive value x, it returns that value. It is a simple function, but it allows a model to account for non-linearities and interactions well. This is why the researchers choose ReLU as the activation function.
3) Optimizer: RMSprop and Adam. RMSprop addresses the problem of radically diminishing learning rates. It keeps a moving average of the squared gradients for each weight and divides the gradient by the square root of this mean square, hence the name RMSprop (root mean square propagation). Adam is an adaptive learning rate optimization algorithm designed specifically for training deep neural networks [30]. It computes individual learning rates for different parameters. The name comes from adaptive moment estimation: Adam uses estimates of the first and second moments of the gradient to adapt the learning rate for each weight of the neural network.
4) Batch size: 10.
5) Epoch: 50, 100, and 200.
6) Learning rate: 10^-5 and 10^-4.
The researchers implement the concatenated CNN training in Python using the TensorFlow and Keras packages.
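The activation function, the two update rules, and the size of the experimental grid can be sketched in plain Python. This is not the authors' TensorFlow/Keras code. The decay constants (`rho`, `beta1`, `beta2`) and `eps` are commonly used defaults assumed here, since the text does not report them, and the factor of three class groupings (2, 3, and 4 classes) is an assumption inferred from the research method; with it, the listed parameters multiply out to the stated 108 configurations.

```python
import itertools
import math


def relu(x):
    """Return 0 for negative input and the input itself otherwise."""
    return max(0.0, x)


def rmsprop_step(w, grad, avg_sq, lr=1e-4, rho=0.9, eps=1e-8):
    """One RMSprop update for a single weight."""
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2  # moving average of squared gradients
    w = w - lr * grad / (math.sqrt(avg_sq) + eps)  # divide by the root mean square
    return w, avg_sq


def adam_step(w, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Experimental grid; "num_classes" is an assumed factor (see the lead-in).
grid = {
    "architecture": ["GoogLeNet", "Xception", "Inception-ResNet-v2"],
    "activation": ["relu"],
    "optimizer": ["RMSprop", "Adam"],
    "batch_size": [10],
    "epochs": [50, 100, 200],
    "learning_rate": [1e-5, 1e-4],
    "num_classes": [2, 3, 4],
}
configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
print(len(configs))
```

On the first step, the bias-corrected Adam update has a magnitude close to the learning rate regardless of the gradient scale, which is one way to see how the method adapts the step size per weight.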

IV. RESULTS AND DISCUSSION
The results of the experiments are shown in Tables III-VI. The results are the accuracy on the test data. They clearly show that Xception and Inception-ResNet-v2 perform better than GoogLeNet. This is expected since Xception and Inception-ResNet-v2 are significantly more complex CNNs with more layers. However, the results suggest that the performance can be further improved by adding more training data. Performance also tends to decrease when more class labels are used. The best result for two data classes is an accuracy of 98.09% using a learning rate of 10^-4. Meanwhile, the best result for three data classes is an accuracy of 90.05% using a learning rate of 10^-4. For four data classes, the best accuracy is 89.64% using a learning rate of 10^-4.
In general, the results show that the Adam optimizer outperforms the RMSProp optimizer in most cases. The Adam optimizer is considered able to handle sparse gradients on noisy problems.
The number of epochs turns out not to be very significant for accuracy. More training epochs increase accuracy, but not linearly. This result is expected since the network may overfit when a large number of epochs is used. It may be interesting to study the effect of regularization in future work to mitigate this problem.
The progression of the accuracy of the three models is shown in Fig. 10. Figure 10a shows the validation accuracy increasing rapidly in the first five epochs. After that, it rises slowly. The validation accuracy fluctuates until it finally tends to stabilize at around 189 to 200 epochs. The GoogLeNet model converges more slowly than Xception and Inception-ResNet-v2. Based on the figure, GoogLeNet requires a larger number of epochs than Xception and Inception-ResNet-v2.
Meanwhile, Xception and Inception-ResNet-v2 converge much faster than GoogLeNet (Fig. 10b and c). For both architectures, the training accuracy increases rapidly until around the twenty-fifth epoch and remains relatively the same afterward.
For Xception, Fig. 10b shows that the validation accuracy increases rapidly in the first 20 epochs. Then, it increases slowly until around the fortieth epoch. It tends to be stable, with little fluctuation, from epoch 41 to 200. Convergence is reached at around epoch 121, after which the validation accuracy does not increase despite more training iterations.
For Inception-ResNet-v2, Fig. 10c shows that the validation accuracy increases rapidly in the first 17 epochs. It then increases slowly until around the forty-fourth epoch and tends to be stable, with little fluctuation, from epoch 45 to 200. Convergence is reached at epoch 49. This shows that the Inception-ResNet-v2 model converges faster than Xception.
The learning rate controls how quickly a model adapts to the problem. Larger learning rates produce rapid changes and require fewer training epochs. Meanwhile, smaller learning rates require more training epochs because smaller changes are made to the weights at each update. A learning rate that is too small can cause the optimization process to get stuck in a local minimum. A learning rate that is too large can cause the model to converge too quickly to a sub-optimal solution. The learning rates of 10^-5 and 10^-4 prove to be suitable choices for this hyperparameter, which controls how much the model changes in response to the estimated error at each weight update.
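The trade-off can be illustrated with gradient descent on a toy quadratic function. The function f(w) = w^2 and the learning rates below are assumptions chosen to make the effect visible; they are unrelated to the values used in the experiments.

```python
def steps_to_converge(lr, w0=1.0, tol=1e-3, max_steps=10000):
    """Run gradient descent on f(w) = w**2 and count the steps until |w| < tol.

    Returns None if the iterate diverges or does not converge within max_steps.
    """
    w = w0
    for step in range(1, max_steps + 1):
        w -= lr * 2 * w  # the gradient of w**2 is 2w
        if abs(w) < tol:
            return step
        if abs(w) > 1e6:
            return None  # the updates overshoot and grow without bound
    return None


small_lr_steps = steps_to_converge(0.01)
larger_lr_steps = steps_to_converge(0.1)
too_large = steps_to_converge(1.1)
print(small_lr_steps, larger_lr_steps, too_large)
```

In this toy setting, the smaller rate needs many more updates to reach the same tolerance, while a rate that is too large never settles.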
Tables III-VI show that, across all data classes, the learning rate of 10^-5 gives the higher accuracy in 31.48% of the 108 experimental configurations, whereas the learning rate of 10^-4 gives the higher accuracy in 68.52%. The data show that, in this case, the better learning rate is 10^-4. From the researchers' observation, RMSProp is more sensitive with a learning rate of 10^-4 than with 10^-5: the learning rate of 10^-4 gives the higher accuracy in 62.96% of the cases, compared to 37.04% for 10^-5. With the Adam optimizer, 10^-4 gives the higher accuracy in 74.07% of the cases, compared to 25.93% for 10^-5. This result shows that, with either the RMSProp or the Adam optimizer, the learning rate of 10^-4 is more sensitive than 10^-5.
Figure 11 shows the progress of the accuracy of the Inception-ResNet-v2 architecture for the three class groupings with the same parameters (Adam optimizer, 200 epochs, and learning rate = 10^-4). The task becomes harder as tea disease classes are added, from two classes up to the final four classes. The validation accuracy decreases as the difficulty of the task increases.
The data sample used in this research is unbalanced.
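The learning-rate percentages reported from Tables III-VI are consistent with counting which learning rate wins in paired comparisons. The sketch below is an inference, not data from the paper: it assumes that the 108 configurations form 54 pairs that differ only in the learning rate (27 pairs per optimizer), and the win counts are chosen because they reproduce the reported percentages.

```python
def share(wins, total):
    """Percentage of paired comparisons won, rounded to two decimals."""
    return round(100 * wins / total, 2)


# Assumed win counts (inferred, not reported in the paper).
all_pairs = {"1e-5": share(17, 54), "1e-4": share(37, 54)}
rmsprop_pairs = {"1e-5": share(10, 27), "1e-4": share(17, 27)}
adam_pairs = {"1e-5": share(7, 27), "1e-4": share(20, 27)}

print(all_pairs)
print(rmsprop_pairs)
print(adam_pairs)
```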
Even though the results are good, the proposed method is still constrained by the long training time. In future work, batch normalization can be used to speed up the training process and boost accuracy. The training parameters can also be tuned further. To improve accuracy, the dataset can be enlarged and balanced.