Fish Classification System Using YOLOv3-ResNet18 Model for Mobile Phones

Abstract-Every country in the world needs to report its fish production to the Food and Agriculture Organization of the United Nations (FAO) every year. In 2018, Indonesia ranked among the top five countries in fish production, with 8 million tons. Despite this ranking, the fisheries in Indonesia are mostly dominated by traditional and small industries. Hence, a solution based on computer vision is needed to help detect and classify the fish caught every year. The research presents a method to detect and classify fish on mobile devices using the YOLOv3 model combined with ResNet18 as a backbone. For the experiment, the dataset consists of 4,000 images of four types of fish, gathered by scraping the Internet and taken at local markets and harbors. For comparison, two other models are used: SSD-VGG and the auto-generated Huawei ExeML model. The results show that the YOLOv3-ResNet18 model achieves 98.45% accuracy in training and 98.15% in evaluation. The model is also tested on mobile devices, with an inference time of 2,115 ms on Huawei P40 and 3,571 ms on Realme 7. It can be concluded that the research presents a smaller-size model which is suitable for mobile devices while maintaining good accuracy and precision.
Index Terms-Fish Classification System, YOLOv3-ResNet18 Model, Mobile Phone

Received: Feb. 21, 2022; received in revised form: May 27, 2022; accepted: July 27, 2022; available online: March 17, 2023.

I. INTRODUCTION
According to the Food and Agriculture Organization of the United Nations (FAO), Indonesia ranked among the top five countries in fish production, with 8 million tons in 2018 [1]. All countries must monitor and report the fish caught every year to FAO to avoid overfishing. Although Indonesia is a maritime country with 62% of its area covered by the sea, its fisheries are still dominated by small-scale industries that use traditional ships and equipment [2]. Hence, manual calculation is still used to determine the total and type of caught fish, which makes the report inaccurate.
To make the data and report more reliable and accurate, computer vision, a popular research topic, can help sailors and the Indonesian government to count and classify the fish produced by small industries. Computer vision is already used in other fields, such as facial expression recognition [3], healthcare [4,5], and agriculture [6,7]. Many researchers have proposed object detection methods. Examples are Region-Based Convolutional Neural Network (R-CNN) [8], Fast R-CNN [9], and Faster R-CNN [10], which rely on region proposals (Faster R-CNN through its Region Proposal Network), as well as Single Shot Detector (SSD) [11] and You Only Look Once (YOLO) [12]. YOLO is one of the most popular because it has fast detection speed and retains good accuracy. It uses a single neural network that divides the image into multiple regions and predicts a bounding box for each object in the image.
Many studies have also analyzed fish. For example, YOLOv3 is used for underwater object recognition [13]. Another research applies MobileNetv1 as a backbone to detect and classify fish to produce a more accurate and lightweight model [14]. Then, there is an improved model of YOLOv3 that uses four detection scales, k-means++ clustering, and novel transfer learning [15].
Several studies focus on low computational devices. The Tiny-YOLOv3 model is compared on several platforms, such as TensorFlow Lite, OpenCV, and SNPE on Android, and has an accuracy of 33.1% on the COCO dataset with a 33.8 MB file size [23]. Another experiment tests YOLOv3 and Tiny-YOLOv3 SSD models on DJI Drone and Android, achieving 55.3% and 33.1% accuracy with file sizes of 248 MB and 35.4 MB, respectively [24]. Then, AlexNet and GoogLeNet are compared on the smartphone, with 57.4% and 59.3% accuracy on the Art Sculpture dataset [25]. Another approach runs SSD on the smartphone to detect 3D assets and achieves 75% accuracy with a 22 MB file size [26]. Although it is possible to build a model for low computational devices, the previous research shows that models small enough for the smartphone still have low accuracy. Hence, the accuracy needs to be improved while retaining the small file size.
The research contributes an accurate model that is compatible with mobile devices. To achieve this, the current YOLOv3 model is modified by replacing its Darknet53 backbone with ResNet18. The model is also evaluated on mobile devices to compare the inference time and memory usage.

II. RESEARCH METHOD
This section presents the research method carried out in the experiment. The Darknet53 backbone network of the YOLOv3 model is replaced with the ResNet18 backbone to perform fish detection on the image dataset. The YOLOv3 model is known for its detection speed, and the ResNet18 model offers good accuracy while being a smaller backbone than the one in the standard YOLOv3 model [20,27].

A. You Only Look Once (YOLO)
Convolution makes it possible to compute object predictions for an image efficiently, which avoids sliding a window across the image to predict each object. YOLO detects the objects in the image by creating a bounding box for each detected object with a single neural network, rather than predicting one box for the entire image [12,27,28]. YOLO uses Eqs. 1-4 to predict the bounding box, as also shown in Fig. 1. Here, $b_x$ and $b_y$ represent the center coordinates, $b_w$ is the width, and $b_h$ is the height of the predicted bounding box. The $c_x$ and $c_y$ are the top-left coordinates of the grid cell, and $p_w$ and $p_h$ are the anchor dimensions. The $t_x$, $t_y$, $t_w$, and $t_h$ are the raw network outputs, and $\sigma$ is the sigmoid function. The equations are as follows.
$$b_x = \sigma(t_x) + c_x \tag{1}$$
$$b_y = \sigma(t_y) + c_y \tag{2}$$
$$b_w = p_w e^{t_w} \tag{3}$$
$$b_h = p_h e^{t_h} \tag{4}$$
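As an illustration of the bounding box prediction in Eqs. 1-4, the following minimal sketch decodes one box from raw outputs. It assumes the standard YOLOv2/YOLOv3 parameterization (sigmoid on the center offsets, exponential on the size terms); the function names and the numeric inputs are hypothetical and are not taken from the experiment.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function that squashes a raw offset into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Decode raw outputs (t_*) into a box, in grid-cell units.

    c_x, c_y: top-left coordinates of the grid cell.
    p_w, p_h: anchor (prior) width and height.
    """
    b_x = sigmoid(t_x) + c_x      # Eq. 1: center x stays inside the cell
    b_y = sigmoid(t_y) + c_y      # Eq. 2: center y stays inside the cell
    b_w = p_w * math.exp(t_w)     # Eq. 3: width scales the anchor width
    b_h = p_h * math.exp(t_h)     # Eq. 4: height scales the anchor height
    return b_x, b_y, b_w, b_h


# Hypothetical example: all raw outputs are zero, so the center falls in the
# middle of cell (3, 4) and the box keeps the anchor size.
box = decode_box(0.0, 0.0, 0.0, 0.0, c_x=3, c_y=4, p_w=2.0, p_h=1.5)
print(box)
```

Because the sigmoid bounds the offsets to (0, 1), the predicted center cannot leave its grid cell, which is the reason the center terms are not predicted directly.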

After the bounding box is predicted, the model also predicts the confidence score of the bounding box. One predictor is then assigned to be 'responsible' for predicting the object, namely the one with the highest Intersection over Union (IoU) with the ground truth, as shown in Eq. 5. The value of IoU is generated by dividing the area of overlap between area A, the predicted bounding box, and area B, the ground truth bounding box ($A \cap B$), by the union of area A and area B ($A \cup B$).
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|} \tag{5}$$
However, some large objects sometimes create multiple bounding boxes. The multiple bounding boxes detected on the same object are reduced to one by using Non-Maximum Suppression (NMS) [22,29], as shown in Fig. 2. The NMS reduces overlapping bounding boxes by comparing the IoU between boxes to a threshold score (usually more than 0.6). A lower-scoring box whose IoU with a higher-scoring box exceeds the threshold is treated as a duplicate and removed. After that, all the remaining boxes are checked for their confidence score and compared to the confidence threshold score to reduce the boxes into one final predicted box.
Redmon then introduces YOLOv2, which replaces the fully connected layer with anchor boxes and adds batch normalization to tackle the overfitting problem in YOLO [29]. The resolution used for detection also changes from 224×224 to 448×448. Next, Darknet-19 is introduced as the neural network framework of YOLOv2, consisting of 19 convolutional layers and 5 max-pooling layers. The accuracy of YOLOv2 increases from 63.4 to 78.6 mAP. In addition, the YOLOv3 model predicts four bounding box coordinates and uses logistic regression to predict the score for each bounding box [30]. For class prediction, YOLOv3 uses independent logistic classifiers. Finally, the Darknet53 framework is introduced to replace the Darknet19 of YOLOv2 [28]. Figure 3 shows the Darknet53 backbone architecture as a feature extractor, mainly composed of 3×3 and 1×1 filters with residual skip connections like those in ResNet.
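The IoU of Eq. 5 and the NMS step can be sketched as follows. This is a minimal version of the conventional greedy NMS, in which a box is suppressed when its IoU with an already kept, higher-scoring box exceeds the threshold; the corner format `(x1, y1, x2, y2)`, the function names, and the sample boxes are assumptions made for illustration, and the implementation used in the experiment may differ in detail.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # area of A ∩ B
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter                      # area of A ∪ B
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.6):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two boxes cover the same object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In this example the second box overlaps the first with an IoU of 81/119 (about 0.68), which is above the 0.6 threshold, so only the first and third boxes remain.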

B. Residual Network (ResNet)
ResNet was introduced in 2015 as a specific type of neural network [20]. Additional layers are usually stacked in deep neural networks to improve accuracy and performance, mostly to solve complex problems. However, with the traditional convolutional neural network model, there is a depth threshold beyond which adding layers results in more errors. ResNet uses skip connections to address the vanishing gradient problem in deep neural networks by providing alternate shortcut paths for the gradient to flow through.
The architecture in Table I starts from a plain network to which shortcut connections are added. It is mainly composed of 3×3 convolutions with fixed feature map sizes of 64, 128, 256, and 512, respectively, and bypasses the input every two convolutions. The difference between the ResNet types (ResNet-18, ResNet-50, ResNet-152) starts in the convolution 2.x layer. In ResNet-50 and ResNet-152, there are additional 1×1 convolutions in each step with different feature map sizes (256, 512, 1024, 2048). Additionally, in ResNet-152, the number of stacked convolution blocks in convolution 3.x and convolution 4.x increases to 8 and 36, respectively.
Another previous research, as shown in Table II, compares the performance of several ResNet model families [31]. That benchmark uses a minibatch size of 16 and an image size of 224×224 and runs on a GTX 1080 with 8 GB of memory. The speed shown in Table II is the total time for a forward and backward pass. Since ResNet18 has fewer layers than the others, it is relatively faster. Therefore, ResNet18 is used as the base backbone for the proposed model.
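The depth difference between the ResNet variants can be checked with a short calculation. The sketch below assumes the block counts per stage given in the original ResNet paper [20] (two 3×3 convolutions per basic block in ResNet-18, three convolutions per bottleneck block in ResNet-50 and ResNet-152); the dictionary and function names are illustrative only.

```python
# Blocks per stage (convolution 2.x to 5.x) and convolutions per block.
RESNET_STAGES = {
    "ResNet-18": ([2, 2, 2, 2], 2),    # basic blocks: two 3x3 convolutions
    "ResNet-50": ([3, 4, 6, 3], 3),    # bottleneck: 1x1, 3x3, 1x1
    "ResNet-152": ([3, 8, 36, 3], 3),  # bottleneck, deeper 3.x and 4.x
}


def weighted_layers(blocks_per_stage, convs_per_block):
    """Count weighted layers: first convolution + stages + final FC layer."""
    return 1 + sum(blocks_per_stage) * convs_per_block + 1


depths = {name: weighted_layers(*cfg) for name, cfg in RESNET_STAGES.items()}
print(depths)
```

The counts reproduce the numbers in the model names, which makes it clear why ResNet18 needs the least computation per forward and backward pass among the three.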

C. The Model
The standard YOLOv3 model is not well suited for deployment on mobile devices [27]. As a result, previous research proposes YOLOv3-Tiny, which decreases the depth of the convolutional layers [19]. The running speed significantly increases, but the detection accuracy is reduced. YOLOv3-Tiny reduces the number of convolutional layers and uses pooling layers. Then, it divides the picture into S×S grid cells and keeps only the bounding box with the best objectness score. However, the models tested on mobile devices in other studies still achieve low accuracy [23-26]. To solve this problem, ResNet18 is used since it produces the fastest speed in previous research [31].
The YOLOv3-ResNet18 model loads the ResNet18 model pre-trained on ImageNet-1k. This model acts as a feature extractor that replaces the YOLO Darknet53 backbone, as seen in Fig. 4. The method uses the pre-trained network to initialize all layers, except the top fully connected layer, whose weights are randomly initialized. As a result, the convolutional network for feature extraction is reduced to 18 layers deep. The YOLOv3 classifier is used for the classifier layer instead of the regular ResNet18 classifier layer. Then, the result continues with detection layers (scale 1, scale 2, and so on) at the same scales as the standard YOLOv3 model. The network is trained on the fish dataset to detect and classify the fish in the images.

D. Single Shot Detector (SSD)
Proposed in 2016, SSD is a model based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes, scores the class of each bounding box, and uses non-maximum suppression to produce the final bounding box [11]. SSD uses multiscale feature maps to detect objects at multiple scales, and the convolutional layers are added to the end of the truncated base network. SSD uses VGG-16 to extract feature maps from the image, followed by six convolutional layers. The Conv4_3 layer is used to detect objects. An improvement of SSD is proposed by combining SSD with the MobileNet base network [34].

E. Huawei ExeML
ExeML automates the model design, parameter tuning, training, compression, and deployment with already labeled data [35]. The process does not require the developer to code anything. It optimizes the model for use on mobile devices, producing a model with low inference time that still has decent accuracy. The developer only needs to upload the dataset and set the labels for training. Then, Huawei processes the dataset with the auto-training feature and generates the model. The model can then be deployed as an API or downloaded to be embedded on mobile devices.

F. Dataset
The dataset used in the experiment consists of 4,000 images with JPEG image encoding. The dataset is classified into four classes: Katsuwonus Pelamis, Euthynnus Affinis, Coryphaena Hippurus, and Loligo Chinensis. The examples are in Fig. 5. The image resolution varies from low-resolution images (303×166) in the Katsuwonus Pelamis class to high-resolution images (1300×1011). The image dataset is gathered by scraping the Internet and taken at local markets and harbors. The images are then split into 80% for training, 10% for validation, and 10% for evaluation.
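The 80/10/10 split can be sketched as below. This is only an illustration of the proportions: the file names, the fixed seed, and the simple shuffled (non-stratified) split are assumptions, since the text does not state how the split was performed.

```python
import random


def split_dataset(items, train_pct=80, val_pct=10, seed=0):
    """Shuffle items and split them into train, validation, and evaluation."""
    items = list(items)
    random.Random(seed).shuffle(items)        # reproducible shuffle
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    evaluation = items[n_train + n_val:]      # the remaining share
    return train, val, evaluation


# Hypothetical file names standing in for the 4,000 JPEG images.
images = [f"fish_{i:04d}.jpg" for i in range(4000)]
train_set, val_set, eval_set = split_dataset(images)
print(len(train_set), len(val_set), len(eval_set))
```

With 4,000 images, this gives 3,200 training, 400 validation, and 400 evaluation images.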

III. RESULTS AND DISCUSSION
The dataset is prepared by labeling all images. In the research, all the data labeling and training processes are conducted on the Huawei cloud using its respective labeling and training features. The specification used for training is an NVIDIA V100 32GB GPU, 8 vCPUs, and 64GB RAM. Then, the model is trained for 2,000 epochs with a batch size of 32 and a learning rate of 0.0001. Other models used to compare the YOLOv3-ResNet18 model performance are SSD-VGG and Huawei ExeML. All models are trained with the same hyperparameters. In Table III, the YOLOv3-ResNet18 model has a smaller file size than the other models. This small size is suitable for mobile devices.
Table IV reports the result of the YOLOv3-ResNet18 model. Next, the multiclass confusion matrix is used to evaluate the models. The confusion matrix is represented in 4×4 matrix form since the dataset consists of four classes. From the confusion matrix, the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) values are first calculated, and the accuracy, precision, recall, and F1 score of the evaluated models are then derived from them. Figures 6 and 7 show the confusion matrix results of the YOLOv3-ResNet18 and Huawei ExeML models. Both models have similar results in detecting and predicting the four classes. Both models give the same number of correct predictions for the Loligo Chinensis class, with 479 correct detections. Interestingly, the Huawei ExeML model has a tendency to detect the fish as Loligo Chinensis.
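The derivation of TP, TN, FP, and FN and the resulting scores from a multiclass confusion matrix can be sketched as follows. The 4×4 matrix below contains made-up toy counts, not the values of Figs. 6 and 7, and the convention that rows are actual classes and columns are predicted classes is an assumption.

```python
def per_class_metrics(cm):
    """Compute TP, FP, FN, TN and derived scores for each class.

    cm[i][j] is the number of samples of actual class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # class k, predicted as other
        fp = sum(cm[r][k] for r in range(n)) - tp   # other, predicted as class k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({
            "tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": (tp + tn) / total,
            "precision": precision, "recall": recall, "f1": f1,
        })
    return results


# Toy counts for four classes (illustrative only).
toy_cm = [
    [8, 2, 0, 0],
    [1, 9, 0, 0],
    [0, 0, 10, 0],
    [0, 0, 0, 10],
]
metrics = per_class_metrics(toy_cm)
overall_accuracy = sum(toy_cm[i][i] for i in range(4)) / 40
print(metrics[0], overall_accuracy)
```

Each class is treated as a one-vs-rest problem, so a sample of class 0 predicted as class 1 counts as an FN for class 0 and as an FP for class 1.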
Both results are used to calculate the accuracy, precision, recall, and F1 score to understand the matrices more deeply. The result shown in Table V is the evaluation of the YOLOv3-ResNet18 model. Its weakest value is the recall of Coryphaena Hippurus, at 0.92. For comparison, based on Table VI, the Huawei ExeML model has an overall accuracy of 0.97. Since the main task is to detect Katsuwonus Pelamis and Euthynnus Affinis, the YOLOv3-ResNet18 model performs better in accuracy, precision, and recall, with average scores of 0.98, 0.97, and 0.96, respectively. Meanwhile, the Huawei ExeML model achieves 0.97 accuracy, 0.96 precision, and 0.95 recall.
The test continues by deploying the model on the Huawei Cloud server as an Application Programming Interface (API). This step ensures the model can be run and deployed on mobile devices. The mobile device sends the image to the cloud server to be processed by the model. Then, the model returns the predicted bounding boxes as a result (Fig. 8). These bounding boxes are then displayed on the mobile device.
For the last evaluation process, the models that have already been downloaded are implemented on mobile devices. For this experiment, the operating system is Android, with two devices for comparison: Huawei P40 and Realme 7. For the implementation, the models are converted into the TensorFlow Lite (tflite) format before being loaded by the application. The complete specification of both processors is shown in Table VII. After the model is implemented on the mobile devices using the Huawei ML Kit library, a benchmark process is used to determine the inference time of the models. For this process, the benchmark uses the TensorFlow Lite performance measurement tools. This test produces the inference time and overall memory usage of the models. The result of the test is shown in Figs. 9 and 10.
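The idea behind an inference-time benchmark can be sketched with a generic timing harness. This is not the TensorFlow Lite measurement tool used in the experiment: `run_inference` is a hypothetical stand-in for one model invocation, and the warm-up and run counts are arbitrary choices.

```python
import statistics
import time


def benchmark(run_inference, warmup=2, runs=10):
    """Time repeated calls of run_inference and report milliseconds."""
    for _ in range(warmup):
        run_inference()                    # warm-up calls are not timed
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "mean_ms": statistics.mean(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
    }


def fake_inference():
    """Placeholder workload standing in for one model invocation."""
    return sum(i * i for i in range(10_000))


report = benchmark(fake_inference)
print(report)
```

Warm-up runs are excluded because the first invocations often include one-time costs such as model loading, which would otherwise inflate the average.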
In Fig. 9, the YOLOv3-ResNet18 model has the fastest inference time on both tested devices. Meanwhile, in Fig. 10, Huawei ExeML has the lowest memory usage on Huawei P40 but increases to 44 MB on Realme 7. Overall, YOLOv3-ResNet18 has better inference time and memory usage on both tested devices and performs better on Huawei P40.

Cite this article as: S. Liawatimena, E. Abdurachman, A. Trisetyarso, A. Wibowo, M. K. Ario, and I. S. Edbert, "Fish Classification System Using YOLOv3-ResNet18 Model for Mobile Phones", CommIT Journal 17(1), 71-79, 2023.

IV. CONCLUSION
Mobile device technology grows rapidly, and newer chipsets can handle more demanding computing tasks. As a maritime country dominated by small-scale industries with traditional boats, Indonesia needs technology to determine the type of fish captured and the total weight of the captured fish. The research conducts an experiment to find a suitable model for mobile devices. YOLOv3, known for its good accuracy and fast detection speed, combined with the ResNet18 backbone results in a smaller-size model suitable for mobile devices while maintaining good accuracy and precision. The experiment also shows that the model uses a small amount of RAM after being deployed on mobile devices and has a reasonable inference time. The experiment will be continued to make the inference time faster and to predict the weight of the captured fish.
The current experiment only compares the models on two mobile devices. Meanwhile, the mobile devices on the market have a wide variety of chipsets, so the model can perform significantly differently on other chipsets. For future experiments, the model will be deployed on various mobile devices to understand the impact of chipsets on the model inference time. In addition, more models will be compared to find the most suitable models for mobile devices.