Deep Transfer Learning for Sign Language Image Classification: A Bisindo Dataset Study

This study aims to identify and categorize the BISINDO sign language dataset, which consists primarily of image data. Deep learning techniques are used with three pre-trained models: ResNet50 for training, MobileNetV4 for validation, and InceptionV3 for testing. The primary objective is to evaluate and compare the performance of each model based on the loss obtained during training. The training success rate gives a rough indication of how well the ResNet50 model has learned the BISINDO dataset, while the validation loss measured with MobileNetV4 indicates the model's ability to generalize. The test loss evaluated with InceptionV3 serves as the final check of performance, measuring the ability to classify unseen sign language images. The results of these experiments determine which model is most effective and achieves the highest performance in sign language recognition on the BISINDO dataset.


I. INTRODUCTION
Communication plays an essential part in social interactions, as it enables individuals or groups to transmit information and establish connections with their surroundings and other people. Communication often involves linguistic expression, either oral or written. However, people with physical impairments may resort to nonverbal means of communication, such as sign language, to convey their messages. In March 2022, the number of individuals who are deaf and speech impaired in Indonesia reached 19,392 people, which is equivalent to 9.14% of the total number of people with disabilities in Indonesia (Arisandi & Satya, 2022).
Sign language is used to overcome limitations associated with deafness and speech impairment. In Indonesia, two distinct sign languages are commonly used: Indonesian Sign Language (BISINDO) and the Indonesian Sign Language System (SIBI) (Arisandi & Satya, 2022). BISINDO was established by the deaf community and exhibits regional variations that reflect the many origins of that community. Consequently, BISINDO is regarded as a versatile sign language system (Pusbisindo, 2023).
The usage of SIBI is challenging for individuals who are deaf or have speech impairments because it adheres to grammatical rules and is derived from American Sign Language (ASL). SIBI is only used to communicate in formal settings, such as school activities (Handhika et al., 2018). BISINDO consists of an alphabet of 26 characters, the letters A to Z. The characters C, E, I, J, L, O, R, U, V, and Z are formed with one hand, while A, B, D, F, G, H, K, M, N, P, Q, S, T, W, X, and Y are formed with two hands, as shown in Figure 1 (Bestari, 2018). In a comparison of responses to the use of BISINDO and SIBI as sign languages in Indonesia (Mursita, 2015), out of 100 respondents, 91% of deaf respondents preferred BISINDO because they found SIBI more difficult to use, and only 9% used SIBI. BISINDO enriches expression, which can liven up the atmosphere, makes it easier to connect with many friends, and removes barriers to communication (Pusbisindo, 2023).
Artificial Intelligence (AI) involves the utilization of computational methods to replicate human cognitive processes like interpretation, reasoning, decision-making, estimation, and categorization. This multidisciplinary field incorporates knowledge from diverse scientific domains such as mathematics, biology, genetics, engineering, and computer science. The primary aim of AI research is to empower computers to swiftly, effectively, accurately, and efficiently execute tasks that are typically associated with human cognition. Unlike traditional software, AI techniques can manage incomplete and uncertain data by establishing meaningful connections among data points, allowing for inferences about past occurrences and predictions concerning future outcomes. Besides its applications in design and engineering disciplines, AI is now extensively employed in various sectors, including engineering education, healthcare, transportation, economics, law, and manufacturing (Khaleel et al., 2023).
Machine Learning is a subset of AI that emphasizes the development of algorithms and statistical models that enable computers to learn and improve their performance on specific tasks through experience and data input (Fauzi et al., 2023). Deep Learning is a specialized branch of machine learning that involves artificial neural networks, specifically deep neural networks with multiple layers, allowing it to automatically discover intricate patterns and representations in data, making it particularly effective for tasks like image and speech recognition (Fadlilah et al., 2021).
Transfer learning is a technique for machine learning in which a model trained on one task or dataset is repurposed or fine-tuned to perform a different but related task or work with a different dataset. Transfer learning is a technique that capitalizes on the knowledge and features acquired during the initial training of a neural network or model, thereby enabling a substantial decrease in training time and resource demands. This methodology is especially advantageous in situations where there is a scarcity of data accessible for the specific objective, as it allows the model to leverage the insights acquired from a more extensive, interconnected dataset, hence enhancing its efficacy in the novel task (Susanty et al., 2021).
Deep transfer learning is a machine learning method that combines the capabilities of deep neural networks with transfer learning. It involves modifying a pre-existing deep learning model to fit a specific task, utilizing the hierarchical characteristics and representations from the original training. This approach is particularly useful in situations with limited data for the target task, enhancing model performance by applying knowledge from a source task to the target task. It has gained significant popularity in computer vision and natural language processing (Toengi, 2018).
The initial approach centers on the identification of sign language signs presented through visual media such as photographs or videos. This methodology involves the utilization of deep learning models, namely convolutional neural networks (CNNs), to discern visual patterns within images or video frames that correspond to sign language signs. The model has undergone training using a dataset comprising photographs of sign language signs that are suitable for multiple sign languages. Consequently, it possesses the ability to accurately identify and interpret these signs, generating corresponding textual or auditory representations (Triwijoyo et al., 2023).
The second approach places greater emphasis on the recognition of hand motions, which serve as the primary element in sign language. Deep learning is employed for the purpose of discerning and comprehending the gestures and postures of the hands that are utilized in the act of communicating through sign language. Deep learning models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have the capability to see and evaluate hand movements within video or picture data, then convert them into suitable textual or auditory representations (Li et al., 2019).
Both methodologies exhibit significant potential for enhancing accessibility for those with impairments who rely on sign language as their primary mode of communication. Deep learning enables systems to autonomously acquire knowledge from data, hence enhancing the model's proficiency in accurately detecting and understanding sign language as the volume of available data increases (Indra et al., 2019).
Wadhawan and Kumar (2020) examine the use of deep learning-based convolutional neural networks (CNNs) for the robust modelling of static signs in sign language recognition. The study accumulates 35,000 sign images from a variety of users and assesses the performance of the proposed system on 50 CNN models. The system obtains the highest training accuracy of 99.72% and 99.90% on colored and grayscale images, respectively, demonstrating its effectiveness over earlier works.
Bantupalli and Xie (2019) seek to develop a vision-based application that provides sign language to text translation, thereby facilitating communication between signers and non-signers. The model derives temporal and spatial features from video sequences, utilizing Inception for spatial recognition and a recurrent neural network for temporal training.
InceptionV3 excels at efficiently capturing multiscale features due to its inception modules with multiple filter sizes (1x1, 3x3, and 5x5). It is computationally efficient, reduces parameters with global average pooling, and provides pre-trained weights for effective transfer learning, which makes it well suited to image classification tasks. One study based on InceptionV3 utilizes a dataset of 36 English characters and digits and achieves 90% accuracy using American Sign Language data. The modified InceptionV3 architecture is reported to outperform earlier research, with a figure of 99.81% (Hasan et al., 2020).
The ResNet50 model effectively addresses the issue of vanishing gradients, enabling the training of deeper models and consequently achieving superior accuracy in tasks related to image recognition. In a study on disease and pest identification, the ResNet50 model demonstrated recognition accuracy ranging from 88.38% to 93.88% for diseases and from 95.38% to 98.42% for pests (Yin et al., 2020).
EfficientNet was used to recognize facial expressions using transfer learning, and it obtained high facial expression recognition accuracy, namely 99.24% on CK+ with the 10% sampling used (Alam et al., 2022).
The main aim of this study is to perform a comparative analysis of several pre-trained models, namely MobileNetV4, EfficientNetB1, ResNet50, and InceptionV3, on the BISINDO dataset in order to identify the model that achieves the highest accuracy and fastest performance in the classification of sign language images.

II. METHODS
Figure 2 shows the method steps used in this research, which focuses on sign language image classification using the BISINDO dataset and the transfer learning approach.

Collecting Dataset
The dataset was obtained from a Kaggle public dataset (Noer, 2021). It contains portraits of the letters A-Z at a scale of 1:1 with three different backgrounds (plain white shirt, white wall, and white shirt with dots). The photos were taken from a front-view perspective with a distance of 70 cm between the object and the camera lens. Each background was photographed with 4 photos for each letter, and the dataset contains a total of 1,727 photos of the letters A-Z. An example of the sign language dataset is shown in Figure 3.

Pre-Processing Dataset
Preprocessing was applied to the dataset before training, validation, and testing. It consisted of rescaling pixel values to [0, 1], rotating images to introduce variation, horizontally flipping images for data augmentation, applying shear transformations to account for geometric distortions, and setting the fill mode for transformed regions. The dataset was then divided into three subsets: the training set, which contains 70% of the data and is used to train the deep learning model; the validation set, which contains 20% and is used to monitor model performance during training and hyperparameter tuning; and the testing set, which contains 10% and is used to evaluate the model's generalization and final performance. These preprocessing and dataset division methods create a stable and well-structured experimental design for sign language image classification, deep learning model construction, and evaluation.
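The 70/20/10 division described above can be sketched as follows. This is a minimal illustration only: the random shuffle, the fixed seed, the function name `split_dataset`, and the placeholder file names are all assumptions, since the text does not state how the images were assigned to each subset.

```python
import random

def split_dataset(items, train_frac=0.7, val_frac=0.2, seed=42):
    """Shuffle items and split them into train/validation/test lists."""
    shuffled = list(items)
    # Assumption: a seeded random shuffle; the paper does not specify one.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, roughly 10%
    return train, val, test

# Hypothetical file names standing in for the 1,727 photos.
paths = [f"img_{i:04d}.jpg" for i in range(1727)]
train, val, test = split_dataset(paths)
print(len(train), len(val), len(test))  # 1208 345 174
```

Because the test subset takes whatever remains after the first two slices, every image lands in exactly one subset even when the fractions do not divide the dataset evenly.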

Pre-Trained Model
This research utilizes four pre-trained deep learning models, MobileNetV4, EfficientNetB1, ResNet50, and InceptionV3, in order to leverage their extensive prior knowledge and feature extraction capabilities. These models, which were initially trained on vast datasets such as ImageNet, are proficient at recognizing intricate patterns and high-level features in images. By fine-tuning these pretrained models on our pre-processed sign language image dataset, we hope to exploit their transfer learning potential, enabling our system to recognize and classify sign language with greater accuracy and efficiency, even when working with limited data specific to our intended task.

Trained Model
The trained models, which include MobileNetV4, EfficientNetB1, ResNet50, and InceptionV3, are the result of fine-tuning the pre-trained models on our pre-processed sign language image dataset. Adapting these pre-trained models to sign language recognition exploits their prior knowledge and feature extraction capabilities to produce classifiers for BISINDO sign language. The fine-tuned models learn to recognize and interpret the distinctive visual signals within sign language images, including subtle details that distinguish one sign from another.

Evaluation & Test Model
The evaluation and testing phase involves a rigorous assessment of the trained models' performance in sign language image classification. These models, including MobileNetV4, EfficientNetB1, ResNet50, and InceptionV3, are evaluated on our dedicated testing subset. The purpose of these tests is to evaluate the models' ability to generalize and accurately classify unseen BISINDO sign language images, thereby assessing their real-world applicability through analysis of accuracy and loss metrics.
Accuracy is determined by dividing the number of correct predictions (i.e., accurate classifications) by the total number of data samples examined (Developer Google, 2022). The categorical cross entropy loss function is used in image classification tasks to measure the difference between the predicted probability distribution and the actual class labels. The loss is the negative logarithm of the probability that the model assigns to the correct class, so minimizing it pushes the model to assign higher probability to the correct label; it is particularly effective in multi-class classification problems.
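The two metrics described above can be illustrated with a short sketch. The function names, the integer class labels, and the small `eps` guard against taking the logarithm of zero are assumptions made for illustration; deep learning frameworks compute the same quantities over batches.

```python
import math

def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def categorical_cross_entropy(true_class, probs, eps=1e-12):
    """Negative log of the probability assigned to the correct class.

    With a one-hot target, the general sum over classes reduces to
    this single term.
    """
    return -math.log(max(probs[true_class], eps))

# Three of four predictions are correct.
print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))  # 0.75
# A confident correct prediction yields a small loss.
print(round(categorical_cross_entropy(0, [0.9, 0.05, 0.05]), 4))  # 0.1054
```

The loss grows as the probability assigned to the correct class shrinks, which is why a model can have reasonable accuracy but still show a high loss when its correct predictions are made with low confidence.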

III. RESULTS AND DISCUSSION
In this section, we present a detailed analysis of the outcomes obtained through our experiments, covering the performance metrics, including accuracy and loss, across a range of pretrained models. These findings are examined within the broader context of our research, allowing us to draw insights into the effectiveness of these models for sign language image classification.
To provide context for our analysis, we initially divided the dataset into training, validation, and testing sets, followed by preprocessing. Subsequently, pretrained models, namely ResNet50, MobileNetV4, EfficientNetB1, and InceptionV3, were downloaded with the 'include_top=False' option, omitting the top classification layers. These models were then compiled using the Adam optimizer and the categorical cross entropy loss function, with accuracy as the evaluation metric. Finally, we executed the training process for 25 epochs to fine-tune the pretrained models specifically for the sign language image classification task.
In Table I and Figure 3, we can observe the results of the model training process. ResNet50 demonstrated exceptional performance, achieving a training accuracy of 99% and maintaining a high validation accuracy of 99%, with a slightly lower but still robust test accuracy of 98%. In contrast, EfficientNetB1 encountered challenges throughout the training process, resulting in a low training accuracy of 8%, an even lower validation accuracy of 3%, and the lowest test accuracy at 1%. On the other hand, MobileNetV4 delivered outstanding results with a training accuracy of 98%, a validation accuracy of 100%, and a strong test accuracy of 99%. InceptionV3 also displayed strong performance, achieving a training accuracy of 96%, closely resembling ResNet50 with a 99% validation accuracy and a solid test accuracy of 99%. These findings underscore the varying effectiveness of pretrained models within the context of the classification task, with ResNet50 and MobileNetV4 emerging as the top-performing models.

Figure 2. The Proposed Method

Figure 3. The Example of Sign Language Dataset

Figure 3. Accuracy History of Model Training

Figure 4.

Table I. Training Accuracy Results

In Table II and Figure 4, we can observe the performance of different pretrained models throughout the training and testing phases. Notably, ResNet50 achieved the lowest reported training loss, approximately 0.03, with a validation loss of around 0.06 and a test loss of 0.06. Conversely, EfficientNetB1 struggled during training, exhibiting notably high loss values with a training loss of about 3.40, a validation loss of approximately 3.34, and a test loss of 3.33. InceptionV3 also displayed strong performance, with a training loss of roughly 0.12, a validation loss of approximately 0.01, and a test loss of 0.01. These loss values offer insight into how well these pretrained models generalize from the training data to the testing data, where lower loss values indicate better classification performance.
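As a rough sanity check on the EfficientNetB1 figures above, it can help to compare them with the values expected from a classifier that assigns equal probability to each of the 26 letters. The sketch below computes that baseline; it is an illustrative calculation under the assumption of 26 equally likely classes, not an analysis taken from the experiments.

```python
import math

NUM_CLASSES = 26  # letters A-Z, per the dataset description

# A uniform guesser is right 1 time in 26 on a balanced dataset.
chance_accuracy = 1 / NUM_CLASSES
# Categorical cross entropy when every class receives probability 1/26.
chance_loss = -math.log(1 / NUM_CLASSES)

print(f"{chance_accuracy:.3f} {chance_loss:.3f}")  # 0.038 3.258
```

The reported EfficientNetB1 losses of about 3.3 to 3.4 sit close to this baseline of roughly 3.26, which may suggest that the model remained near chance level rather than learning the task.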

Table II. Training Loss Results