Implementation of Optical Character Recognition and Voice Recognition in the House of Words (HOW) Dictionary Application on the Android Platform

People who speak different languages need a translator to communicate with one another. One technology developed to meet this need is the digital dictionary, which serves as a translation tool. Although digital dictionaries can translate between languages, their weakness lies in how input is entered. In this research, Optical Character Recognition (OCR) using the Tesseract library and Voice Recognition using Google Speech-To-Text replace the conventional typed input. Based on the implementation and testing, OCR and Voice Recognition successfully recognized text and voice input, with an average similarity of 92.72% for OCR and 95.46% for Voice Recognition. The result is expected to help people who speak different languages communicate more easily.


I. INTRODUCTION
The development of technology in recent decades has helped fulfill communication needs, especially between groups of people who speak different languages. One example is the digital dictionary as a translation tool. The digital dictionary has replaced the conventional dictionary in its physical form. It supports communication between groups of people with different languages by translating not only single words but also whole sentences, without the need to look words up manually in alphabetical order.
Current digital dictionaries require the user to type the input manually on a keyboard, whereas language is by nature a system of sounds and tangible symbols [1]. The input method of the digital dictionary therefore needs to be brought in line with this natural concept of language.
Current technological development has not only helped fulfill communication needs but has also stimulated growth in the number of smartphone users. Data gathered by GSMA Intelligence and Worldometers indicate that 7.529 billion people around the world are active cellphone users, while the digital marketing research institute eMarketer estimated that Indonesia would have more than 100 million active smartphone users by 2018. The smartphone as a mobile device has opened up opportunities for solving problems in many fields through mobile applications. Therefore, to bring the digital dictionary in line with both the natural concept of language and the growth in smartphone users, a mobile digital dictionary application using Optical Character Recognition and Voice Recognition was built for the Android platform. It supports current communication needs by replacing the typed input process of existing digital dictionaries.
Optical Character Recognition (OCR) and Voice Recognition are used in the House of Words (HOW) Dictionary application. OCR is built with the Tesseract library and the Leptonica image library, while Voice Recognition is built with the Google Speech API for speech-to-text and text-to-speech, integrated with the Google Translator API. In keeping with the natural concept of language, OCR and Voice Recognition in HOW Dictionary are intended to let users scan text directly from images, including text with many special characters or symbols, and to speak the words or sentences they want translated.

Literature Review

Optical Character Recognition (OCR)
Optical Character Recognition (OCR) is the activity of reading typed characters, computer printouts, or handwritten text from ordinary documents and translating the document image into a form that a computer can process. Put simply, OCR is the process of recognizing printed, handwritten, or typewritten text and converting it from a hard-copy document into a digital document [2]. OCR is used in activities involving document images, such as book pages, legal documents, medical records, and so on.
Morphological differences between character forms can be difficult or impossible to detect, particularly in the presence of printing and scanning artifacts. OCR results may contain segmentation as well as classification errors due to low image quality [3]. The better the quality of the input data, the closer OCR performance comes to the human ability to recognize characters; with poor input, on the other hand, recognition becomes very difficult and may fail altogether [4].
OCR systems share some common steps, which are shown in Figure 1.

Speech Recognition
Speech recognition is defined as the process of converting sound signals into machine-readable language in the form of digital data (usually simple text). In other words, speech recognition is the ability to match patterns from a vocabulary of sound signals to the correct form. Another definition is the process by which a computer (or another type of machine) is able to recognize the words spoken by a human. This process is also called the interpretation of human speech by a computer. Voice recognition programs are built to recognize words from a pre-programmed vocabulary [6].

Tesseract
Tesseract is an Optical Character Recognition (OCR) engine for multiple operating systems that was originally developed by Hewlett-Packard between 1985 and 1995 [7]. Since 2006 Tesseract has been freely available; its development has been led by Google, and it is released under the Apache 2.0 license [7]. At the ICDAR conference, Smith stated that Tesseract combines a matrix matching algorithm with a feature extraction algorithm. Tesseract requires less training data and uses both a static classifier and an adaptive classifier, which enables the engine to train itself on the document being analyzed [7]. The process flow of the Tesseract engine is shown in Figure 2.

Google Speech API
Google Speech API, or Google Voice Search, was released in 2008 in the USA for certain types of smartphone. It is a framework developed by Google that recognizes speech, converts it into a string (text), and submits it to the Google search page so that results appear based on the voice input. Recognition is performed on Google's servers: the input captured by the Android device (smartphone) is sent to the Google server, which recognizes it and converts it into text using the Hidden Markov Model algorithm. The converted text is then inserted into the Google search page, and the Google server sends the search results back to the Android device [9].
The search page interface provides access to all kinds of information and services on the internet, but the user must first look through the result list to find the desired result. Voice-based search offers several advantages [10]:
1. It allows a particular application on the terminal, or a web service, to be called.
2. It uses language processing technology and classification to determine automatically which category the user's speech belongs to and to suggest the right application.
3. It combines layers that provide easy access to other applications in other categories related to the speech.

Google Translator API
Google Translate is an online multilingual translation service. Google uses Googlebot to support this service. Unlike Google Dictionary, Google Translate is able to translate a whole page of a book at a time. Google also lets the user edit the translation result according to standard language rules [11].

Java Programming Language
Java is a programming language that can run on many kinds of computers and operating systems, including mobile phones. Java was developed by Sun Microsystems and released in 1995. Java is a software technology categorized as multi-platform. In addition, Java is a platform that provides the virtual machine and libraries needed to write and run programs [12].
The advantages of the Java programming language are as follows:
1. Multi-platform: the main advantage of Java is that it can run on multiple platforms or operating systems, in line with the principle 'write once, run anywhere'.
2. Object-Oriented Programming (OOP): Java is a natively object-oriented programming language. All classes inherit from the base class called Object.
3. Library completeness: Java has a comprehensive library and a large community that constantly creates new libraries to cover the needs of application developers.
4. Automatic Garbage Collection: Java manages memory usage itself, so programmers do not need to allocate and release memory directly [12].

Requirement Definition
House of Words (HOW) Dictionary is a digital dictionary application that uses Optical Character Recognition and Voice Recognition to let the user provide input in unconventional ways. The user can scan the text contained in an image or document, or simply speak words or sentences to the application. The application processes the input and returns a translation of the words or sentences provided. The output takes two forms: translated text, which can be saved as a .txt file, and sound, in which the application speaks the translated words or sentences to the user.
To use the application, the user must provide an image or voice as input. The application captures the input: voice is first filtered through the microphone, while an image is obtained from the camera or the gallery. An image input then proceeds to the cropping process. After these steps, the user must choose the origin language of the text in the OCR feature, or both the origin language and the translation language in the Voice Recognition feature; this step requires Google Translator API integration to obtain the list of languages. In the OCR feature, before the text can be translated, the user must choose the OCR segmentation mode to use, after which the image proceeds to the image processing stage using the Tesseract engine. Both the OCR feature and the Voice Recognition feature then continue to the translation process, in which they are integrated with the Google Translator API. The result of both processes can be spoken aloud because the application is integrated with the Google Speech-To-Text and Text-To-Speech APIs.
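The OCR path of the workflow above can be summarized as a chain of three stages: recognition, translation, and speech output. The following is a minimal Java sketch of that chain. The interfaces Recognizer, Translator, and Speaker, and all parameter names, are hypothetical stand-ins for the Tesseract engine, the Google Translator API, and the Text-To-Speech engine; they are not taken from the application's source code.

```java
// Minimal sketch of the OCR-to-speech flow. All types here are hypothetical.
final class OcrTranslationFlow {

    // Stand-in for the Tesseract-based recognition step.
    interface Recognizer {
        String recognize(byte[] croppedImage, String originLanguage, int segmentationMode);
    }

    // Stand-in for the Google Translator API call.
    interface Translator {
        String translate(String text, String fromLanguage, String toLanguage);
    }

    // Stand-in for the Text-To-Speech engine.
    interface Speaker {
        void speak(String text, String language);
    }

    private final Recognizer recognizer;
    private final Translator translator;
    private final Speaker speaker;

    OcrTranslationFlow(Recognizer recognizer, Translator translator, Speaker speaker) {
        this.recognizer = recognizer;
        this.translator = translator;
        this.speaker = speaker;
    }

    // Runs the stages in the order described in the text:
    // cropped image -> recognized text -> translated text -> spoken output.
    String run(byte[] croppedImage, String originLanguage, String targetLanguage,
               int segmentationMode) {
        String recognized = recognizer.recognize(croppedImage, originLanguage, segmentationMode);
        String translated = translator.translate(recognized, originLanguage, targetLanguage);
        speaker.speak(translated, targetLanguage);
        return translated;
    }
}
```

Keeping each stage behind its own interface is one possible design choice; it would allow the recognition or translation service to be replaced without changing the flow itself.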

Application Design
An activity diagram is used to describe the workflow and the activities that run between the user and the system while the user operates the House of Words (HOW) Dictionary application. It describes how the system starts, the decisions that occur during the process, and how the system stops.
The activity diagram of the OCR feature is shown in Figure 3. The user must choose how to provide the input image, either from the device camera or from the device gallery. The input then proceeds to the cropping process, in which the user can adjust the crop to the text located in the image. Next, the user is taken to the segmentation-mode page. At this stage, the user can see the cropped image and select the segmentation mode to use in the OCR process, along with the origin language of the text in the image. Once the user chooses to process the image with the selected mode and origin language, the image is processed by the Tesseract engine. Finally, the result appears along with the other features on the use-text page.
Another feature within the OCR function is translation. The activity diagram of translation in the OCR feature is shown in Figure 4. The other main feature of the application is Voice Recognition, which the user can access from the main menu page. First, the user must choose the origin and translation languages before speaking the words or sentences. Once the words are spoken, the application captures the sound and shows the user a list of suggested words, with the most accurate one at the top of the list. Next, the user selects the entry that best matches the words or sentences they spoke.
The activity diagram of translation in the voice recognition feature is shown in Figure 6.
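The suggestion list in the Voice Recognition feature, with the most accurate entry first, can be sketched as a sort by confidence score. This is an illustration only: the Candidate type and its confidence field are hypothetical, and the application may simply rely on the order in which the speech service returns its results.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Orders recognition hypotheses so that the most confident one comes first.
final class SuggestionRanker {

    // Hypothetical value type: one recognized phrase and its confidence in [0, 1].
    static final class Candidate {
        final String text;
        final double confidence;

        Candidate(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    // Returns the candidate texts sorted by descending confidence.
    // The input list is not modified; ties keep their original order.
    static List<String> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Candidate c) -> c.confidence).reversed());
        List<String> texts = new ArrayList<>();
        for (Candidate c : sorted) {
            texts.add(c.text);
        }
        return texts;
    }
}
```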

Implementation
The main menu page contains all the main features of the application, namely the OCR feature and the Voice Recognition feature. From this page, the user can start translating with either feature. The main menu page contains four menus: the OCR menu, the Voice menu, the Guide menu, and the About menu. The main menu page is shown in Figure 8.
On the OCR page, the user provides input by tapping the camera button or the gallery button. These buttons lead to two different ways of supplying an input image. If the camera is chosen, the application activates the device camera so that the user can take a picture directly. The camera feature is shown in Figure 10. The picture is taken with the flash on to ensure there is enough brightness for the text to be readable by the Tesseract engine. After the picture is taken, the application asks the user whether to retake it, as shown in Figure 11.
The Crop Image page lets the user adjust the crop to the text in the picture just taken. In this implementation, PENDAHULUAN is the text to be translated into English. When the adjustment is finished, the user can tap the cross button or the check button. The cross button cancels the OCR process from the Crop Image page. The check button takes the user to the next page of the OCR feature, the Segmentation Mode page, shown in Figure 13. Two symbols are used to configure the OCR process: the globe symbol and the list symbol. The globe symbol contains the language list and is used to select the origin language of the cropped image, while the list symbol contains the segmentation modes used to process the image. The language list is shown in Figure 14.
After selecting the origin language and segmentation mode, the user taps the check symbol; the picture is then processed by the application, and the use-text page appears with the OCR result, as shown in Figure 16.
The Voice Recognition page appears when the user taps the Voice button on the main menu page. It contains several components: spinners, buttons, an image view, and a text view. There are two spinners: Spinner 1 lists the languages that can be used as the source language for translation, while Spinner 2 lists the languages that can be used as the destination language. The Reverse button swaps the source language in Spinner 1 with the destination language in Spinner 2. The Speech button captures voice input, which is converted into text; the converted text is then translated, and the translated text is spoken aloud by the application. The Voice Recognition page is shown in Figure 17.
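The behaviour of the Reverse button can be sketched as a simple swap of the two selected languages. The LanguagePair class below is a hypothetical model of the two spinner selections, not the application's actual code.

```java
// Hypothetical model of the selections in Spinner 1 (source) and Spinner 2 (destination).
final class LanguagePair {
    private String source;
    private String destination;

    LanguagePair(String source, String destination) {
        this.source = source;
        this.destination = destination;
    }

    // Mirrors the Reverse button: the source becomes the destination and vice versa.
    void reverse() {
        String previousSource = source;
        source = destination;
        destination = previousSource;
    }

    String source() {
        return source;
    }

    String destination() {
        return destination;
    }
}
```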

TESTING
Testing is divided into two types: OCR testing and Voice Recognition testing. OCR testing uses 40 pictures, and Voice Recognition testing uses 15 voice samples. The pictures are divided into 20 pictures with varying background and text colors, 10 pictures in different scripts (alphabets), and 10 pictures with different numbers of characters. The 15 voice samples vary in spoken language and in number of spoken words.

OCR Testing
OCR testing is divided into three types of image test: different background and font colors (Figure 20), different scripts (alphabets) (Figure 21), and different numbers of characters (Figure 22).

Voice Recognition Testing
The Google Text-To-Speech engine is used to speak every tested word in the voice recognition testing. The results of this testing are shown in Figure 23.

TESTING RESULT

OCR Testing Result
The similarity percentage is obtained from the Levenshtein distance, which is used to calculate the similarity between the actual text in the image and the OCR result. All similarity percentages are then averaged using the arithmetic mean. The arithmetic mean, or simply the mean, of a set of $N$ numbers $X_1, X_2, \ldots, X_N$ is denoted by $\bar{X}$ and defined as [13]:
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{\sum_{j=1}^{N} X_j}{N}$$
The results of the OCR similarity test are shown in Figure 24. Three of the 20 background-color images could not be read: two because of the green background used in the image, and one because of the yellow font color. In the test with different background and font colors, the fuchsia background used in one of the input images gave the minimum similarity, only 3%. The test with different numbers of characters had the second-highest similarity, at 98%. According to the test data, one of the outputs, for an image containing an 8-character string in Japanese, had 78% similarity with its input. The highest similarity was obtained in the test using nine different languages with 10 test images. All the percentages obtained in OCR similarity testing were then averaged with the same formula, the arithmetic mean, to obtain the overall similarity; the result is shown in Figure 25.
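The Levenshtein-based similarity and its averaging can be sketched in Java as follows. The paper does not state the exact formula that turns a Levenshtein distance into a percentage, so the normalization by the length of the longer string is an assumption, and the class and method names are illustrative only.

```java
import java.util.List;

// Sketch of a Levenshtein-based similarity score and its arithmetic mean.
final class SimilarityScore {

    // Levenshtein distance: the minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    // Characters are compared as UTF-16 code units.
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                current[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    // ASSUMPTION: similarity = (1 - distance / length of the longer string) * 100.
    static double similarityPercent(String expected, String actual) {
        int longest = Math.max(expected.length(), actual.length());
        if (longest == 0) {
            return 100.0; // two empty strings are identical
        }
        return (1.0 - (double) levenshtein(expected, actual) / longest) * 100.0;
    }

    // Arithmetic mean of the per-image (or per-voice) similarity percentages.
    static double mean(List<Double> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("mean of an empty list is undefined");
        }
        double sum = 0.0;
        for (double value : values) {
            sum += value;
        }
        return sum / values.size();
    }
}
```

Under this assumed normalization, similarityPercent("abcd", "abcf") yields 75.0, because one substitution is needed across four characters.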

Voice Recognition Test Result
According to the tests performed, the average voice recognition similarity is 95%. One of the tested voice samples, in French, reached only 42%, and another, in Russian, reached 90%. The overall voice recognition result is shown in Figure 26.

CONCLUSION
Based on the implementation and testing carried out in this research, the following conclusions can be drawn:
1. Optical character recognition and voice recognition were successfully implemented in the House of Words (HOW) Dictionary application on the Android platform, with an average similarity between OCR input and output of 92.72% and an average Voice Recognition similarity of 95.46%, obtained from the Levenshtein distance and averaged with the arithmetic mean.
2. Tesseract can recognize characters when the background color is not dominant, Google Speech-To-Text can recognize voice input, and Google Text-To-Speech can speak the result.

FUTURE WORKS
Suggestions for further development and refinement of this Optical Character Recognition and Voice Recognition research are as follows:
1. Add more recent and sophisticated image processing algorithms to the Optical Character Recognition feature to improve input image quality and reduce noise, so that all colors and alphabets can be read.
2. Add voice filtering technology or algorithms to improve voice quality in the voice recognition feature.
3. Add more language options to the voice recognition feature.
4. Create an offline version on other platforms.