Indonesian-English Textual Similarity Detection Using Universal Sentence Encoder (USE) and Facebook AI Similarity Search (FAISS)
DOI:
https://doi.org/10.21512/commit.v18i2.11274
Keywords:
Textual Similarity Detection, Universal Sentence Encoder (USE), Facebook AI Similarity Search (FAISS)
Abstract
Rapid progress in Natural Language Processing (NLP) has enabled the detection of bilingual and multilingual textual similarity. One of the main challenges for a Textual Similarity Detection (TSD) system lies in learning an effective text representation. The research focuses on identifying similar texts between Indonesian and English across a broad spectrum of semantic similarity. The primary challenge is generating dense vector representations (embeddings) of English and Indonesian text that share a single vector space. Through trial and error, the research proposes using the Universal Sentence Encoder (USE) model to construct bilingual embeddings and FAISS to index the bilingual dataset. Query vectors are compared with index vectors using two approaches: a heuristic comparison with Euclidean distance and a clustering algorithm, Approximate Nearest Neighbors (ANN). The system is tested with four semantic granularities, two text granularities, and evaluation metrics with cutoff values of k = {2, 10}. The four semantic granularities are near duplicate (highly similar), Semantic Entailment (SE), Topically Related (TR), and Out of Topic (OOT), while the two text granularities are the sentence and paragraph levels. The experimental results demonstrate that the proposed system successfully ranks similar texts in different languages within the top ten. This is evidenced by the highest F1@2 score of 0.96, obtained for the near-duplicate category at the sentence level. By contrast, the highest F1 scores for the SE and TR categories are 0.77 and 0.89, respectively. The experimental results also show a high correlation between text granularity and semantic granularity.
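The ranking and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes embeddings are already available as NumPy arrays, uses an exact Euclidean search in place of a FAISS index, and assumes that F1@k is the harmonic mean of precision@k and recall@k, which the abstract does not spell out. The function names `top_k_euclidean` and `f1_at_k` and the toy vectors are hypothetical.

```python
import numpy as np


def top_k_euclidean(index_vecs, query_vec, k):
    """Return the positions of the k index vectors nearest to query_vec (exact L2 search)."""
    dists = np.linalg.norm(index_vecs - query_vec, axis=1)
    return np.argsort(dists, kind="stable")[:k].tolist()


def f1_at_k(ranked_ids, relevant_ids, k):
    """F1 at cutoff k, assumed here to be the harmonic mean of precision@k and recall@k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    if hits == 0:
        return 0.0
    precision = hits / k
    recall = hits / len(relevant_ids)
    return 2 * precision * recall / (precision + recall)


# Toy 3-dimensional stand-ins for sentence embeddings in a shared vector space
# (hypothetical values, not taken from the paper).
index_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([1.0, 0.05, 0.0])
relevant = {0, 1}  # index entries judged similar to the query

ranked = top_k_euclidean(index_vecs, query_vec, k=2)
score = f1_at_k(ranked, relevant, k=2)
print(ranked, score)
```

In a larger collection, the exhaustive distance computation would be replaced by an index structure such as those FAISS provides; the scoring function stays the same.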
License
Copyright (c) 2024 Lucia D. Krisnawati, Aditya W. Mahastama, Su-Cheng Haw, Kok-Why Ng, Palanichamy Naveen
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.