Indonesian-English Textual Similarity Detection Using Universal Sentence Encoder (USE) and Facebook AI Similarity Search (FAISS)

Authors

  • Lucia D. Krisnawati, Universitas Kristen Duta Wacana
  • Aditya W. Mahastama, Universitas Kristen Duta Wacana
  • Su-Cheng Haw, Multimedia University
  • Kok-Why Ng, Multimedia University
  • Palanichamy Naveen, Multimedia University

DOI:

https://doi.org/10.21512/commit.v18i2.11274

Keywords:

Textual Similarity Detection, Universal Sentence Encoder (USE), Facebook AI Similarity Search (FAISS)

Abstract

Rapid progress in Natural Language Processing (NLP) has enabled the detection of bilingual and multilingual textual similarity. One of the main challenges for a Textual Similarity Detection (TSD) system lies in learning effective text representations. The research focuses on identifying similar texts between Indonesian and English across a broad spectrum of semantic similarity. The primary challenge is generating dense vector representations (embeddings) of English and Indonesian texts that share a single vector space. Through trial and error, the research proposes using the Universal Sentence Encoder (USE) model to construct bilingual embeddings and FAISS to index the bilingual dataset. Query vectors are compared with index vectors using two approaches: a heuristic comparison with Euclidean distance and a clustering algorithm, Approximate Nearest Neighbors (ANN). The system is tested with four semantic granularities, two text granularities, and evaluation metrics with a cutoff value of k = {2, 10}. The four semantic granularities are highly similar (near duplicate), Semantic Entailment (SE), Topically Related (TR), and Out of Topic (OOT), while the text granularities are the sentence and paragraph levels. The experimental results demonstrate that the proposed system successfully ranks similar texts in different languages within the top ten. This is evidenced by the highest F1@2 score of 0.96 for the near-duplicate category at the sentence level. For the SE and TR categories, by contrast, the highest F1 scores are 0.77 and 0.89, respectively. The experimental results also show a high correlation between text and semantic granularity.
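
To make the retrieval and evaluation steps described in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the texts have already been embedded into a shared vector space, ranks index vectors by exact Euclidean distance (a brute-force stand-in for what a flat FAISS index would compute), and scores the ranking with one common definition of F1@k. The toy 3-dimensional vectors, the document ids, and the function names `l2_rank` and `f1_at_k` are all hypothetical; the paper's exact metric definition may differ.

```python
import math


def l2_rank(query, index_vectors, k):
    """Return the ids of the k index vectors closest to the query (Euclidean distance)."""
    scored = [(math.dist(query, vec), doc_id) for doc_id, vec in index_vectors.items()]
    scored.sort()  # ascending distance: most similar first
    return [doc_id for _, doc_id in scored[:k]]


def f1_at_k(ranked_ids, relevant_ids, k):
    """F1@k: harmonic mean of precision@k and recall@k (assumed definition)."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    if hits == 0:
        return 0.0
    precision = hits / k
    recall = hits / len(relevant_ids)
    return 2 * precision * recall / (precision + recall)


# Invented toy vectors standing in for bilingual sentence embeddings.
index_vectors = {
    "en_1": [0.9, 0.1, 0.0],
    "en_2": [0.8, 0.2, 0.1],
    "en_3": [0.0, 0.1, 0.9],
    "en_4": [0.1, 0.9, 0.2],
}
query_vector = [0.88, 0.12, 0.02]  # stand-in for an embedded Indonesian query
relevant = {"en_1", "en_2"}        # hypothetical gold-standard matches

ranked = l2_rank(query_vector, index_vectors, k=2)
score_at_2 = f1_at_k(ranked, relevant, k=2)
score_at_1 = f1_at_k(ranked, relevant, k=1)
print(ranked, score_at_2, round(score_at_1, 3))
```

In a real system, the brute-force loop would be replaced by an index structure so that search stays fast on a large bilingual collection; the evaluation function is independent of how the ranking is produced.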

Author Biographies

Lucia D. Krisnawati, Universitas Kristen Duta Wacana

Informatics Department, Faculty of Information Technology

Aditya W. Mahastama, Universitas Kristen Duta Wacana

Informatics Department, Faculty of Information Technology

Su-Cheng Haw, Multimedia University

Faculty of Computing and Informatics

Kok-Why Ng, Multimedia University

Faculty of Computing and Informatics

Palanichamy Naveen, Multimedia University

Faculty of Computing and Informatics

Published

2024-09-10