Indonesian-English Textual Similarity Detection Using Universal Sentence Encoder (USE) and Facebook AI Similarity Search (FAISS)

Lucia D. Krisnawati; Aditya W. Mahastama; Su-Cheng Haw; Kok-Why Ng; Palanichamy Naveen

doi:10.21512/commit.v18i2.11274

Authors

Lucia D. Krisnawati, Universitas Kristen Duta Wacana
Aditya W. Mahastama, Universitas Kristen Duta Wacana
Su-Cheng Haw, Multimedia University
Kok-Why Ng, Multimedia University
Palanichamy Naveen, Multimedia University


Keywords:

Textual Similarity Detection, Universal Sentence Encoder (USE), Facebook AI Similarity Search (FAISS)

Abstract

The tremendous development in Natural Language Processing (NLP) has enabled the detection of bilingual and multilingual textual similarity. One of the main challenges for a Textual Similarity Detection (TSD) system lies in learning an effective text representation. This research focuses on identifying similar texts between Indonesian and English across a broad spectrum of semantic similarity. The primary challenge is generating dense vector representations (embeddings) of English and Indonesian text that share a single vector space. Through trial and error, the research proposes using the Universal Sentence Encoder (USE) model to construct bilingual embeddings and Facebook AI Similarity Search (FAISS) to index the bilingual dataset. Query vectors are compared with index vectors using two approaches: a heuristic comparison based on Euclidean distance and a clustering algorithm, Approximate Nearest Neighbors (ANN). The system is tested at four semantic granularities and two text granularities, with evaluation metrics computed at cutoff values of k = {2, 10}. The four semantic granularities are highly similar (near duplicate), Semantic Entailment (SE), Topically Related (TR), and Out of Topic (OOT); the two text granularities are the sentence and paragraph levels. The experimental results demonstrate that the proposed system successfully ranks similar texts in different languages within the top ten. This is evidenced by the highest F1@2 score of 0.96 for the near-duplicate category at the sentence level. Unlike the near-duplicate category, the SE and TR categories reach their highest F1 scores of 0.77 and 0.89, respectively. The experimental results also show a high correlation between text granularity and semantic granularity.
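
The retrieval and scoring steps summarized in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the three-dimensional vectors are hypothetical stand-ins for USE embeddings, the exhaustive Euclidean search is a pure-Python stand-in for a FAISS index lookup, and the function names (`l2_distance`, `search`, `f1_at_k`) and the F1@k definition used here (precision over the retrieved top-k, recall over all relevant items) are assumptions for illustration only.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def search(index_vectors, query, k):
    """Exhaustive search: ids of the k index vectors closest to the query."""
    ranked = sorted(
        range(len(index_vectors)),
        key=lambda i: l2_distance(index_vectors[i], query),
    )
    return ranked[:k]


def f1_at_k(retrieved, relevant, k):
    """F1 over the top-k retrieved ids (assumed definition, see lead-in)."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy "embeddings"; real USE vectors have far more dimensions.
index_vectors = [
    [0.9, 0.1, 0.0],  # id 0: e.g. an Indonesian sentence
    [0.8, 0.2, 0.1],  # id 1: e.g. a close paraphrase of id 0
    [0.0, 1.0, 0.0],  # id 2: unrelated text
    [0.1, 0.0, 0.9],  # id 3: unrelated text
]
query = [1.0, 0.0, 0.0]  # stand-in for the embedded English query
relevant = {0, 1}        # ids judged similar to the query

top2 = search(index_vectors, query, k=2)
score = f1_at_k(top2, relevant, k=2)
print(top2, score)
```

In a real pipeline the sentences in both languages would be embedded by the same multilingual encoder so that their vectors are directly comparable, and the sorted scan would be replaced by an index structure that scales to a large collection.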


Author Biographies

Lucia D. Krisnawati, Universitas Kristen Duta Wacana

Informatics Department, Faculty of Information Technology

Aditya W. Mahastama, Universitas Kristen Duta Wacana

Informatics Department, Faculty of Information Technology

Su-Cheng Haw, Multimedia University

Faculty of Computing and Informatics

Kok-Why Ng, Multimedia University

Faculty of Computing and Informatics

Palanichamy Naveen, Multimedia University

Faculty of Computing and Informatics
