Integrating Geospatial Big Data and Machine Learning for Village Level Rural Urban Classification: Evidence From Toba Regency

Authors

  • Meilani Thereza Br. Saragih, BPS-Statistics Toba Regency
  • Nurlatifah Hartojo, Education and Training Centre, Statistics Indonesia, https://orcid.org/0000-0003-4973-3849

DOI:

https://doi.org/10.21512/emacsjournal.v8i1.14159

Keywords:

Village Status, Machine Learning, Random Forest, Big Data

Abstract

This study develops a data-driven framework for classifying rural and urban areas at the village level in Toba Regency by integrating official statistical data, geospatial big data, and machine learning techniques. The current regional classification still relies on the 2020 baseline and may not adequately reflect recent socio-spatial transformations at finer administrative levels. To address this limitation, the study combines Village Potential Statistics (PODES) data with spatial indicators derived from big data sources, namely population density from WorldPop and the Built-Up Index (BUI) extracted from satellite imagery. Integrating these datasets gives a more comprehensive representation of settlement patterns, spatial development intensity, and demographic distribution across villages. Three supervised machine learning algorithms were applied in this study: Support Vector Machine (SVM), Naïve Bayes, and Random Forest. The models were evaluated using accuracy, precision, recall, and F1-score, and Random Forest performed best. Under this best model, of the 244 villages analyzed, 156 were classified as rural and 88 as urban, which represents a change in status for 47 villages compared with the previous classification. These findings suggest that integrating official statistical data with big data and machine learning methods can capture the dynamics of regional development more adaptively, and that the approach could complement existing procedures for compiling regional classifications and formulating more targeted development policies.
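The four evaluation metrics named in the abstract, and the count of villages whose status differs from an earlier classification, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the choice of "urban" as the positive class, and the toy labels are all assumptions made for illustration, and the numbers do not come from the study.

```python
from typing import Dict, Sequence


def classification_metrics(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str = "urban"
) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for a binary rural/urban task.

    Treating "urban" as the positive class is an assumption of this sketch.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one observation is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def count_status_changes(previous: Sequence[str], predicted: Sequence[str]) -> int:
    """Number of villages whose predicted status differs from the previous one."""
    if len(previous) != len(predicted):
        raise ValueError("both label sequences must have the same length")
    return sum(a != b for a, b in zip(previous, predicted))


# Toy labels invented for illustration only (not the study's data).
reference = ["urban", "urban", "rural", "rural", "rural", "urban", "rural", "rural"]
predicted = ["urban", "rural", "rural", "urban", "rural", "urban", "rural", "rural"]

metrics = classification_metrics(reference, predicted)
changes = count_status_changes(reference, predicted)
print(metrics, changes)
```

In practice the same metrics would be computed on held-out villages for each of the three candidate models, and the status-change count would compare the best model's predictions with the 2020 baseline labels.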




Published

2026-01-31

How to Cite

Br. Saragih, M. T., & Hartojo, N. (2026). Integrating Geospatial Big Data and Machine Learning for Village Level Rural Urban Classification: Evidence From Toba Regency. Engineering, MAthematics and Computer Science Journal (EMACS), 8(1), 27–36. https://doi.org/10.21512/emacsjournal.v8i1.14159