Integrating Geospatial Big Data and Machine Learning for Village Level Rural Urban Classification: Evidence From Toba Regency
DOI:
https://doi.org/10.21512/emacsjournal.v8i1.14159
Keywords:
Village Status, Machine Learning, Random Forest, Big Data
Abstract
This study develops a data-driven framework for classifying rural and urban areas at the village level in Toba Regency by integrating official statistical data, geospatial big data, and machine learning techniques. The current regional classification still relies on the 2020 baseline and may not adequately reflect recent socio-spatial transformations at finer administrative levels. To address this limitation, the study combines Village Potential Statistics (PODES) data with spatial indicators derived from big data sources, namely population density from WorldPop and the Built-Up Index (BUI) extracted from satellite imagery. Integrating these datasets gives a more comprehensive representation of settlement patterns, spatial development intensity, and demographic distribution across villages. Three supervised machine learning algorithms were implemented in this study: Support Vector Machine (SVM), Naïve Bayes, and Random Forest. The models were evaluated using accuracy, precision, recall, and F1-score, and Random Forest performed best. Based on this model, of the 244 villages analyzed, 156 were classified as rural and 88 as urban, which represents a change in status for 47 villages compared to the previous classification. These findings indicate that integrating official statistical data with big data and machine learning methods can capture the dynamics of regional development more adaptively, and that the approach could complement existing procedures for compiling regional classifications and formulating more targeted development policies.
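The four evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `binary_metrics`, the choice of "urban" as the positive class, and the eight toy labels are all assumptions made for illustration. Only the counts 244 and 47 are taken from the abstract.

```python
def binary_metrics(y_true, y_pred, positive="urban"):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix cells, treating `positive` as the class of interest.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for eight villages (illustrative only, not study data).
y_true = ["urban", "urban", "urban", "rural", "rural", "rural", "rural", "urban"]
y_pred = ["urban", "urban", "rural", "rural", "rural", "urban", "rural", "urban"]
metrics = binary_metrics(y_true, y_pred)

# Share of villages whose status changed, using the counts in the abstract.
changed_share = 47 / 244
```

In a comparison like the one described, the same function would be applied to the held-out predictions of each classifier, and the model with the strongest scores would be used for the final village classification.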
License
Copyright (c) 2026 Meilani Thereza Br. Saragih, Nurlatifah Hartojo

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.