Discovering the Optimal Number of Crime Cluster Using Elbow, Silhouette, Gap Statistics, and NbClust Methods

Noviyanti T. M. Sagala; Alexander Agung Santoso Gunawan

doi:10.21512/comtech.v13i1.7270

Authors

Noviyanti T. M. Sagala, Bina Nusantara University
Alexander Agung Santoso Gunawan, Bina Nusantara University


Keywords:

Crime Clustering, Elbow method, Silhouette method, Gap Statistics method, NbClust method

Abstract

In recent years, analyzing and tracking crime has become critical for identifying trends and associations in crime patterns and activities. Such analysis is generally conducted to discover the areas or locations where crime is high or low, using clustering methods such as k-means. Although the k-means algorithm is widely used because of its simplicity, convergence speed, and high efficiency, finding the optimal number of clusters is difficult. Determining the correct number of clusters for crime analysis is critical to improving current crime resolution rates, avoiding future incidents, saving time for new officers, and improving activity quality. To estimate the number of clusters in the crime domain without human intervention, the research applied the Elbow, Silhouette, Gap Statistics, and NbClust methods to the Major Crime Indicators (MCI) datasets for 2014–2019. The crime datasets were processed in four stages: data understanding, data preparation, cluster modelling, and cluster validation. The first two stages were performed in the RStudio environment and the last two in Azure Studio. In the experimental results, the Elbow, Silhouette, and NbClust methods suggest the same optimal number of clusters: two. After validating the result with the average Silhouette method, the research considers two clusters the best choice for the dataset. The Silhouette visualization shows a value of 0.73, which indicates that the observations are well grouped and placed in the correct clusters.
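As a rough illustration of two of the methods named in the abstract, the sketch below computes the within-cluster sum of squares (the quantity inspected in the Elbow method) and the average silhouette width for several values of k. It is not the authors' R and Azure pipeline: the data are synthetic two-dimensional points rather than the MCI dataset, the helper names (`kmeans`, `wss`, `average_silhouette`) are hypothetical, and the Gap Statistics and NbClust methods are left out.

```python
import math
import random


def wss(points, labels, centroids):
    """Total within-cluster sum of squares (the curve inspected in the Elbow method)."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


def kmeans(points, k, restarts=5, iters=100, seed=0):
    """Plain Lloyd's k-means with random restarts; returns (wss, labels, centroids)."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centroids = rng.sample(points, k)  # initialise from k distinct data points
        for _ in range(iters):
            # Assignment step: each point goes to its nearest centroid.
            labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
            # Update step: centroid = mean of members (unchanged if the cluster is empty).
            updated = []
            for c in range(k):
                members = [p for p, l in zip(points, labels) if l == c]
                if members:
                    updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
                else:
                    updated.append(centroids[c])
            if updated == centroids:
                break
            centroids = updated
        score = wss(points, labels, centroids)
        if best is None or score < best[0]:
            best = (score, labels, centroids)
    return best


def average_silhouette(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points; needs at least two clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton cluster contributes s(i) = 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)


# Synthetic stand-in data (an assumption, not the MCI dataset): two separated blobs.
rng = random.Random(42)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
points += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(30)]

results = {}  # k -> (WSS, average silhouette or None for k = 1)
for k in range(1, 6):
    score, labels, centroids = kmeans(points, k)
    sil = average_silhouette(points, labels) if k > 1 else None
    results[k] = (score, sil)

# Silhouette criterion: choose the k >= 2 with the highest average silhouette width.
best_k = max((k for k in results if k > 1), key=lambda k: results[k][1])

for k, (score, sil) in results.items():
    print(k, round(score, 1), None if sil is None else round(sil, 2))
print("k chosen by average silhouette:", best_k)
```

In this sketch the Elbow method is read off the printed WSS column by looking for the k after which the decrease flattens, while the Silhouette method picks the k with the largest average width. An average silhouette close to 1 suggests well-separated clusters.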

Author Biographies

Noviyanti T. M. Sagala, Bina Nusantara University

Statistics Department, School of Computer Science

Alexander Agung Santoso Gunawan, Bina Nusantara University

Computer Science Department, School of Computer Science
