Revisiting Silhouette Aggregation

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 15243)

Included in the following conference series:

International Conference on Discovery Science

Abstract

The Silhouette coefficient is an established internal clustering evaluation measure that produces a score per data point, assessing the quality of its cluster assignment. To assess the clustering of the whole dataset, the scores of all points are typically (micro) averaged into a single value. A rarely employed alternative is to average first within each cluster and then (macro) average across clusters. As we illustrate in this work with a synthetic example, the typical micro-averaging strategy is sensitive to cluster imbalance, while the overlooked macro-averaging strategy is far more robust. Investigating macro-Silhouette further, we find that uniform sub-sampling, the only strategy available in existing libraries, harms the measure’s robustness against imbalance. We address this issue by proposing a per-cluster sampling method. An empirical analysis on eight real-world datasets in two clustering tasks reveals that the two coefficients disagree on imbalanced datasets.
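
The two aggregation strategies described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`silhouette_samples`, `micro_silhouette`, `macro_silhouette`), the Euclidean distance, the zero score for singleton clusters, and the toy 1-D data are all choices made for this example only.

```python
import math
from collections import defaultdict


def silhouette_samples(points, labels):
    """Per-point Silhouette scores with Euclidean distance.

    Assumes at least two clusters. Singleton clusters get a score of 0.0
    (a common convention, assumed here).
    """
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def micro_silhouette(points, labels):
    """Micro average: mean over all points, so large clusters weigh more."""
    scores = silhouette_samples(points, labels)
    return sum(scores) / len(scores)


def macro_silhouette(points, labels):
    """Macro average: mean within each cluster, then mean across clusters."""
    scores = silhouette_samples(points, labels)
    per_cluster = defaultdict(list)
    for score, lab in zip(scores, labels):
        per_cluster[lab].append(score)
    cluster_means = [sum(v) / len(v) for v in per_cluster.values()]
    return sum(cluster_means) / len(cluster_means)


# Hypothetical imbalanced toy data: a tight cluster of 10 points and a
# loose cluster of 2 points, one of which lies close to the large cluster.
points = [(i / 10,) for i in range(10)] + [(1.2,), (4.0,)]
labels = [0] * 10 + [1] * 2

micro = micro_silhouette(points, labels)
macro = macro_silhouette(points, labels)
print(f"micro-averaged Silhouette: {micro:.3f}")
print(f"macro-averaged Silhouette: {macro:.3f}")
```

On this toy data the ten well-clustered points dominate the micro average, so the micro score comes out noticeably higher than the macro score, which gives the poorly formed two-point cluster the same weight as the large one.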

Acknowledgements

– This work has been supported by project MIS 5154714 of the National Recovery and Resilience Plan Greece 2.0 funded by the European Union under the NextGenerationEU Program.

– This research project is implemented in the framework of the H.F.R.I. call “Basic research Financing (Horizontal support of all Sciences)” under the National Recovery and Resilience Plan “Greece 2.0” funded by the European Union - NextGenerationEU (H.F.R.I. Project Number: 15940).

Author information

Authors and Affiliations

Department of Informatics, Athens University of Economics and Business, Patission 76, 104 34, Athens, Greece
John Pavlopoulos
Archimedes/Athena RC, Athens, Greece
John Pavlopoulos
Department of Computer Science and Engineering, University of Ioannina, 45110, Ioannina, Greece
Georgios Vardakas & Aristidis Likas

Corresponding author

Correspondence to John Pavlopoulos.

Editor information

Editors and Affiliations

University of Pisa, Pisa, Italy
Dino Pedreschi
University of Pisa, Pisa, Italy
Anna Monreale
University of Pisa, Pisa, Italy
Riccardo Guidotti
Scuola Normale Superiore (SNS), Pisa, Italy
Roberto Pellungrini
University of Pisa, Pisa, Italy
Francesca Naretto

About this paper

Cite this paper

Pavlopoulos, J., Vardakas, G., Likas, A. (2025). Revisiting Silhouette Aggregation. In: Pedreschi, D., Monreale, A., Guidotti, R., Pellungrini, R., Naretto, F. (eds) Discovery Science. DS 2024. Lecture Notes in Computer Science, vol 15243. Springer, Cham. https://doi.org/10.1007/978-3-031-78977-9_23

DOI: https://doi.org/10.1007/978-3-031-78977-9_23
Published: 28 January 2025
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78976-2
Online ISBN: 978-3-031-78977-9
eBook Packages: Computer Science, Computer Science (R0)
