A Clustering Validation Index Based on Semantic Description

  • Conference paper
Intelligent Systems (BRACIS 2023)

Abstract

In clustering problems where the objective is based not specifically on spatial proximity but on feature patterns and their semantic description, traditional internal cluster validation indices might not be appropriate. This article proposes a novel validity index that suggests the most appropriate number of clusters based on a semantic description of categorical databases. To assess the index, we also propose a synthetic data generator designed specifically for this type of application. We tested data sets with different configurations to compare the performance of the proposed index against well-known indices from the literature. The results indicate that the index has great potential for discovering the number of clusters in the type of application studied, and that the data generator is able to produce relevant data sets for the internal validation process.
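As a rough illustration of how an internal validation index is used to suggest the number of clusters, the sketch below scores several candidate partitions of a tiny binary data set and keeps the best-scoring one. It does not implement the index proposed in this paper, whose definition is not given in the abstract; it uses the well-known silhouette width with Hamming distance as a stand-in. The data, the candidate labelings, and the names `hamming`, `silhouette`, and `suggest_k` are all hypothetical.

```python
from typing import Dict, List, Sequence

Row = Sequence[int]


def hamming(a: Row, b: Row) -> int:
    """Number of features on which two categorical rows disagree."""
    return sum(x != y for x, y in zip(a, b))


def silhouette(data: List[Row], labels: List[int]) -> float:
    """Mean silhouette width (Rousseeuw, 1987) under Hamming distance."""
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: s(i) = 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(hamming(data[i], data[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(hamming(data[i], data[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(data)


def suggest_k(data: List[Row], candidates: Dict[int, List[int]]) -> int:
    """Return the candidate number of clusters with the highest index value."""
    scores = {k: silhouette(data, labels) for k, labels in candidates.items()}
    return max(scores, key=scores.get)


# Hypothetical toy data: three groups of rows with distinct feature patterns.
data = [
    (1, 1, 0, 0, 0, 0), (1, 1, 0, 0, 0, 0),
    (0, 0, 1, 1, 0, 0), (0, 0, 1, 1, 0, 0),
    (0, 0, 0, 0, 1, 1), (0, 0, 0, 0, 1, 1),
]

# Hypothetical partitions, as a clustering algorithm might return for each k.
candidates = {
    2: [0, 0, 1, 1, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
    4: [0, 0, 1, 1, 2, 3],
}

best_k = suggest_k(data, candidates)
```

In practice the candidate partitions would come from running a clustering algorithm once per value of k, and the index being compared would replace `silhouette` in `suggest_k`.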

Supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES).

Notes

  1. https://dcta.mil.br/

  2. https://github.com/verri/sledge

  3. https://github.com/aquinordg/rdga_4k

Author information

Corresponding author

Correspondence to Roberto Douglas Guimarães de Aquino.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

de Aquino, R.D.G., Curtis, V.V., Verri, F.A.N. (2023). A Clustering Validation Index Based on Semantic Description. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science, vol 14197. Springer, Cham. https://doi.org/10.1007/978-3-031-45392-2_21

  • DOI: https://doi.org/10.1007/978-3-031-45392-2_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45391-5

  • Online ISBN: 978-3-031-45392-2

  • eBook Packages: Computer Science, Computer Science (R0)
