A Novel Approach for Unveiling Layered Geometric Patterns in Noisy Unsupervised Data: A Study on Drug-Like Molecules

Luiz C. D. Cavalcanti; Ricardo A. Rios; Tiago J. S. Lopes; Tatiane N. Rios

doi:10.5753/eniac.2025.14128

Luiz C. D. Cavalcanti UFBA
Ricardo A. Rios UFBA
Tiago J. S. Lopes Nezu Biotech GmbH
Tatiane N. Rios UFBA

Abstract

Identifying meaningful examples in datasets where most or all data belong to a single category is a common challenge in Machine Learning (ML). In many real-world scenarios in science, medicine, and industry, data for the target class are abundant, while data from other classes are scarce or missing. This makes it difficult for ML models to differentiate between what belongs to the target class and what does not. This paper presents a novel approach to this problem that combines clustering and geometric analysis techniques. We develop a one-class classification method capable of detecting whether a sample belongs to the target class, even in the absence of labeled data from other classes. The proposed method is applied to a curated dataset containing the chemical properties of 1,615 drug molecules approved by the U.S. Food and Drug Administration (FDA); the dataset itself is a valuable resource for future research. Our findings indicate that integrating geometric and density-based insights improves generalization and risk estimation in one-class learning tasks, providing a robust solution for analyzing noisy, unlabeled data.
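
The abstract does not spell out the algorithm, so the following is only a minimal sketch of how clustering and geometric analysis could be combined for one-class classification. It assumes two-dimensional features, a simplified density-based grouping step, and a convex-hull membership test. These choices, along with every function name and parameter (`density_clusters`, `fit_hulls`, `is_target`, `eps`, `min_pts`), are illustrative assumptions and not the authors' method.

```python
from math import dist


def density_clusters(points, eps, min_pts):
    """Group points with a simplified density rule (assumption, DBSCAN-like).

    A point is "core" if at least min_pts points (itself included) lie within
    eps of it. Clusters grow outward from core points; points never reached
    from a core point are treated as noise and dropped.
    """
    n = len(points)
    neighbours = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbours]
    label = [None] * n
    clusters = []
    for i in range(n):
        if not core[i] or label[i] is not None:
            continue
        cid = len(clusters)
        label[i] = cid
        members, stack = [i], [i]
        while stack:
            p = stack.pop()  # only core points are ever pushed
            for q in neighbours[p]:
                if label[q] is None:
                    label[q] = cid
                    members.append(q)
                    if core[q]:
                        stack.append(q)
        clusters.append([points[m] for m in members])
    return clusters


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def in_hull(hull, p):
    """True if p is inside or on the boundary of a counter-clockwise hull."""
    if len(hull) < 3:
        return False  # degenerate hull (point or segment) encloses no area
    k = len(hull)
    return all(cross(hull[i], hull[(i + 1) % k], p) >= 0 for i in range(k))


def fit_hulls(points, eps=1.5, min_pts=3):
    """Build one convex hull per dense cluster of target-class samples."""
    return [convex_hull(c) for c in density_clusters(points, eps, min_pts)]


def is_target(hulls, p):
    """Accept a sample if it falls inside the hull of any dense cluster."""
    return any(in_hull(h, p) for h in hulls)


# Toy target-class data: two dense 3x3 grids plus one isolated noisy sample.
grid_a = [(x, y) for x in range(3) for y in range(3)]
grid_b = [(x, y) for x in range(10, 13) for y in range(10, 13)]
samples = grid_a + grid_b + [(6, 6)]
hulls = fit_hulls(samples)
```

With this toy data, `is_target(hulls, (1, 1))` accepts a point inside the first grid, while the isolated sample at `(6, 6)` is rejected because it never joins a dense cluster. Real molecular descriptors have more than two dimensions, where hull construction and membership testing need different algorithms than the planar ones used here.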
