A sketch for the KS test for Big Data
Resumo
Motivated by the challenges of Big Data, this paper presents an approximative algorithm to assess the Kolmogorov-Smirnov test. This goodness of fit statistical test is extensively used because it is non-parametric. This work focuses on the one-sample test, which considers the hypothesis that a given univariate sample follows some reference distribution. The method allows to evaluate the departure from such a distribution of a input stream, being space and time efficient. We show the accuracy of our algorithm by making several experiments in different scenarios: varying reference distribution and its parameters, sample size, and available memory. The performance of rival methods, some of which are considered the state-of-the-art, were compared. It is demonstrated that our algorithm is superior in most of the cases, considering the absolute error of the test statistic.
Referências
Bifet, A. Machine learning for data streams: with practical examples in MOA. Adaptive computation and machine learning series. MIT Press, Cambridge, Massachusetts, 2017.
Buragohain, C. and Suri, S. Quantiles on Streams. In Encyclopedia of Database Systems, L. LIU and M. T. ÖZSU (Eds.). Springer US, Boston, MA, pp. 2235-2240, 2009.
Claffy, K. C., Polyzos, G. C., and Braun, H.-W. Application of sampling methodologies to network traffic characterization. ACM SIGCOMM Computer Communication Review 23 (4): 194-203, Oct., 1993.
dos Reis, D. M., Flach, P., Matwin, S., and Batista, G. Fast Unsupervised Online Drift Detection Using Incremental Kolmogorov-Smirnov Test. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '16. Association for Computing Machinery, New York, NY, USA, pp. 1545-1554, 2016.
Fenwick, P. M. A new data structure for cumulative frequency tables. Software: Practice and Experience 24 (3): 327-336, 1994.
Gonzalez, T., Sahni, S., and Franta, W. R. An Efficient Algorithm for the Kolmogorov-Smirnov and Lilliefors Tests. ACM Transactions on Mathematical Software 3 (1): 60-64, Mar., 1977.
Gonzalez, T., Sahni, S., and Franta, W. R. An efficient approximate algorithm for the Kolmogorov-Smirnov and Lilliefors tests. Journal of Statistical Computation and Simulation 6 (3-4): 257-263, Jan., 1978.
Greenwald, M. and Khanna, S. Space-efficient online computation of quantile summaries. ACM SIGMOD Record 30 (2): 58-66, May, 2001.
Halim, S. Competitive programming 4 : the lower bound of programming contests in the 2020s. Lulu.com, Singapore, 2020.
Hu, H., Kantardzic, M., and Sethi, T. S. No free lunch theorem for concept drift detection in streaming data classification: A review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10 (2): e1327, 2020.
Kifer, D., Ben-David, S., and Gehrke, J. Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30. VLDB '04. VLDB Endowment, Toronto, Canada, pp. 180-191, 2004.
Laaksonen, A. Guide to Competitive Programming - Learning and Improving Algorithms Through Contests. Undergraduate Topics in Computer Science. Springer, Cham, Switzerland, 2017.
Lall, A. Data streaming algorithms for the kolmogorov-smirnov test. In Proceedings of the 2015 IEEE International Conference on Big Data (Big Data). BIG DATA '15. IEEE Computer Society, USA, pp. 95-104, 2015.
Masson, C., Rim, J. E., and Lee, H. K. DDSketch: A fast and fully-mergeable quantile sketch with relative-error guarantees. Proceedings of the VLDB Endowment 12 (12): 2195-2205, Aug., 2019.
Nguyen, H. D. A stream-suitable kolmogorov-smirnov-type test for big data analysis, 2017.
Nguyen, H. D. A Two-Sample Kolmogorov-Smirnov-Like Test for Big Data. In Data Mining, Y. L. Boo, D. Stirling, L. Chi, L. Liu, K.-L. Ong, and G. Williams (Eds.). Communications in Computer and Information Science. Springer, Singapore, pp. 89-106, 2018.
Pibiri, G. E. and Venturini, R. Practical trade-offs for the prefix-sum problem. Software: Practice and Experience 51 (5): 921-949, 2021.
Raab, C., Heusinger, M., and Schleif, F.-M. Reactive Soft Prototype Computing for Concept Drift Streams. Neurocomputing vol. 416, pp. 340-351, Nov., 2020.
Shrivastava, N., Buragohain, C., Agrawal, D., and Suri, S. Medians and beyond: New aggregation techniques for sensor networks. In Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems. SenSys '04. Association for Computing Machinery, New York, NY, USA, pp. 239-249, 2004.