Streaming state management methods for real-time data deduplication

João V. A. Esteves; Sérgio Lifschitz; Rosa M. E. M. Costa; Ana Carolina Almeida

doi:10.5753/sbbd.2020.13652

João V. A. Esteves Universidade do Estado do Rio de Janeiro
Sérgio Lifschitz Pontifícia Universidade Católica do Rio de Janeiro https://orcid.org/0000-0003-3073-3734
Rosa M. E. M. Costa Universidade do Estado do Rio de Janeiro
Ana Carolina Almeida Universidade do Estado do Rio de Janeiro

Abstract

Data duplication is a common problem in data stream processing applications. It arises from software errors or from measures adopted to prevent data loss, and it jeopardizes real-time data analyses. This paper explores stream-based deduplication methods to identify the challenges they pose, and proposes a decision method for choosing the most appropriate strategy for a given domain. The work investigates native solutions and auxiliary tools that provide data deduplication and fault tolerance. The experimental results show that it is necessary either to use fast additional storage that persists the already-read keys for as long as they can reappear, or to use optimized storage with fast key lookup.
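
The idea of persisting read keys only for as long as a duplicate can still arrive can be illustrated with a minimal sketch. This is not the implementation evaluated in the paper: the `SeenKeys` class, the `deduplicate` function, the `(key, timestamp, payload)` event shape, and the `ttl_seconds` parameter are all hypothetical names chosen for illustration. An in-memory `OrderedDict` stands in for the fast additional storage, and events are assumed to arrive in non-decreasing timestamp order.

```python
from collections import OrderedDict
from typing import Hashable, Iterable, Iterator, Tuple

# Hypothetical event shape: (key, event time in seconds, payload).
Event = Tuple[Hashable, float, object]


class SeenKeys:
    """In-memory stand-in for a fast key store (an assumption, not the paper's store)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        # Maps key -> time it was first seen; insertion order equals time order
        # because events are assumed to arrive with non-decreasing timestamps.
        self._first_seen: "OrderedDict[Hashable, float]" = OrderedDict()

    def _expire(self, now: float) -> None:
        # Drop keys that are too old for a duplicate to be expected any more.
        while self._first_seen:
            _, seen_at = next(iter(self._first_seen.items()))
            if now - seen_at < self.ttl:
                break
            self._first_seen.popitem(last=False)

    def check_and_add(self, key: Hashable, now: float) -> bool:
        """Return True and record the key if it is new, False if it is a duplicate."""
        self._expire(now)
        if key in self._first_seen:
            return False
        self._first_seen[key] = now
        return True


def deduplicate(events: Iterable[Event], ttl_seconds: float) -> Iterator[Event]:
    """Yield only the first occurrence of each key within the TTL window."""
    store = SeenKeys(ttl_seconds)
    for key, ts, payload in events:
        if store.check_and_add(key, ts):
            yield key, ts, payload
```

In this sketch the retention window bounds the state size: a key is forgotten once `ttl_seconds` have passed since it was first seen, so a repeat arriving after that point is treated as a new event. A production setup would replace the dictionary with an external or engine-managed state store and would need to handle out-of-order events.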

Keywords: stateful streaming, state management, Apache Spark, data streaming, data deduplication, real-time processing
