Desafios algorítmicos no processamento de grandes volumes de dados

  • Marcus Ritt UFRGS
  • Luciana S. Buriol UFRGS

Resumo


O volume de dados produzidos e disponíveis em organizações, empresas e em domínio público está crescendo exponencialmente. Entretanto, a capacidade de processar esses dados não tem crescido da mesma forma. O IDC estimou que em 2007 a quantidade de informação produzida mundialmente (255 EB) ultrapassou, pela primeira vez, o espaço disponível para seu armazenamento (246 EB). Como esses dados não somente precisam ser armazenados, mas também processados e analisados de forma eficiente, o interesse em algoritmos para processamento de dados massivos cresceu significadamente nos últimos anos. Este artigo discute os modelos de algoritmos propostos para o processamento de grandes volumes de dados, apresenta os desafios nesta área, identifica aplicações no cenário Brasileiro e discute soluções possíveis para que tais desafios sejam transpostos.

Referências

Abello, J. M. and Vitter, J. S., editors (1999). A Functional approach to external graph algorithms. Dimacs Series In Discrete Mathematics And Theoretical Computer Science. American Mathematical Society. 306p.

Aggarwal, C. C. (2007). Data Streams: Models and Algorithms. Springer.

Arge, L. (2003). The buffer tree: A technique for designing batched external data structures. Algorithmica, 37(1):1–24.

Arge, L., Brodal, G. S., and Toma, L. (2000). On external-memory MST, SSSP, and Multi-way Planar Graph Separation. In Proceedings of the 7th Scandinavian Workshop on Algorithm Theory, Lecture Notes In Computer Science, pages 433–447.

Buriol, L., Frahling, G., Leonardi, S., Sohler, C., and Marchetti-Spaccamela, A. (2006). Counting triangles in data streams. In 25th ACM Symposium on Principles of Database Systems (PODS2006), Chicago, Illinois, pages 253–262.

Buriol, L. S., Frahling, G., Leonardi, S., and Sohler, C. (2007). Estimating clustering indexes in data streams. In 15th Annual European Symposium on Algorithms (ESA), Springer Series Lecture Notes in Computer Science.

Chowdhury, R. A. (2007). Cache-efficient Algorithms and Data Structures: Theory and Experimental Evaluation. PhD thesis, University of Texas, Austin.

Chowdhury, R. A. and Ramachandran, V. (2006a). Cache-oblivious dynamic programming. In SODA ’06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 591–600, New York, NY, USA. ACM.

Chowdhury, R. A. and Ramachandran, V. (2006b). Cache-oblivious dynamic programming. In SODA ’06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 591–600, New York, NY, USA. ACM.

Conselho nacional de trânsito (2006). Resolução no. 212, Dispõe sobre a implantação do Sistema de Identificação Automática de Veículos – SINIAV em todo o território nacional.

Cormode, G., Muthukrishnan, S., Datar, M., and Indyk, P. (2003). Comparing data streams using hamming norms (how to zero in). IEEE Transactions on Knowledge and Data Engineering, 15(3):529–540.

Cortes, C., Fisher, K., Pregibon, D., and Rogers, A. (2000). Hancock: A language for extracting signatures from data streams. Proceedings of the ACM SIGKDD International Conference on Knoledge Discovery and Data Mining, ACM, pages 9–17.

Cranor, C., Johnson, T., Spataschek, O., and Shkapenyuk, V. (2003). Gigascope: A stream database for network applications. Proceedings of ACM SIGMOD, pages 647–651.

Datar, M. and Muthukrishnan, S. (2002). Estimating rarity and similarity over data stream windows. Proceedings of the 10th Annual European Symposium on Algorithms, pages 323–334.

Dementiev, R., Sanders, P., Schultes, D., and Sibeyn, J. F. (2004). Engineering an external memory minimum spanning tree algorithms. In 3rd IFIP International Conference on Theoretical Computer Science (TSC2004), pages 195–208, Toulouse, France.

Frigo, M. (1999). Portable High-Performance Programs. PhD thesis, MIT.

Frigo, M., Leiserson, C. E., Prokop, H., and Ramachandran, S. (1999). Cache-oblivious algorithms. In Proc. 40th IEE Sympos. Found. Comp. Sci., pages 285–297.

Hennessy, J. L. and Patterson, D. A. (2002). Computer architecture: A quantitative approach. Morgan Kauffmann Publishers Inc., 3rd edition.

IDC (2007). The expanding digital universe: A forecast of the worldwide information growth through 2010. White paper sponsored by EMC-IDC.

Lander, E. S. (1999). Array of hope. Nature Genetics Supplement, 21:3–4.

Manku, G. S. and Motwani, R. (2002). Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, pages 346–369.

Meyer, U., Sanders, P., and Sibeyn, J., editors (2003). Algorithms for Memory Hierarchies: Advanced Lectures. Springer, 1st edition.

Muthukrishnan, S. (2005). Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science. Now Publishers Inc.

NCBI. [link]. Online. Acessado em 22/09/2007.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The pagerank citation ranking: Bringing order to the web. Technical Report Technical report, Stanford University.

Prokop, H. (1999). Cache-oblivious algorithms. PhD thesis, MIT.

Strassen, V. (1969). Gaussian elimination is not optimal. Numer. Math., 13:354–356.

Szalay, A. and Gray, J. (2002). 2020 computing: Science in an exponential world. Nature, 440:413–414.

Vitter, J. S. (2007). External memory algorithms and data structures: Dealing with massive data. ACM Computing Surveys, 33(2).

Whaley, R. C., Petitet, A., and Dongarra, J. (2001). Automated empirical optimizations of software and the atlas project. Parallel Computing, 27(1-2):3–35.

Y.J.Chiang, M.T.Goodrich, Grove, E., Tamassia, R., Vengroff, D., and Vitter, J. (1995). External-memory graph algorithms. In Proc. 6th ACM-SIAM Symposium on Discrete Algorithms, pages 139–149.

Z. Bar-Yosseff, R. Kumar, D. S. (2002). Reductions in streaming algorithms, with an application to counting triangles in graphs. Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pages 623–632.
Publicado
20/07/2009
RITT, Marcus; BURIOL, Luciana S.. Desafios algorítmicos no processamento de grandes volumes de dados. In: SEMINÁRIO INTEGRADO DE SOFTWARE E HARDWARE (SEMISH), 36. , 2009, Bento Gonçalves/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2009 . p. 169-183. ISSN 2595-6205.