A Peer-to-Peer system for executing bioinformatics applications: a case study of distributed and scalable information management over massive volumes of data

E. O. Ribeiro; M. E. M. T. Walter; M. M. Costa; G. Pappas; BIOFOCO Network

E. O. Ribeiro UnB
M. E. M. T. Walter UnB
M. M. Costa Embrapa
G. Pappas Embrapa
BIOFOCO Network

Resumo

The enourmous and heterogeneous data sets, distributed on the world, that must be processed in order to extract useful information, generated research in new techniques and methods for processing them correctly and in an acceptable time. Particularly, bioinformatics applications require large amounts of computational resources, as the algorithms are time expensive and the databases are heterogeneous and space consuming. Besides, the Peer-to-Peer (P2P) model has been successfully used for applications demanding great amount of computational resources that are executed on geographically distributed machines. So, in this work, we propose a distributed environment using the P2P model that can execute bioinformatics applications. For testing the system, we used BLAST, a family of algorithms for searching similarities between two biological sequences, since it is broadly used and requires a great amount of computation, due to very large databases. We describe the architecture of our framework, and present experiments using real biological data, that were executed on machines of three institutions, University of Brasilia, Catholic University of Brasilia and Embrapa/Genetic Resources and Biotechnology. The good results achieved, when comparing the performance of our distributed system with a standalone BLAST execution, show that P2P model can be successfully applied to bioinformatics applications. Generically, our system could contribute, in practice, for processing applications demanding massive volumes of data. At last, we address how our work relates with the Grand Challenges in Computer Science Research in Brazil (2006− 2016), as proposed by the Brazilian Computer Society, and discuss some research trends related to these Challenges.

Referências

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic Local Alignment Search Tool. Journal of Molecular Biology, 215:403–410.

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25:3389–402.

Anderson, D. P. (2004). BOINC: a system for public-resource computing and storage. Grid Computing, 2004. Proceedings. Fifth IEEE/ACM International Workshop on, pages 4–10.

Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., and Wheeler, D. L. (2005). Genbank. Nucleic Acids Research, 33 (Database issue):D34–D38.

Braun, R. C., Pedretti, K. T., Casavant, T. L., Scheetz, T. E., Birkett, C. L., and Roberts, C. A. (2001). Parallelization of local blast service on workstation clusters. Future Gener. Comput. Syst., 17(6):745–754.

Canada (2008). Biogrid canada [link]. Web Page.

Charikar, M., Chen, K., and Farach-Colton, M. (2002). Finding frequent items in data streams. In Proceedings of the 29th International Colloquium on Automata,Languages, and Programming, 2002.

Cormode, G. and Muthukrishnan, S. (2004). An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. In Farach-Colton, M., editor, LATIN 2004: Theoretical Informatics, 6th Latin American Symposium, Buenos Aires, Argentina, April 5-8, 2004, Proceedings, volume 2976 of Lecture Notes in Computer Science, pages 29–38. Springer.

Darling, A., Carey, L., and Feng, W. (2003). The Design, Implementation, and Evaluation of mpiBLAST. In 4th International Conference on Linux Clusters: The HPC Revolution 2003 in conjunction with the ClusterWorld Conference and Expo.

Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113.

Decandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. (2007). Dynamo: amazon’s highly available key-value store. In SOSP ’07: Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles, pages 205–220, New York, NY, USA. ACM Press.

Felipe, M. S. S., Andrade, R. V., Almeida, N. F., Carvalho, M. J. A., CENTRO-OESTE, R. G., Walter, M. E. M. T., and Brígido, M. M. (2005). Transcriptional profiles of the human pathogenic fungus paracoccidioides brasiliensis in mycelium and yeast cells. The Journal of Biological Chemistry, 280(1074):24706–24714.

Feng, W. (2003). Green destiny + mpiblast = bioinfomagic. In 10th International Conference on Parallel Computing (ParCo), 2003.

Foster, I. and Iamnitchi, A. (2003). On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing. In 2nd International Workshop on Peer-to-Peer Systems (IPTPS ’03).

Foster, I., Kesselman, C., and Tuecke, S. (2001). The Anatomy of the Grid: Enabling Scalable Virtual Organizations. Lecture Notes in Computer Science, 2150.

Foster, I., Voeckler, J., Wilde, M., and Zhao, Y. (2002). Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation. In 14th Conference on Scientific and Statistical Database Management.

Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The google file system. In SOSP ’03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 29–43, New York, NY, USA. ACM Press.

Grant, J. D., Dunbrack, R. L., Manion, F. J., and Ochs, M. F. (2002a). BeoBLAST: distributed BLAST and PSI-BLAST on a Beowulf cluster. Bioinformatics, 18(5):765–766.

Grant, J. D., Jr., R. L. D., Manion, F. J., and Ochs, M. F. (2002b). Beoblast: distributed blast and psi-blast on a beowulf cluster. Bioinformatics, 18(5):765–766.

Guo, L., Chen, S., Xiao, Z., Tan, E., Ding, X., and Zhang, X. (2005). Measurements, Analysis, and Modeling of BitTorrent-like Systems. In Proceedings of the ACM/SIGCOMM Internet Measurement Conference (IMC-05), 2005.

Hamilton, J. R. (2007). On designing and deploying internet-scale services. In Proceedings of the 21th Large Installation System Administration Conference, LISA 2007, Dallas, Texas, USA, November 11-16, 2007, pages 231–242.

Hokamp, K., Shields, D. C., Wolfe, K. H., and Caffrey, D. R. (2003). Wrapping up Blast and other applications for use on Unix clusters. Bioinformatics, 19:441–442.

Japan (2008). Biogrid japan. [link]. Web Page.

JXTA (2008). Project JXTA – an open protocol for peer-to-peer [link]. Web Page.

Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., and Lewin, D. (1997). Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In STOC ’97: Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, pages 654–663. ACM Press.

Krishnan, A. (2005a). GridBLAST: a Globus-based high-throughput implementation of BLAST in a Grid computing framework. Concurrency and Computation: Practice and Experience, 17:1607–1623.

Krishnan, A. (2005b). GridBLAST: a Globus-based high-throughput implementation of BLAST in a Grid computing framework. j-CCPE, 17(13):1607–1623.

Larson, S., Snow, C., Shirts, M., and Pande, V. (2002). Folding@home and Genome@home: using distributed computing to tackle previously intractable problems in computational biology. Computational Genomics.

Lee, Y. and Quackenbush, J. (2003). Using the TIGR gene index databases for biological discovery. Curr Protoc Bioinformatics, 1.6.

Lin, H., Ma, X., Chandramohan, P., Geist, A., and NagizaSamatova (2005). Efficient Data Access for Parallel BLAST. In IPDPS ’05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) - Papers, page 72.2, Washington, DC, USA. IEEE Computer Society.

Mathog, D. R. (2003). Parallel BLAST on split databases. Bioinformatics, 19(14):1865–1866.

Maymounkov, P. and Mazières, D. (2002). Kademlia: A Peer-to-Peer Information System Based on the XOR Metric. In IPTPS First International Peer-to-Peer Systems Workshop, pages 53–65.

Milojicic, D. S., Kalogeraki, V., Lukose, R., Nagaraja, K., Pruyne, J., Richard, B., Rollins, S., and Xu, Z. (2002). Peer-to-Peer Computing. Technical report, HP Laboratories Palo Alto.

NCBI (2008). National center for biotechnology information [link]. Web Page.

Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. (2008). Pig Latin: A Not-So-Foreign Language for Data Processing. Proceedings of the ACM/SIGMOD, 2008. To be published.

Pavan, A. and Tirthapura, S. (2007). Range-efficient counting of distinct elements in a massive data stream. SIAM Journal on Computing, 37(2):359–379.

Pierre, G. and van Steen, M. (2006). Globule: a collaborative content delivery network. IEEE Communications Magazine, 44(8):127–133. [link].

Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. (2005). Interpreting the Data: Parallel Analysis with Sawzall. In Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, volume 13, pages 227–298. Google Inc, IOS Press.

Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. (2001). A Scalable Content Addressable Network. In Proceedings of ACM SIGCOMM 2001.

Rowstron, A. and Druschel, P. (2001). Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), pages 329–350.

SBC (2006). Brazilian Computer Society (Sociedade Brasileira de Computação - SBC), Grand Challenges in Computer Science Research in Brazil (2006 - 2016). Web Page.

SBC (2008). Brazilian Computer Society (Sociedade Brasileira de Computação (Sociedade Brasileira de Computação - SBC). Web Page.

Shirts, M. and Pande, V. (2000). Screensavers of the world, unite! Science, 290:1903–1904.

Simpson, A. J. G. and co-authors (2000). The genome sequence of the plant pathogen Xylella fastidiosa. Nature, 406(6792):151–157.

Stoica, I., Morris, R., Karger, D., Kaashoek, F., and Balakrishnan, H. (2001). Chord: A scalable Peer-To-Peer Lookup Service for Internet Applications. In Proceedings of the 2001 ACM SIGCOMM Conference, pages 149–160.

Stribling, J., Sit, E., Kaashoek, M. F., Li, J., and Morris, R. (2007). Don’t give up on distributed file systems. In Proc. of the 6th IPTPS.

Sun Microsystems (2005). JXTA v2.3.x: Java Programmer’s Guide. White Paper.

Traversat, B., Abdelaziz, M., Duigou, M., Hugly, J., Pouyoul, E., and Yeager, B. (2002). Project JXTA Virtual Network.

Traversat, B., Abdelaziz, M., and Pouyoul, E. (2003). A Loosely-Consistent DHT Rendezvous Walker. Technical report, Sun Microsystems, Inc.

Wang, J. and Mu, Q. (2003). Soap-HT-BLAST: high throughput BLAST based on Web services. Bioinformatics, 19(14):1863–1864.

Wang, L., Park, K. S., Pang, R., Pai, V., and Peterson, L. (2004). Reliability and Security in the Codeen content distribution network. In ATEC ’04, pages 14–14, Berkeley, CA, USA. USENIX Association.

Zhao, B. Y., Huang, L., Stribling, J., S. C. Rhea, A. D. J., and Kubiatowicz, J. D. (2004). Tapestry: A Resilient Global-scale Overlay for Service Deployment. IEEE Journal on Selected Areas in Communications, 22(1).