Towards an Empirical Evaluation of Scientific Data Indexing and Querying
Keywords: computational fluid dynamics, dataflow management, scientific data indexing, scientific data querying
Computational simulations usually produce large amounts of data at regular time steps. Heterogeneous simulation outputs are stored in different file formats and on distinct storage devices. Therefore, the main challenges in accessing simulation data relate to time-to-query, that is, the effort spent placing all data into a common framework, issuing a high-level query statement, and obtaining the result set. Loading simulation data into a Database Management System (DBMS) is either impractical, as it demands a prohibitive amount of time for data preparation, or infeasible, as the data files are still needed in their original form (scientific applications still need to read and write contents to those files). In this article, we discuss the complementary approaches of adaptive querying and raw data file indexing for accessing simulation results stored in multiple sources (e.g., raw data files) without data loading. In particular, we review (i) NoDB PostgresRAW routines for adaptive query processing, and (ii) FastBit methods for raw data file indexing and querying. We examine the behavior of both strategies in a real case study of a computational fluid dynamics simulation in the domain of sediment deposition. In this experimental evaluation, we measured the elapsed time for index construction and query processing for six distinct query categories over 62 time steps, which amounts to 372 distinct queries on 44,160 files (12.2 GB) produced by the computational simulation. Results show that FastBit is faster than PostgresRAW for query execution in all but low-selectivity query scenarios. Complementarily, the results also show that PostgresRAW outperforms FastBit whenever users are interested in reducing time-to-query rather than the query execution time itself.