ParallelNACluster: A parallel clustering strategy for matching multiple catalogs
Abstract
Cross-matching of astronomical catalogs aims to identify the celestial objects common to different astronomical surveys. Traditional approaches in astronomy do not address the matching problem at large data volumes. In this paper, we improve the NACluster algorithm by creating ParallelNACluster, a parallel version of NACluster that takes advantage of input data partitioning and handles large volumes of data even on a small set of hardware. In addition, we propose SCIBoundary, a new strategy for matching neighboring stars placed in different data partitions. The strategy leads to equivalent solutions in both NACluster and ParallelNACluster.
Keywords:
Data Matching, Parallelism, Clustering
References
Dai, B.-R. and Lin, I.-C. (2012). Efficient Map/Reduce-Based DBSCAN Algorithm with Optimized Data Partition. In Proceedings of the 2012 IEEE Fifth International Conference on Cloud Computing, pages 59–66, Washington, DC, USA. IEEE Computer Society.
Freire, V. P., Porto, F., Akbarinia, R., and de Macêdo, J. A. F. (2014). NACluster: A Non-supervised Clustering Algorithm for Matching Multi Catalogues. In 2014 IEEE 10th International Conference on e-Science, pages 83–86. IEEE.
Gaspar, D. and Porto, F. (2014). A Multi-Dimensional Equi-Depth Partitioning Strategy for Astronomy Catalog Data.
Kwon, Y., Nunley, D., Gardner, J. P., Balazinska, M., Howe, B., and Loebman, S. (2010). Scalable clustering algorithm for N-body simulations in a shared-nothing cluster. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 6187 LNCS:132–150.
Zaschke, T., Zimmerli, C., and Norrie, M. C. (2014). The PH-tree: A Space-efficient Storage Structure and Multi-dimensional Index. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pages 397–408, New York, NY, USA. ACM.
Zhao, W., Ma, H., and He, Q. (2009). Parallel k-means clustering based on mapreduce. In IEEE International Conference on Cloud Computing, pages 674–679. Springer.
Published
2017-10-02
How to Cite
FREIRE, Vinícius Pires de Moura; PORTO, Fábio; MACÊDO, José A. F. de. ParallelNACluster: A parallel clustering strategy for matching multiple catalogs. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 32., 2017, Uberlândia/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2017. p. 100-111. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2017.171359.
