QASports: A Question Answering Dataset about Sports

Pedro Calciolari Jardim; Leonardo Mauro Pereira Moraes; Cristina Dutra Aguiar

doi:10.5753/dsw.2023.233602

Pedro Calciolari Jardim, Universidade de São Paulo, http://orcid.org/0000-0001-9475-2526
Leonardo Mauro Pereira Moraes, Universidade de São Paulo / Amaris Consulting, https://orcid.org/0000-0002-9553-9978
Cristina Dutra Aguiar, Universidade de São Paulo

Abstract

Sport is one of the most popular and revenue-generating forms of entertainment. Analyzing data from this domain therefore opens several opportunities for Question Answering (QA) systems, such as supporting tactical decision-making. However, developing and evaluating QA systems requires datasets that pair questions with their corresponding answers. This paper addresses that need. We propose QASports, the first large sports dataset for extractive question answering. QASports contains more than 1.5 million triples of question, answer, and context covering three popular sports: soccer, American football, and basketball. We describe how the data were collected and how the questions and answers were generated, and we characterize the resulting data. We also analyze the sources used to obtain the raw data and investigate the usability of QASports by issuing "wh-queries". Finally, we describe scenarios for using QASports, highlighting its importance for training and evaluating QA systems.
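To make the notion of a question, answer, and context triple concrete, the following is a minimal sketch in Python. The class name `QATriple`, its field names, the helper functions, and the sample record are all hypothetical illustrations, not the actual QASports schema or data. The sketch assumes that "extractive" means the answer occurs verbatim in the context, and that "how" counts as a wh-word alongside the usual ones.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical record layout; the field names are assumptions,
# not the schema published with QASports.
@dataclass(frozen=True)
class QATriple:
    question: str
    answer: str
    context: str


# Assumed set of wh-words; including "how" is a choice made for this sketch.
WH_WORDS = ("what", "who", "when", "where", "which", "why", "how")


def is_extractive(triple: QATriple) -> bool:
    """Return True when the non-empty answer appears verbatim in the context."""
    return bool(triple.answer) and triple.answer in triple.context


def wh_word(question: str) -> Optional[str]:
    """Return the leading wh-word of a question, or None if it has none."""
    words = question.strip().lower().split()
    if not words:
        return None
    return words[0] if words[0] in WH_WORDS else None


# Made-up example for illustration only; not taken from the dataset.
sample = QATriple(
    question="How many players does a basketball team have on the court?",
    answer="five players",
    context="In basketball, each team has five players on the court at a time.",
)

if __name__ == "__main__":
    print(is_extractive(sample))
    print(wh_word(sample.question))
```

A check such as `is_extractive` is one simple way to confirm that a record suits extractive QA, while `wh_word` hints at how records could be grouped by question type when issuing wh-queries.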

Keywords: Question answering, Dataset, Sports, Natural language processing
