Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers
Abstract
We present Quati, a dataset designed for evaluating Information Retrieval (IR) systems on Brazilian Portuguese. It comprises queries formulated by native speakers and a curated collection of documents sourced from frequently accessed Brazilian Portuguese websites, ensuring a representative and relevant corpus. To label the query–document pairs, we use a state-of-the-art LLM, which in our assessments achieves inter-annotator agreement comparable to that of human annotators. We describe our annotation methodology, which enables the cost-effective creation of similar datasets for other languages, with an arbitrary number of labeled documents per query. As baselines, we evaluate a diverse range of open-source and commercial retrievers. Quati is publicly available at https://huggingface.co/datasets/unicamp-dl/quati, and all scripts at https://github.com/unicamp-dl/quati.