Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers
Abstract
We present Quati, a dataset designed for evaluating Information Retrieval (IR) systems on Brazilian Portuguese. It comprises queries formulated by native speakers and a curated collection of documents sourced from frequently accessed Brazilian Portuguese websites, ensuring a representative and relevant corpus. To label the query–document pairs, we use a state-of-the-art LLM, which in our assessments achieves inter-annotator agreement comparable to that of human annotators. We describe our annotation methodology, which enables the cost-effective creation of similar datasets for other languages, with an arbitrary number of labeled documents per query. As baselines, we evaluate a diverse range of open-source and commercial retrievers. Quati is publicly available at https://huggingface.co/datasets/unicamp-dl/quati, and all scripts at https://github.com/unicamp-dl/quati.