PPORTAL: Public Domain Portuguese-language Literature Dataset

  • Mariana O. Silva, Universidade Federal de Minas Gerais
  • Clarisse Scofield, Universidade Federal de Minas Gerais
  • Mirella M. Moro, Universidade Federal de Minas Gerais, https://orcid.org/0000-0002-0545-2001

Abstract


Combining human expertise with book-consumer data may help the book publishing market keep pace with the constant changes it faces. Building and releasing datasets that cover the essential elements of the book industry ecosystem is therefore important. However, little has been done in this direction for non-English languages such as Portuguese. Hence, we introduce PPORTAL, a public domain Portuguese-language literature dataset composed of book-related metadata. After an overview of its construction process and content, we present a brief exploratory data analysis that summarizes its main characteristics. We also highlight potential applications, showing how PPORTAL can serve as a resource in different research domains.
Keywords: Literature, Portuguese, Dataset, Machine Learning
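
As a rough illustration of the kind of exploratory summary the abstract mentions, the sketch below computes a few descriptive statistics over book metadata. It is a minimal example under stated assumptions, not the authors' analysis: the column names (`title`, `author`, `year`, `language`), the inline sample rows, and the `summarize` helper are all hypothetical and are not taken from PPORTAL.

```python
import csv
import io
from collections import Counter

# Hypothetical sample: these rows and column names are illustrative only
# and are NOT taken from the PPORTAL files.
SAMPLE_CSV = """title,author,year,language
Dom Casmurro,Machado de Assis,1899,pt
Memórias Póstumas de Brás Cubas,Machado de Assis,1881,pt
Os Lusíadas,Luís de Camões,1572,pt
O Cortiço,Aluísio Azevedo,1890,pt
"""


def summarize(rows):
    """Return simple descriptive statistics for a list of book records."""
    # Skip records with an empty year so int() never sees a blank string.
    years = [int(r["year"]) for r in rows if r["year"]]
    by_author = Counter(r["author"] for r in rows)
    return {
        "n_books": len(rows),
        "n_authors": len(by_author),
        "year_range": (min(years), max(years)) if years else None,
        "top_authors": by_author.most_common(3),
    }


# Parse the in-memory CSV into dictionaries keyed by the header row.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = summarize(rows)
print(summary)
```

To try this on real data, one would replace `SAMPLE_CSV` with a file opened from the dataset's Zenodo record and adjust the column names to match whatever schema the files actually use.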

Published
4 October 2021
SILVA, Mariana O.; SCOFIELD, Clarisse; MORO, Mirella M. PPORTAL: Public Domain Portuguese-language Literature Dataset. In: DATASET SHOWCASE WORKSHOP (DSW), 3., 2021, Rio de Janeiro. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 77-88. DOI: https://doi.org/10.5753/dsw.2021.17416.