PPORTAL: Public Domain Portuguese-language Literature Dataset
Abstract
Combining human expertise with book-consumer data can help the book publishing market keep pace with the constant changes it faces. Building and releasing datasets that cover the essential elements of the book industry ecosystem is therefore important. However, little has been done in this direction for non-English languages such as Portuguese. Hence, we introduce PPORTAL, a public domain Portuguese-language literature dataset composed of book-related metadata. After an overview of its building process and content, we present a brief exploratory data analysis that summarizes its main characteristics. We also highlight potential applications, showing how PPORTAL can serve as a resource in different research domains.
Keywords:
Literature, Portuguese, Dataset, Machine Learning
References
Ahmad, J., Duraisamy, P., Yousef, A., and Buckles, B. (2017). Movie success prediction using data mining. In Int’l Conf. on Computing, Communication and Networking Technologies (ICCCNT), pages 1–4. doi:10.1109/ICCCNT.2017.8204173
Champagne, A. (2020). What Is A Reader? How Readers on Goodreads are Changing the Canon in the Twenty-First Century. In 15th Annual International Conference of the Alliance of Digital Humanities Organizations, Conference Abstracts.
de Araujo, P. H. L., de Campos, T. E., de Oliveira, R. R., Stauffer, M., Couto, S., and Bermejo, P. (2018). LeNER-Br: a dataset for named entity recognition in Brazilian legal text. In Int’l Conf. on Computational Processing of the Portuguese Language, pages 313–323. Springer.
Lebrun, T. and Audet, R. (2020). Artificial Intelligence and the Book Industry. White Paper. Zenodo. doi:10.5281/zenodo.4036258
Lozano, L. C. and Planells, S. C. (2020). Best books ever dataset. Zenodo. doi:10.5281/zenodo.4265096
Maharjan, S., Kar, S., Montes, M., González, F. A., and Solorio, T. (2018). Letting emotions flow: Success prediction by modeling the flow of emotions in books. In Procs. Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 259–265. doi:10.18653/v1/N18-2042
Maity, S. K., Panigrahi, A., and Mukherjee, A. (2019). Analyzing Social Book Reading Behavior on Goodreads and How It Predicts Amazon Best Sellers, pages 211–235. Springer International Publishing, Cham.
Martín-Gutiérrez, D., Hernández Peñaloza, G., Belmonte-Hernández, A., and Álvarez García, F. (2020). A multimodal end-to-end deep learning architecture for music popularity prediction. IEEE Access, 8:39361–39374. doi:10.1109/ACCESS.2020.2976033.
Ni, J., Li, J., and McAuley, J. (2019). Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Procs. Conf. on Empirical Methods in Natural Language Processing and Int’l Joint Conf. on Natural Language Processing (EMNLP-IJCNLP), pages 188–197.
Rigau, P. and Tienda, A. (2020). 100 bestselller books during COVID-19 in Spain. Zenodo. doi:10.5281/zenodo.3820050.
Sabri, N. and Weber, I. (2021). A global book reading dataset. Data, 6(8):83. doi:10.3390/data6080083
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Comput. Surv., 34(1):1–47. doi:10.1145/505282.505283
Silva, M., Scofield, C., Oliveira, G., Seufitelli, D., and Moro, M. (2021a). Exploring Brazilian Cultural Identity Through Reading Preferences. In Anais do X Brazilian Workshop on Social Network Analysis and Mining, pages 115–126. SBC. doi:10.5753/brasnam.2021.16130
Silva, M. O., Scofield, C., and Moro, M. M. (2021b). PPORTAL: Public domain Portuguese-language literature Dataset. Zenodo. doi:10.5281/zenodo.5178063
Silva, M. O., Scofield, C., Oliveira, G. P., Seufitelli, D. B., and Moro, M. M. (2021c). BraCID: Brazilian Cultural Identity Information Through Reading Preferences. Zenodo. doi:10.5281/zenodo.4890048
Soares, F., Yamashita, G. H., and Anzanello, M. J. (2018). A parallel corpus of theses and dissertations abstracts. In International Conference on Computational Processing of the Portuguese Language, pages 345–352. Springer
Sousa, A. W. and Fabro, M. D. D. (2019). Iudicium textum dataset uma base de textos jurídicos para nlp. In XXXIV Simpósio Brasileiro de Banco de Dados: Dataset Showcase Workshop, SBBD 2019 Companion. SBC.
Wagner Filho, J. A., Wilkens, R., Idiart, M., and Villavicencio, A. (2018). The BrWaC corpus: A new open resource for Brazilian Portuguese. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Wan, M., Misra, R., Nakashole, N., and McAuley, J. J. (2019). Fine-grained spoiler detection from large-scale review corpora. In Procs. Conf. of the Association for Computational Linguistics (ACL), pages 2605–2610. doi:10.18653/v1/p19-1
Wang, X., Yucesoy, B., Varol, O., Eliassi-Rad, T., and Barabasi, A.-L. (2019). Success in books: predicting book sales before publication. EPJ Data Science, 8(31). doi:10.1140/epjds/s13688-019-0208-6
Yadollahi, A., Shahraki, A. G., and Zaiane, O. R. (2017). Current state of text sentiment analysis from opinion to emotion mining. ACM Comput. Surv., 50(2). doi:10.1145/3057270
Published
04/10/2021
How to Cite
SILVA, Mariana O.; SCOFIELD, Clarisse; MORO, Mirella M. PPORTAL: Public Domain Portuguese-language Literature Dataset. In: DATASET SHOWCASE WORKSHOP (DSW), 3., 2021, Rio de Janeiro. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 77-88. DOI: https://doi.org/10.5753/dsw.2021.17416.