Sentence Classification and Information Retrieval for Petroleum Engineering
Abstract
Classifying sentences in industrial, technical, or scientific reports can enrich text mining and information retrieval tasks with useful machine-readable metadata. This paper describes a search engine that uses sentence classification to search abstracts of scholarly papers in Petroleum Engineering. Sentences were classified into four classes based on the widely used IMRAD categories. To provide training and testing data, we produced a dataset of more than 2,200 manually labeled sentences from 278 scholarly articles in the field of Petroleum Engineering. The best-performing classifier was logistic regression, with an accuracy of 86.4%. The information retrieval system built on top of the classification system achieved a mean average precision (mAP) of 0.80.
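For readers unfamiliar with the retrieval metric reported above, the following is a minimal sketch of how mean average precision is commonly computed. It is not taken from the paper: the function names are hypothetical, and it assumes that every relevant document for a query appears somewhere in that query's ranked list, so the number of hits in the list equals the total number of relevant documents.

```python
from typing import Sequence


def average_precision(relevance: Sequence[bool]) -> float:
    """Average precision for one ranked result list.

    relevance[i] is True when the result at rank i + 1 is relevant.
    Assumption: all relevant documents are present in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not runs:
        return 0.0
    return sum(average_precision(run) for run in runs) / len(runs)


if __name__ == "__main__":
    # Two hypothetical queries with made-up relevance judgments.
    queries = [[True, False, True], [False, True]]
    print(round(mean_average_precision(queries), 4))
```

Under this definition, an mAP of 0.80 means that, averaged over the evaluation queries, relevant abstracts tend to be ranked near the top of the result list.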