Multi-Document Summarization Using Complex and Rich Features

  • Maria Lucía del Rosario Castro Jorge USP
  • Verônica Agostini USP
  • Thiago Alexandre Salgueiro Pardo USP

Abstract


Multi-document summarization consists of automatically producing a single informative summary from a collection of texts on the same topic. In this paper, we model multi-document summarization as a machine learning classification problem in which each sentence from the source texts is classified as belonging or not belonging to the summary. To this end, we combine superficial features (e.g., sentence position in the text) with deep linguistic features (e.g., semantic relations across documents). In particular, the linguistic features are given by CST (Cross-document Structure Theory). We conduct our experiments on a CST-annotated corpus of news texts. The results show that the linguistic features help to produce a better classification model, which achieves state-of-the-art results.
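The classification framing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the learner or the exact feature set, so the `Sentence` fields, the `cst_relations` count, and the one-level decision stump used as the classifier are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    position: int             # 0-based index of the sentence in its source text
    doc_length: int           # number of sentences in that source text
    cst_relations: int        # hypothetical: number of CST relations involving the sentence
    in_summary: bool = False  # gold label, used only for training


def features(s):
    """Build one superficial and one deep feature (both illustrative)."""
    rel_pos = s.position / max(s.doc_length - 1, 1)  # 0.0 means first sentence
    return [rel_pos, float(s.cst_relations)]


def train_stump(sentences):
    """Learn a one-level decision stump: the (feature, threshold, direction)
    that makes the fewest errors on the training sentences."""
    X = [features(s) for s in sentences]
    y = [s.in_summary for s in sentences]
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for greater in (True, False):
                preds = [(x[f] >= thr) == greater for x in X]
                errors = sum(p != t for p, t in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, f, thr, greater)
    return best[1:]


def classify(stump, s):
    """Return True if the sentence is predicted to belong to the summary."""
    f, thr, greater = stump
    return (features(s)[f] >= thr) == greater


def summarize(stump, sentences):
    """Keep the sentences classified as belonging to the summary."""
    return [s.text for s in sentences if classify(stump, s)]


# Toy training data (invented for illustration only).
train = [
    Sentence("A", 0, 4, 3, True),
    Sentence("B", 1, 4, 0, False),
    Sentence("C", 2, 4, 2, True),
    Sentence("D", 3, 4, 0, False),
]
stump = train_stump(train)
summary = summarize(stump, [Sentence("X", 0, 2, 0), Sentence("Y", 1, 2, 4)])
```

On this toy data the stump ends up splitting on the relation count rather than on sentence position, which mirrors the abstract's claim only in spirit; a real system would use a full learner and many more features.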

Published
19/07/2011
JORGE, Maria Lucía del Rosario Castro; AGOSTINI, Verônica; PARDO, Thiago Alexandre Salgueiro. Multi-Document Summarization Using Complex and Rich Features. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 8., 2011, Natal/RN. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2011. p. 299-310. ISSN 2763-9061.
