TSeg – A Text Segmenter for Corpus Annotation
Abstract
This paper describes TSeg, a Java application that allows for both manual and automatic segmentation of a source text into basic units of annotation. TSeg provides a straightforward way to approach this task through a clear point-and-click interface. Once the text segmentation is finished, the application outputs an XML file that may be used as input to more problem-specific annotation software. Hence, TSeg separates the identification of basic units of annotation from the task of annotating these units, making it possible for both problems to be analysed in isolation, thereby reducing the cognitive load on the user and preventing potential damage to the overall outcome of the annotation process.
Keywords:
TSeg, Text Segmenter, Corpus Annotation
References
Beeferman, D., Berger, A., and Lafferty, J. (1999). Statistical models for text segmentation. Machine Learning, 34:177–210.
Beineke, P., Hastie, T., Manning, C., and Vaithyanathan, S. (2004). An exploration of sentiment summarization. In AAAI Spring Symposium: Exploring Attitude and Affect in Text: Theories and Applications, Stanford, USA. Technical Report SS-04-07.
Birnbaum, M. H. (2004). Human research and data collection via the internet. Annual Review of Psychology, 55:803–832.
Bolshakov, I. A. and Gelbukh, A. F. (2001). Text segmentation into paragraphs based on local text cohesion. In Proceedings of the 4th International Conference on Text, Speech and Dialogue (TSD ’01), pages 158–166, Zelezna Ruda, Czech Republic.
Craggs, R. and Wood, M. M. (2004). A two dimensional annotation scheme for emotion in dialogue. In AAAI Spring Symposium: Exploring Attitude and Affect in Text: Theories and Applications, Stanford, USA. Technical Report SS-04-07.
Golcher, F. (2006). Statistical text segmentation with partial structure analysis. In Proceedings of 8th Conference on Natural Language Processing (KONVENS 2006), pages 44–51, Konstanz, Germany.
Ide, N. and Brew, C. (2000). Requirements, tools, and architectures for annotated corpora. In Proceedings of Data Architectures and Software Support for Large Corpora, pages 1–5, Paris, France. European Language Resources Association.
Kazantseva, A. and Szpakowicz, S. (2011). Linear text segmentation using affinity propagation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 284–293, Edinburgh, Scotland, UK.
Kern, R. and Granitzer, M. (2009). Efficient linear text segmentation based on information retrieval techniques. In Proceedings of the International Conference on Management of Emergent Digital EcoSystems (MEDES ’09), pages 167–171, Lyon, France.
Maeda, K., Lee, H., Medero, S., Medero, J., Parker, R., and Strassel, S. (2008). Annotation tool development for large-scale corpus creation projects at the linguistic data consortium. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco.
Muller, C. and Strube, M. (2006). Multi-level annotation of linguistic data with MMAX2. In Braun, S., Kohn, K., and Mukherjee, J., editors, Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, pages 197–214. Peter Lang, Frankfurt a.M., Germany.
O’Donnell, M. (2008). The uam corpustool: software for corpus annotation and exploration. In Proceedings of the XXVI Congreso de AESLA, Almeria, Spain.
Ogren, P. V. (2006). Knowtator: A plug-in for creating training and evaluation data sets for biomedical natural language systems. In Proceedings of the 9th International Protégé Conference, Stanford, USA.
Orăsan, C. (2003). Palinka: A highly customisable tool for discourse annotation. In Proceedings of the 4th SIGdial Workshop on Discourse and Dialog, pages 39–43, Sapporo, Japan.
Przepiórkowski, A. and Bański, P. (2009). Which xml standards for multilevel corpus annotation? In Proceedings of the 4th Language and Technology Conference, LTC 2009, pages 400–411, Poznan, Poland.
Reidsma, D., Jovanović, N., and Hofs, D. (2005). Designing annotation tools based on properties of annotation problems. In Measuring Behavior 2005, 5th International Conference on Methods and Techniques in Behavioral Research.
Roman, N. T. (2007). Emoção e a Sumarização Automática de Diálogos. PhD thesis, Instituto de Computação – Universidade Estadual de Campinas, Campinas, São Paulo.
Rubin, V., Stanton, J., and Liddy, E. (2004). Discerning emotions in texts. In AAAI Spring Symposium: Exploring Attitude and Affect in Text: Theories and Applications, Stanford, USA. Technical Report SS-04-07.
Utiyama, M. and Isahara, H. (2001). A statistical model for domain-independent text segmentation. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics (ACL ’01), Toulouse, France.
van der Vliet, N., Berzlanovich, I., Bouma, G., Egg, M., and Redeker, G. (2011). Building a discourse-annotated dutch text corpus. Bochumer Linguistische Arbeitsberichte, 3:157–171. ISSN: 2190-0949.
Varasai, P., Pechsiri, C., Sukvari, T., Satayamas, V., and Kawtrakul, A. (2008). Building an annotated corpus for text summarization and question answering. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco.
Verhagen, M. (2010). The brandeis annotation tool. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), pages 3638–3643, Valletta, Malta.
Zhao, C., Mahmud, J., and Ramakrishnan, I. (2008). Exploiting structured reference data for unsupervised text segmentation with conditional random fields. In Proceedings of the SIAM International Conference on Data Mining (SMD08), Atlanta, USA.
Published
16/05/2012
How to Cite
RODRIGUES, Felipe; SEMOLINI, Richard; ROMAN, Norton Trevisan; MONTEIRO, Ana Maria. TSeg – A Text Segmenter for Corpus Annotation. In: SIMPÓSIO BRASILEIRO DE SISTEMAS DE INFORMAÇÃO (SBSI), 8., 2012, São Paulo. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2012. p. 353-362. DOI: https://doi.org/10.5753/sbsi.2012.14419.