Semi-Automatic Construction of a Textual Entailment Dataset: Selecting Candidates with Vector Space Models

  • Erick R. Fonseca USP
  • Sandra M. Aluísio USP

Abstract


Recognizing Textual Entailment (RTE) is an NLP task that aims to detect whether the meaning of one piece of text entails the meaning of another. Despite its relevance to many NLP areas, it has been scarcely explored for Portuguese, mainly due to the lack of labeled data. An RTE dataset must contain both positive and negative examples of entailment, and neither kind should be obvious: negative examples should not be completely unrelated texts, and positive examples should not be too similar. We report ongoing work that addresses this difficulty by using Vector Space Models (VSMs) to select candidate pairs from news clusters. We compare three different VSMs and show that Latent Dirichlet Allocation achieves promising results, yielding both good positive and good negative examples.
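
The selection idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain term-count vectors as a stand-in for the VSMs compared in the paper (including Latent Dirichlet Allocation), and the function names and the `low`/`high` similarity thresholds are hypothetical values chosen only for illustration. Pairs from one news cluster are kept when their cosine similarity falls inside a band, which discards near-duplicates as well as unrelated sentences.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_candidates(sentences, low=0.3, high=0.8):
    """Return (i, j, similarity) for pairs whose similarity lies in [low, high].

    `low` and `high` are illustrative thresholds, not values from the paper.
    """
    # Assumption: whitespace tokenization and raw counts stand in for a real VSM.
    vectors = [Counter(s.lower().split()) for s in sentences]
    selected = []
    for i, j in combinations(range(len(sentences)), 2):
        sim = cosine(vectors[i], vectors[j])
        if low <= sim <= high:
            selected.append((i, j, sim))
    return selected


# A toy "news cluster": two identical sentences, one paraphrase, one unrelated.
cluster = [
    "the president signed the new law today",
    "the president signed the new law today",
    "the new law was approved by the president",
    "rain is expected in natal tomorrow",
]
pairs = select_candidates(cluster)
for i, j, sim in pairs:
    print(i, j, round(sim, 3))
```

In this toy cluster, the identical pair is rejected as too similar and the unrelated sentence is rejected as too dissimilar, so only the pairs involving the paraphrase remain as candidates for manual annotation.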

Published
04/11/2015
FONSECA, Erick R.; ALUÍSIO, Sandra M. Semi-Automatic Construction of a Textual Entailment Dataset: Selecting Candidates with Vector Space Models. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 1., 2015, Natal/RN. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2015. p. 201-210.