Semi-Automatic Construction of a Textual Entailment Dataset: Selecting Candidates with Vector Space Models

Erick R. Fonseca USP
Sandra M. Aluísio USP

Abstract

Recognizing Textual Entailment (RTE) is an NLP task that aims to detect whether the meaning of one piece of text entails the meaning of another. Despite its relevance to many areas of NLP, it has been little explored for Portuguese, mainly because of the lack of labeled data. An RTE dataset must contain both positive and negative examples of entailment, and neither kind should be obvious: negative examples should not be completely unrelated texts, and positive examples should not be too similar. We report ongoing work that addresses this difficulty by using Vector Space Models (VSMs) to select candidate pairs from news clusters. We compare three VSMs and show that Latent Dirichlet Allocation achieves promising results, yielding good positive and good negative examples.
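
The candidate selection idea can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so everything here is an assumption: a plain bag-of-words cosine similarity stands in for the VSMs the authors compare (it is not LDA), and the function names and the `low`/`high` thresholds are hypothetical. The sketch keeps only sentence pairs from one news cluster whose similarity falls inside a band, so that pairs are neither unrelated nor near-duplicates.

```python
import math
from collections import Counter
from itertools import combinations


def bow_vector(sentence):
    """Lowercased whitespace-token counts (a simple stand-in for a real VSM)."""
    return Counter(sentence.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_candidates(cluster, low=0.3, high=0.9):
    """Return (similarity, s1, s2) tuples for pairs with low <= similarity <= high.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    vectors = [bow_vector(s) for s in cluster]
    pairs = []
    for i, j in combinations(range(len(cluster)), 2):
        sim = cosine(vectors[i], vectors[j])
        if low <= sim <= high:
            pairs.append((sim, cluster[i], cluster[j]))
    # Most similar candidates first.
    return sorted(pairs, reverse=True)


# Hypothetical news cluster: two related sentences and one off-topic sentence.
cluster = [
    "the president signed the new law on monday",
    "the new law was signed by the president",
    "rain is expected in the south tomorrow",
]
candidates = select_candidates(cluster)
```

In a setting like the one the abstract describes, the surviving pairs would then go to human annotators for entailment labeling; swapping `bow_vector` for topic distributions from a trained model would change the similarity scores but not the selection logic.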

Published

04/11/2015

How to Cite

FONSECA, Erick R.; ALUÍSIO, Sandra M. Semi-Automatic Construction of a Textual Entailment Dataset: Selecting Candidates with Vector Space Models. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 1., 2015, Natal/RN. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2015. p. 201-210.