Semi-supervised Semantic Role Labeling for Brazilian Portuguese
Keywords: Brazilian Portuguese SRL, PropBank-br, Semantic Role Labeling, Semi-Supervised Learning
Semantic Role Labeling (SRL) is a natural language processing task that detects the arguments of predicates (usually verbs) and assigns their semantic roles. Such roles characterize the semantic relationships between an event and its participants, such as who did what to whom, where, when and how. This information is useful for a wide range of tasks, such as information extraction and plagiarism detection, to name a few. Commonly, a supervised classifier is trained on large annotated English resources and then used to label unseen sentences. However, most languages other than English suffer from a scarcity of annotated data, as the labeling process is expensive, time-consuming and requires the effort of human annotators. Although this limitation makes it harder to train supervised methods for those languages, it is an appropriate scenario for semi-supervised learning (SSL) methods, which are able to learn not only from labeled data but also from unlabeled data. In this article, we investigate SSL methods for the classification of semantic roles in Brazilian Portuguese, a relatively resource-poor language. Specifically, we consider a representative set of SSL methods based on low-density separation, graphs and self-training. Experiments were performed on PropBank-br, a Brazilian Portuguese corpus built with text from Brazilian newspapers, varying the number of labeled arguments. Additionally, the SSL methods were compared against state-of-the-art SRL methods. The results show that the self-training heuristic outperforms the other SSL methods and the supervised methods, even when the latter are trained on a large number of labeled arguments.
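To make the self-training idea concrete, the following is a minimal, generic sketch in Python and not the configuration evaluated in the article. It assumes a simple nearest-centroid base classifier, a softmax over negative distances as the confidence score, a fixed confidence threshold, and toy two-dimensional feature vectors; the function names (`centroids`, `predict_with_confidence`, `self_train`) and the role labels used in the example are illustrative only. In each round the classifier is trained on the labeled set, the unlabeled arguments it labels with high confidence are moved into the labeled set with their predicted roles, and the process repeats until nothing new is added.

```python
import math
from collections import defaultdict


def centroids(X, y):
    """Train the (assumed) base classifier: one mean feature vector per label."""
    sums, counts = {}, defaultdict(int)
    for x, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    return {l: [s / counts[l] for s in sums[l]] for l in sums}


def predict_with_confidence(model, x):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    weights = {l: math.exp(-math.dist(c, x)) for l, c in model.items()}
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total


def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=10):
    """Generic self-training loop; threshold and max_rounds are illustrative."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        model = centroids(X_lab, y_lab)  # retrain on the current labeled set
        remaining, added = [], False
        for x in pool:
            label, conf = predict_with_confidence(model, x)
            if conf >= threshold:
                # Confident prediction: treat it as if it were a gold label.
                X_lab.append(x)
                y_lab.append(label)
                added = True
            else:
                remaining.append(x)
        pool = remaining
        if not added or not pool:
            break  # nothing new was learned, or nothing is left to label
    return centroids(X_lab, y_lab), pool


# Toy usage: two labeled arguments and three unlabeled ones.
model, leftover = self_train(
    X_lab=[[0.0, 0.0], [5.0, 5.0]],
    y_lab=["A0", "A1"],
    X_unlab=[[0.2, 0.1], [4.8, 5.1], [2.5, 2.5]],
)
```

In this toy run the two unlabeled points near the labeled examples are absorbed in the first round, while the ambiguous midpoint never reaches the threshold and stays unlabeled. The threshold controls the usual trade-off in self-training: a lower value adds more data per round but risks reinforcing the classifier's own mistakes.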