A Fast Similarity Search kNN for Textual Datasets

Leonardo Afonso Amorim; Mateus F. Freitas; Paulo Henrique da Silva; Wellington S. Martins

Leonardo Afonso Amorim UFG
Mateus F. Freitas UFG
Paulo Henrique da Silva UFG
Wellington S. Martins UFG

Resumo

The k nearest neighbors (kNN) is an algorithm for finding the closest k points in metric spaces. Due to its high computational costs, many parallel solutions have been proposed, including some implementations targeted at modern accelerators. However, most approaches assume relatively low dimensionality and dense data. Such conditions do not apply to textual datasets, which are known for their high dimensionality and sparsity. This work presents a fine-grained parallel algorithm that applies filtering technique based on most common important terms of the query document using an inverted index and its implementation on GPU. Our method improves the top k nearest neighbors search in textual datasets by up to 37x with a single GPU.

Palavras-chave: Graphics processing units, Indexes, Approximation algorithms, Parallel algorithms, Sorting, Proposals, knn, textual datasets, high dimensionality, sparsity, filtering technique, fine-grained parallel algorithm