TY - JOUR AU - Teixeira, Carlos H. C. AU - Silva, Arlei AU - Meira Jr., Wagner PY - 2012/10/15 Y2 - 2024/03/29 TI - Min-Hash Fingerprints for Graph Kernels: A Trade-off among Accuracy, Efficiency, and Compression JF - Journal of Information and Data Management JA - JIDM VL - 3 IS - 3 SE - SBBD Articles DO - 10.5753/jidm.2012.1450 UR - https://sol.sbc.org.br/journals/index.php/jidm/article/view/1450 SP - 227 AB - <div>Graph databases that emerge from several relevant scenarios (e.g., social networks, the Web) require powerful data management algorithms and techniques. A fundamental operation in graph data management is computing the similarity between two graphs. However, due to the large scale and high dimensionality of real graph databases, computing graph similarity becomes a challenging problem in real settings. </div><div><br /></div><div>Many graph data management tasks, such as graph mining, classification, and retrieval, can be contextualized in the framework of graph kernels. A graph kernel is, roughly speaking, a function that computes the similarity between graph structures as means to enable the application of linear methods to graph data. Nevertheless, large databases usually require the use of compact representations of graphs known as graph fingerprints (or signatures). Graph fingerprinting techniques provide a solution that is a trade-off among accuracy, efficiency, and compression in graph kernels. </div><div><br /></div><div>In this paper, we study the problem of generating fingerprints for graph kernels. We introduce a graph fingerprinting technique based on the min-hashing scheme, which is a powerful strategy for computing the similarity between large sets of items using a small amount of data. An algorithm for the generation of graph fingerprints as vectors of min-hash values is presented and integrated into the framework of graph kernels. Results show that graph fingerprinting achieves efficiency gains of up to one order of magnitude with up to 97% space savings when compared against the complete set of graph substructures. Moreover, the proposed technique is up to 9 times more accurate than a baseline method.</div> ER -