Computer Science > Computational Complexity
Title: On the Complexity of Sorted Neighborhood
(Submitted on 8 Jan 2015)
Abstract: Record linkage is the task of identifying semantically equivalent records in databases. Blocking methods are employed to avoid the cost of full pairwise similarity comparisons on $n$ records. In a seminal work, Hernandez and Stolfo proposed the Sorted Neighborhood blocking method, and several empirical variants have been proposed in recent years. In this paper, we investigate the complexity of the Sorted Neighborhood procedure on which these variants are built. We show that achieving maximum performance with the Sorted Neighborhood procedure entails solving a sub-problem, which we prove to be NP-complete by reduction from the Travelling Salesman Problem. We also show that the sub-problem can occur in the traditional blocking method. Finally, we draw on recent developments in approximate Travelling Salesman solutions to define and analyze three approximation algorithms.
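
For readers unfamiliar with the procedure, the following is a minimal sketch of the basic Sorted Neighborhood idea as it is commonly described: sort the records by a blocking key, slide a fixed-size window over the sorted order, and compare only records that share a window. It is not taken from the paper. The function name `sorted_neighborhood_pairs`, the `key` and `window` parameters, and the sample records are all illustrative assumptions.

```python
def sorted_neighborhood_pairs(records, key, window):
    """Return candidate index pairs (i, j) with i < j.

    Hypothetical helper for illustration: records are sorted by `key`,
    and two records become a candidate pair only if they lie within
    `window` consecutive positions of each other in the sorted order.
    """
    if window < 2:
        return set()  # a window of one record yields no pairs

    # Indices of the records, ordered by their blocking key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))

    pairs = set()
    for pos, i in enumerate(order):
        # Pair each record with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Illustrative data only; the blocking key here is the raw string.
records = ["smith john", "smyth john", "adams mary", "smith jon", "adams marie"]
candidates = sorted_neighborhood_pairs(records, key=lambda r: r, window=2)
```

With a window of size $w$ over $n \ge w$ records, this sketch produces $(w-1)n - w(w-1)/2$ candidate pairs instead of the $n(n-1)/2$ pairs of a full comparison. Which pairs are produced depends entirely on the sort order, and that dependence is where the choice of ordering affects performance.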