Title: Classification and Clustering for Record Linkage in Large Datasets
Abstract: Record linkage, the process of linking records that correspond to unique entities within and/or across data sources, is an increasingly important problem in today’s data-rich world. Because of typographical errors, name variation, and repetition of common names, linking records of unique entities within and across large data sources can be difficult in terms of both accuracy and computational feasibility. We frame record linkage as a clustering problem, where the objects to be clustered are the records in the data source(s) and the clusters are the unique entities to which the records correspond. We use a three-step approach to record linkage. First, records are partitioned into blocks of loosely similar records to reduce the comparison space and ensure computational feasibility. We propose a sequential blocking approach that iterates through a nested set of decreasingly strict blocking criteria to reduce the comparison space more efficiently. Second, we adopt an ensemble supervised learning approach to estimate the probability that a pair of records matches. We propose a new adaptive prediction approach for classifier ensembles (specifically, random forests) that extracts and incorporates summary statistics from the distribution of estimated probabilities. Third, after transforming the estimated pairwise matching probabilities into pairwise dissimilarities, we use hierarchical clustering to link matching records. We apply these approaches to two labeled record linkage datasets: a set of labeled inventors from the United States Patent and Trademark Office database and a compilation of lists of death records from the Syrian Civil War.
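The three-step approach described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It rests on several assumptions made only for illustration: the toy records and all function names are hypothetical, blocking is a single pass on last name rather than the sequential scheme proposed here, a string-similarity score from Python's `difflib` stands in for the random forest's estimated match probability, and hierarchical clustering is taken to be single linkage cut at a fixed dissimilarity threshold (which is equivalent to finding connected components among pairs at or below that threshold).

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Toy records: (record_id, last_name, first_name). Purely illustrative data.
RECORDS = [
    (0, "smith", "john"),
    (1, "smith", "jon"),
    (2, "smith", "mary"),
    (3, "garcia", "luis"),
    (4, "garcia", "luiz"),
]


def block_records(records):
    """Step 1: partition records into blocks by a loose key (here, last name)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[1]].append(rec)
    return list(blocks.values())


def match_probability(a, b):
    """Step 2 stand-in: first-name similarity in [0, 1].

    Assumption: a trained classifier would supply this value in practice.
    """
    return SequenceMatcher(None, a[2], b[2]).ratio()


def link_block(block, max_dissimilarity):
    """Step 3: single-linkage clustering of one block, cut at a threshold."""
    parent = {rec[0]: rec[0] for rec in block}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(block, 2):
        # Transform the pairwise match probability into a dissimilarity.
        dissimilarity = 1.0 - match_probability(a, b)
        if dissimilarity <= max_dissimilarity:
            parent[find(a[0])] = find(b[0])

    clusters = defaultdict(set)
    for rec in block:
        clusters[find(rec[0])].add(rec[0])
    return list(clusters.values())


def link_records(records, max_dissimilarity=0.3):
    """Run blocking, pairwise scoring, and clustering; return sets of record ids."""
    entities = []
    for block in block_records(records):
        entities.extend(link_block(block, max_dissimilarity))
    return entities


if __name__ == "__main__":
    for entity in link_records(RECORDS):
        print(sorted(entity))
```

Only pairs within the same block are ever compared, which is what keeps the comparison space manageable. The threshold of 0.3 is an arbitrary choice for the toy data; other linkage criteria (average or complete linkage, for example) would generally produce different clusters from the same dissimilarities.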