PC-filter: a Robust filtering technique for duplicate record detection in large databases

Zhang, Ji and Ling, Tok Wang and Bruckner, Robert and Liu, Han (2004) PC-filter: a Robust filtering technique for duplicate record detection in large databases. In: 15th International Conference on Database and Expert Systems Applications (DEXA'04), 30 August - 3 Sept 2004, Zaragoza, Spain.

Official URL: http://www.informatik.uni-trier.de/~ley/db/conf/dexa/index.html

Abstract

In this paper, we propose PC-Filter (PC stands for Partition Comparison), a robust data filter for approximately duplicate record detection in large databases. PC-Filter distinguishes itself from all existing methods by using the notion of partition in duplicate detection. It first sorts the whole database and splits the sorted database into a number of record partitions. The Partition Comparison Graph (PCG) is then constructed by performing fast partition pruning. Finally, duplicate records are detected by using internal and external partition comparison based on the PCG. Four properties, used as heuristics and based on the triangle inequality of record similarity, have been devised to make the filter remarkably efficient. PC-Filter is insensitive to the key used to sort the database, and can achieve a very good recall level, comparable to that of the pair-wise record comparison method, but with a complexity of only O(N^(4/3)). By equipping existing detection methods with PC-Filter, we are able to solve the "Key Selection" problem, the "Scope Specification" problem and the "Low Recall" problem that existing methods suffer from.
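
The sort, partition, prune and compare pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce the four properties or the O(N^(4/3)) bound from the paper. Everything in it is an assumption made for illustration: records are plain strings, similarity is Levenshtein edit distance (a metric, so the triangle inequality holds), each partition is summarised by a hypothetical "anchor" (its first record) and a "radius" (the largest distance from the anchor to any member), and the names find_duplicates, partition_size and max_dist are invented here.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; a metric, so the triangle inequality applies."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def find_duplicates(records, partition_size=3, max_dist=2):
    """Return the set of record pairs whose edit distance is <= max_dist."""
    # Step 1: sort the records and split them into fixed-size partitions.
    ordered = sorted(records)
    parts = [ordered[i:i + partition_size]
             for i in range(0, len(ordered), partition_size)]

    # Assumed partition summary: anchor = first record, radius = largest
    # distance from the anchor to any record in the same partition.
    anchors = [p[0] for p in parts]
    radii = [max(edit_distance(p[0], r) for r in p) for p in parts]

    # Step 2: build a partition comparison graph. By the triangle inequality,
    # d(x, y) >= d(anchor_i, anchor_j) - radius_i - radius_j for any x in
    # partition i and y in partition j, so a pair of partitions whose lower
    # bound exceeds max_dist cannot contain duplicates and is pruned.
    edges = [(i, j) for i, j in combinations(range(len(parts)), 2)
             if edit_distance(anchors[i], anchors[j]) - radii[i] - radii[j]
             <= max_dist]

    dups = set()
    # Step 3a: internal comparison, within each partition.
    for p in parts:
        for x, y in combinations(p, 2):
            if edit_distance(x, y) <= max_dist:
                dups.add(tuple(sorted((x, y))))
    # Step 3b: external comparison, only across partitions linked in the graph.
    for i, j in edges:
        for x in parts[i]:
            for y in parts[j]:
                if edit_distance(x, y) <= max_dist:
                    dups.add(tuple(sorted((x, y))))
    return dups
```

Because the pruning rule only discards partition pairs that provably contain no pair within max_dist, this sketch returns the same pairs as exhaustive pair-wise comparison. How much work it saves depends on the data and on how the partitions are summarised, which is where the paper's heuristics would come in.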

Item Type: Conference or Workshop Item (Commonwealth Reporting Category E) (Paper)
Additional Information: Author's version deposited in accordance with the copyright policy of the publisher. The original publication is available at www.springerlink.com
Uncontrolled Keywords: PC-filter; partition comparison filter; duplicate record detection
Fields of Research (FOR2008): 08 Information and Computing Sciences > 0801 Artificial Intelligence and Image Processing > 080109 Pattern Recognition and Data Mining
Subjects: 280000 Information, Computing and Communication Sciences
Socio-Economic Objective (SEO2008): UNSPECIFIED
ID Code: 5653
Deposited By:
Deposited On: 10 Sep 2009 09:21
Last Modified: 21 Jan 2010 11:39
