Efficient and effective filtering of duplication detection in large database applications

Zhang, Ji (2012) Efficient and effective filtering of duplication detection in large database applications. Journal of Software, 7 (11). pp. 2424-2436. ISSN 1796-217X


This paper proposes PC-Filter (PC stands for partition comparison), a robust filtering technique for effective and efficient duplicate record detection in large databases. PC-Filter differs from all existing methods in that it uses record partitions for duplicate detection. It operates in three steps. First, it sorts the whole database and splits the sorted database into a number of record partitions. Second, it generates the Partition Comparison Graph (PCG) by performing fast partition pruning. Finally, it detects duplicate records through internal and external partition comparison based on the PCG. Four closure properties, derived from the triangle inequality of record similarity and used as heuristics, are devised to make the filter efficient. The partition size is chosen so that the time complexity of PC-Filter is optimized. Equipping existing detection methods with PC-Filter addresses the major problems from which those methods suffer.
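
The three steps described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. The abstract does not define the four closure properties, the pruning rule, or the partition-size formula, so everything below is an assumption made for illustration: records are plain strings, similarity is measured by Levenshtein edit distance (chosen because it is a true metric, so the triangle inequality holds), each partition is summarized by a hypothetical anchor (its first record) and radius, and the names `edit_distance`, `pc_filter_sketch`, `partition_size`, and `threshold` are invented here.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed similarity measure; it is a metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def pc_filter_sketch(records, partition_size=3, threshold=1):
    """Return pairs of records whose edit distance is <= threshold."""
    # Step 1: sort the records and split them into fixed-size partitions.
    ordered = sorted(records)
    parts = [ordered[i:i + partition_size]
             for i in range(0, len(ordered), partition_size)]

    # Hypothetical partition summary: anchor = first record,
    # radius = largest distance from the anchor to any member.
    anchors = [p[0] for p in parts]
    radii = [max(edit_distance(p[0], r) for r in p) for p in parts]

    # Step 2: build a partition comparison graph by pruning.
    # By the triangle inequality, any cross-partition pair (x, y) satisfies
    # d(x, y) >= d(anchor_i, anchor_j) - radius_i - radius_j,
    # so a partition pair whose lower bound exceeds the threshold is skipped.
    graph = set()
    for i, j in combinations(range(len(parts)), 2):
        bound = edit_distance(anchors[i], anchors[j]) - radii[i] - radii[j]
        if bound <= threshold:
            graph.add((i, j))

    duplicates = set()
    # Step 3a: internal comparison within each partition.
    for p in parts:
        for x, y in combinations(p, 2):
            if edit_distance(x, y) <= threshold:
                duplicates.add((x, y))
    # Step 3b: external comparison, only along edges of the graph.
    for i, j in graph:
        for x in parts[i]:
            for y in parts[j]:
                if edit_distance(x, y) <= threshold:
                    duplicates.add((x, y) if x <= y else (y, x))
    return duplicates
```

Because the pruning bound in this sketch follows from the triangle inequality, skipping a partition pair cannot discard a true duplicate under the assumed metric. The paper's closure properties may prune more aggressively than this single bound.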

Item Type: Article (Commonwealth Reporting Category C)
Refereed: Yes
Item Status: Live Archive
Additional Information: This is an open access title. Awaiting copyright permission to make Full Text available.
Faculty / Department / School: Historic - Faculty of Sciences - Department of Maths and Computing
Date Deposited: 02 Jan 2013 23:22
Last Modified: 01 Feb 2017 00:00
Uncontrolled Keywords: filtering; duplicate record detection; database management; pattern recognition
Fields of Research: 08 Information and Computing Sciences > 0801 Artificial Intelligence and Image Processing > 080109 Pattern Recognition and Data Mining
Identification Number or DOI: 10.4304/jsw.7.11.2424-2436
URI: http://eprints.usq.edu.au/id/eprint/22597
