On memory and I/O efficient duplication detection for multiple self-clean data sources

Zhang, Ji and Shu, Yanfeng and Wang, Hua (2010) On memory and I/O efficient duplication detection for multiple self-clean data sources. In: DASFAA 2010: 15th International Conference on Database Systems for Advanced Applications, 1-4 Apr 2010, Tsukuba, Japan.

Abstract

In this paper, we propose efficient algorithms for duplicate detection across multiple data sources that are themselves duplicate-free. In developing these algorithms, we take full account of the possible cases that arise from the workload of the data sources to be cleaned and the available memory. The algorithms are memory and I/O efficient: they reduce the number of pairwise record comparisons and minimize the total page access cost of the cleaning process. Experimental evaluation demonstrates that the proposed algorithms are efficient and outperform the Sorted Neighborhood Method (SNM) and random access methods.
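
As a rough illustration of why duplicate-free sources reduce the comparison workload, the sketch below compares records only across different sources and skips every within-source pair. It is not one of the paper's algorithms and it ignores memory and page access costs entirely. The function names, the toy records, and the exact-equality predicate are all assumptions made for this example.

```python
from itertools import combinations


def cross_source_pairs(sources):
    """Yield (record_a, record_b) pairs drawn from two different sources.

    Assumes every source is already duplicate-free, so pairs from the
    same source are never generated.
    """
    for source_a, source_b in combinations(sources, 2):
        for record_a in source_a:
            for record_b in source_b:
                yield record_a, record_b


def find_duplicates(sources, is_duplicate):
    """Return the cross-source pairs flagged by a caller-supplied predicate."""
    return [pair for pair in cross_source_pairs(sources) if is_duplicate(*pair)]


# Toy data (hypothetical): three sources, each free of internal duplicates.
sources = [
    ["alice", "bob"],
    ["bob", "carol"],
    ["carol", "dave", "alice"],
]

# Exact equality stands in for a real record-similarity test.
duplicates = find_duplicates(sources, lambda a, b: a == b)

# Compare the naive pair count with the cross-source pair count.
total_records = sum(len(s) for s in sources)
all_pairs = total_records * (total_records - 1) // 2
cross_pairs = sum(len(a) * len(b) for a, b in combinations(sources, 2))

print(duplicates)
print(f"all pairs: {all_pairs}, cross-source pairs: {cross_pairs}")
```

On this toy input the naive approach would examine 21 pairs, while the cross-source enumeration examines 16. The saving grows as individual sources get larger relative to the total number of records.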


Item Type: Conference or Workshop Item (Commonwealth Reporting Category E) (Paper)
Refereed: Yes
Item Status: Live Archive
Additional Information: Author version not held. Published version unable to be displayed. Series Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): volume 6193
Depositing User: Dr Hua Wang
Faculty / Department / School: Historic - Faculty of Sciences - Department of Maths and Computing
Date Deposited: 20 Oct 2010 00:24
Last Modified: 02 Jul 2013 23:59
Uncontrolled Keywords: access cost; cleaning process; data source; duplicate detection; duplication detection; efficient algorithm; experimental evaluation; multiple data sources; random access
Fields of Research (FOR2008): 08 Information and Computing Sciences > 0806 Information Systems > 080604 Database Management
08 Information and Computing Sciences > 0803 Computer Software > 080303 Computer System Security
08 Information and Computing Sciences > 0803 Computer Software > 080309 Software Engineering
Socio-Economic Objective (SEO2008): E Expanding Knowledge > 97 Expanding Knowledge > 970108 Expanding Knowledge in the Information and Computing Sciences
Identification Number or DOI: doi: 10.1007/978-3-642-14589-6_14
URI: http://eprints.usq.edu.au/id/eprint/8485
