Zhang, Ji and Shu, Yanfeng and Wang, Hua (2010) On memory and I/O efficient duplication detection for multiple self-clean data sources. In: DASFAA 2010: 15th International Conference on Database Systems for Advanced Applications, 1-4 Apr 2010, Tsukuba, Japan.
In this paper, we propose efficient algorithms for detecting duplicates across multiple data sources that are themselves duplicate-free. In developing these algorithms, we fully consider the possible cases that arise from the workload of the data sources to be cleaned and the available memory. The algorithms are memory and I/O efficient: they reduce the number of pair-wise record comparisons and minimize the total page access cost of the cleaning process. Experimental evaluation demonstrates that the proposed algorithms are efficient and outperform the sorted neighborhood method (SNM) and random access methods.
|Item Type:||Conference or Workshop Item (Commonwealth Reporting Category E) (Paper)|
|Additional Information:||Author version not held. Published version unable to be displayed. Series Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): volume 6193|
|Uncontrolled Keywords:||access cost; cleaning process; data source; duplicate detection; duplication detection; efficient algorithm; experimental evaluation; multiple data sources; random access|
|Depositing User:||Dr Hua Wang|
|Date Deposited:||20 Oct 2010 00:24|
|Last Modified:||02 Jul 2013 23:59|