Khan, Atikur R. and Kabir, Enamul ORCID: https://orcid.org/0000-0002-6157-2753
(2021)
Resampling methods for generating continuous multivariate synthetic data for disclosure control.
Journal of Data, Information and Management, 3.
pp. 225-235.
ISSN 2524-6356
Abstract
Sharing microdata within or outside an organization may lead to the disclosure of sensitive information about an individual. Data stewarding organizations often disseminate synthetic data to reduce the likelihood of such disclosure. Synthetic data can be generated from posterior predictive distributions; however, finding a distribution in multidimensional space is not straightforward. If a distribution function is correctly estimated, synthetic data generated from the estimated distribution will hold all statistical properties of the original data. In practice, distribution functions are unknown, and estimating a distribution function under some assumptions may result in a synthetic data set that does not hold the statistical properties of the original data. This paper develops synthetic data generating methods based on resampling from singular vectors and eigenvalues, without requiring estimation of a posterior predictive distribution function for the data matrix. The methods developed in this paper have been implemented to generate continuous multivariate synthetic data, and their performance is studied by comparing disclosure risk and information loss measures. A rectangular cuboid is also constructed from the lower quartiles of the information loss and disclosure risk measures, and selecting synthetic data from this rectangular cuboid is found to reduce the disclosure risk and information loss of these methods further.
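The abstract does not spell out the algorithms, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes one simple resampling scheme (drawing each left singular vector's entries independently with replacement) and a lower-quartile filter over per-candidate measures. The function names `synthesize_by_svd_resampling` and `select_in_lower_quartile_box`, and the layout of the `measures` array, are hypothetical.

```python
import numpy as np


def synthesize_by_svd_resampling(X, rng=None):
    """Sketch: build a synthetic data matrix by resampling left singular vectors.

    Assumption (not from the paper): each column of U is resampled
    independently with replacement, which breaks the row-to-record link
    while only approximately preserving the covariance structure.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)

    # Thin SVD of the centred data: X - mean = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)

    n = X.shape[0]
    U_new = np.column_stack(
        [rng.choice(U[:, j], size=n, replace=True) for j in range(U.shape[1])]
    )

    # Recombine the resampled vectors with the original singular values.
    return mean + (U_new * s) @ Vt


def select_in_lower_quartile_box(measures):
    """Sketch: keep candidates whose every measure is at or below its lower quartile.

    `measures` is a hypothetical (candidates x measures) array, e.g. three
    columns of information loss / disclosure risk scores, so the kept
    region is a rectangular cuboid anchored at the lower quartiles.
    """
    measures = np.asarray(measures, dtype=float)
    q1 = np.quantile(measures, 0.25, axis=0)
    keep = np.all(measures <= q1, axis=1)
    return np.flatnonzero(keep)
```

In use, one might generate many candidate synthetic data sets with different seeds, score each with whatever information loss and disclosure risk measures are in use, and pass the resulting score table to `select_in_lower_quartile_box` to pick the candidates to release.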
Item Type: | Article (Commonwealth Reporting Category C) |
---|---|
Refereed: | Yes |
Item Status: | Live Archive |
Additional Information: | Permanent restricted access to Published version in accordance with the copyright policy of the publisher. |
Faculty/School / Institute/Centre: | Historic - Faculty of Health, Engineering and Sciences - School of Sciences (6 Sep 2019 - 31 Dec 2021) |
Date Deposited: | 06 Oct 2021 05:59 |
Last Modified: | 08 Nov 2021 01:25 |
Uncontrolled Keywords: | synthetic data; disclosure control |
Fields of Research (2008): | 08 Information and Computing Sciences > 0801 Artificial Intelligence and Image Processing > 080199 Artificial Intelligence and Image Processing not elsewhere classified |
Fields of Research (2020): | 46 INFORMATION AND COMPUTING SCIENCES > 4605 Data management and data science > 460599 Data management and data science not elsewhere classified |
Socio-Economic Objectives (2008): | E Expanding Knowledge > 97 Expanding Knowledge > 970108 Expanding Knowledge in the Information and Computing Sciences |
Socio-Economic Objectives (2020): | 22 INFORMATION AND COMMUNICATION SERVICES > 2299 Other information and communication services > 229999 Other information and communication services not elsewhere classified |
Identification Number or DOI: | https://doi.org/10.1007/s42488-021-00054-2 |
URI: | http://eprints.usq.edu.au/id/eprint/43745 |