Resampling methods for generating continuous multivariate synthetic data for disclosure control

Khan, Atikur R. and Kabir, Enamul ORCID: https://orcid.org/0000-0002-6157-2753 (2021) Resampling methods for generating continuous multivariate synthetic data for disclosure control. Journal of Data, Information and Management, 3. pp. 225-235. ISSN 2524-6356


Abstract

Sharing microdata within or outside an organization may disclose sensitive information about individuals. Data stewarding organizations often disseminate synthetic data to reduce the likelihood of such disclosure. Synthetic data can be generated from posterior predictive distributions; however, finding a distribution in multidimensional space is not straightforward. If the distribution function is correctly estimated, synthetic data generated from the estimated distribution will hold all statistical properties of the original data. In practice, distribution functions are unknown, and estimating a distribution function under some assumptions may result in a synthetic data set that does not hold the statistical properties of the original data. This paper develops synthetic data generating methods based on resampling from singular vectors and eigenvalues, without requiring estimation of a posterior predictive distribution function for the data matrix. The methods have been implemented to generate continuous multivariate synthetic data, and their performance is studied by comparing disclosure risk and information loss measures. A rectangular cuboid is also constructed from the lower quartiles of the information loss and disclosure risk measures, and selecting synthetic data from this cuboid is found to further reduce the disclosure risk and information loss of these methods.
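
The abstract does not spell out the resampling procedure, so the following is only a minimal sketch of one way the idea could look, not the authors' algorithm. It assumes a bootstrap of the rows of the left singular vectors, followed by centering and QR re-orthonormalization, with the singular values and right singular vectors kept unchanged. The function name `svd_resample_synthetic` and the toy data are hypothetical.

```python
import numpy as np


def svd_resample_synthetic(X, rng):
    """Illustrative only: build a synthetic matrix by resampling left singular vectors.

    Assumption (not taken from the paper): rows of U are bootstrapped, then
    centered and re-orthonormalized so that the column means and the sample
    covariance of the original data are reproduced.
    """
    X = np.asarray(X, dtype=float)
    n, _ = X.shape
    mean = X.mean(axis=0)

    # Thin SVD of the centered data: X - mean = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)

    # Bootstrap the rows of U (sampling with replacement)
    idx = rng.integers(0, n, size=n)
    U_b = U[idx]

    # Center the columns, then restore orthonormality with a reduced QR
    U_b = U_b - U_b.mean(axis=0)
    Q, _ = np.linalg.qr(U_b)

    # Recombine with the original singular values and right singular vectors
    return mean + (Q * s) @ Vt


# Toy continuous multivariate data (hypothetical, for demonstration)
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 1.0]])
X = rng.normal(size=(200, 3)) @ mixing + np.array([10.0, -2.0, 5.0])

synth = svd_resample_synthetic(X, rng)

# Column means and the sample covariance match; individual records do not
print(np.allclose(synth.mean(axis=0), X.mean(axis=0)))
print(np.allclose(np.cov(synth, rowvar=False), np.cov(X, rowvar=False)))
```

Under these assumptions the synthetic matrix keeps the first two moments of the original data while changing the individual records. That alone says nothing about disclosure risk, which would still need to be measured separately alongside information loss, as the abstract describes.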


Item Type: Article (Commonwealth Reporting Category C)
Refereed: Yes
Item Status: Live Archive
Additional Information: Permanent restricted access to Published version in accordance with the copyright policy of the publisher.
Faculty/School / Institute/Centre: Historic - Faculty of Health, Engineering and Sciences - School of Sciences (6 Sep 2019 - 31 Dec 2021)
Date Deposited: 06 Oct 2021 05:59
Last Modified: 05 Aug 2022 01:55
Uncontrolled Keywords: synthetic data; disclosure control
Fields of Research (2008): 08 Information and Computing Sciences > 0801 Artificial Intelligence and Image Processing > 080199 Artificial Intelligence and Image Processing not elsewhere classified
Fields of Research (2020): 46 INFORMATION AND COMPUTING SCIENCES > 4605 Data management and data science > 460599 Data management and data science not elsewhere classified
Socio-Economic Objectives (2008): E Expanding Knowledge > 97 Expanding Knowledge > 970108 Expanding Knowledge in the Information and Computing Sciences
Socio-Economic Objectives (2020): 22 INFORMATION AND COMMUNICATION SERVICES > 2299 Other information and communication services > 229999 Other information and communication services not elsewhere classified
Identification Number or DOI: https://doi.org/10.1007/s42488-021-00054-2
URI: http://eprints.usq.edu.au/id/eprint/43745
