“面向高质量共享的科学数据安全”专刊(上)

面向高维数据发布的差分隐私算法及应用综述

展开
  • 1.中国科学院计算机网络信息中心,北京 100083
    2.中国科学院大学,北京 100190
龙春,E-mail:longchun@cnic.cn
*

收稿日期: 2024-01-30

  录用日期: 2024-06-03

  网络出版日期: 2024-07-03

基金资助

国家重点研发计划:金融数据全周期流转安全风险评估监测与溯源技术研究(2023YFC3304704);中国科学院网络安全和信息化专项(CAS- WX2022GC-04);中国科学院青年创新促进会项目(2022170)

Survey of Differential Privacy Algorithms and Applications for High- Dimensional Data Publishing

Expand
  • 1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
    2. University of Chinese Academy of Sciences, Beijing 100049, China

Received date: 2024-01-30

  Accepted date: 2024-06-03

  Online published: 2024-07-03

摘要

随着大数据和机器学习技术的进一步发展,处理具有几十上百维特征的复杂结构和关系且蕴含丰富语义信息的高维数据成为一项挑战。在保障个人隐私不被泄露的前提下,如何安全地使用这些高维数据,成为当前的一个重要话题。我们查阅资料发现:关于差分隐私技术本身的综述很多,但是面向高维数据发布的差分隐私算法及应用的综述却很少。基于此,本文通过对差分隐私在高维数据领域的应用进行综述,深入了解不同方法在保护高维数据隐私方面的优劣,并指导面向高维数据发布的差分隐私算法未来研究的方向,从而更好地应对隐私保护和数据分析的挑战。本文首先介绍了差分隐私的原理和特性,总结了当前差分隐私技术本身的研究工作。然后从数据降维和数据合成两个角度分析了差分隐私在高维数据环境中的应用,探讨了差分隐私面临的问题和挑战,并提出了初步的解决方法,旨在更好地解决当前高维数据保护和使用的问题。最后,本文提出了未来可能的研究方向以促进技术交流,推动差分隐私在高维数据应用中的进一步突破。

本文引用格式

龙春, 秦泽秀, 李丽莎, 李婧, 杨帆, 魏金侠, 付豫豪 . 面向高维数据发布的差分隐私算法及应用综述[J]. 农业大数据学报, 2024 , 6(2) : 170 -184 . DOI: 10.19788/j.issn.2096-6369.200001

Abstract

With the further development of big data and machine learning technologies, handling high-dimensional data with complex structures, relationships, and rich semantic information containing dozens to hundreds of features has become a challenge. Safely utilizing such high-dimensional data, while ensuring the privacy of individuals, has become a significant topic today. Upon reviewing existing literature, we found numerous reviews on differential privacy technology itself, but few on the algorithms and applications of differential privacy specifically tailored for high-dimensional data. Therefore, this paper provides a review of the application of differential privacy in the field of high-dimensional data, aiming to delve into the strengths and weaknesses of different methods in protecting the privacy of high-dimensional data and to guide future research directions for differential privacy algorithms tailored for high-dimensional data publishing. Firstly, this paper introduces the principles and characteristics of differential privacy, summarizing the current research work on the technology itself. Then, it analyzes the application of differential privacy in high-dimensional data environments from the perspectives of data dimensionality reduction and data synthesis, discussing the challenges and issues faced by differential privacy and proposing preliminary solutions to better address the issues of privacy protection and data analysis in the current high-dimensional data landscape. Lastly, potential future research directions are proposed to facilitate technological exchange and further advancements in the application of differential privacy in high-dimensional data settings.

参考文献

[1] Zeng D D, Liu Y, Yan P, et al. Location-aware real-time recommender systems for Brick-and-Mortar Retailers[J]. INFORMS Journal on Computing, 2021, 33:1608-1623. https://doi.org/10.1287/ijoc.2020.1020.
[2] The EU General Data Protection Regulation (GDPR). [EB/OL].. https://eur-lex.europa.eu/eli/reg/2016/679/oj.
[3] The California Consumer Privacy Act (CCPA)[EB/OL]. https://cdp.cooley.com/ccpa-2018/.
[4] Data Security Law of the People's Republic of China[EB/OL]. [2021-06-13]. https://www.gov.cn/xinwen/2021-06/11/content_5616919.htm.
[5] Dwork C, McSherry F, Nissim K, et al. Calibrating Noise to Sensitivity in Private Data Analysis[C]. Theory of Cryptography Conference, Lecture Notes in Computer Science, 2006. https://doi.org/10.1007/11681878_14.
[6] Erlingsson ú, Korolova A, Pihur V. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response[C]. Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, 2014. https://doi.org/10.1145/2660267.2660348.
[7] Ding X, Wang C, Choo K R, et al. A novel privacy preserving framework for large scale graph data publishing[J]. IEEE Transactions on Knowledge and Data Engineering, 2019, 33: 331-343. https://doi.org/10.1109/TKDE.2019.2931903.
[8] Draft NIST Privacy Framework: A Tool for Improving Privacy through Enterprise Risk Management. [EB/OL]. https://www.nist.gov/system/files/documents/2020/01/16/NIST%20Privacy%20Framework_V1.0.pdf.
[9] Li X Y, Sun Z L, Deng J B, et al. A comprehensive review of privacy protection technologies[J]. Computer Science, 2013, 40(S2): 199-202.
[10] Li Y, Wen W, Xie G Q. A review of differential privacy protection research[J]. Journal of Computer Applications Research, 2012, 29(9): 3201-3205+3211.
[11] Zhao Y Q, Yang M. A review of research progress on differential privacy[J]. Journal of Computer Science, 2023, 50(4): 265-276.
[12] Gao Z Q, Wang Y T. Research progress on differential privacy techniques[J]. Journal of Communications, 2017, 38(S1): 151-155.
[13] Ye Q Q, Meng X F, Zhu M J, et al. A review of local differential privacy research[J]. Journal of Software, 2018, 29(7): 1981-2005.
[14] Liu J X, Meng X F. A review of privacy protection in machine learning[J]. Journal of Computer Research and Development, 2020, 57(2): 346-362.
[15] Ouadrhiri A E, Abdelhadi A M. Differential privacy for deep and federated learning: a survey[J]. IEEE Access, 2022, 10: 22359-22380. https://ieeexplore.ieee.org/document/9714350.
[16] Kong Y T, Tan F X, Zhao X, et al. A review of research on optimization of k-means algorithm based on differential privacy[J]. Journal of Computer Science, 2022, 49(2): 162-173.
[17] Wang T, Huo Z, Huang Y X, et al. A review of privacy protection technologies in federated learning[J]. Journal of Computer Applications, 2023, 43(2): 437-449.
[18] Narayanan A, Shmatikov V. Robust De-anonymization of Large Sparse Datasets[C]. 2008 IEEE Symposium on Security and Privacy (sp 2008), 2008. https://ieeexplore.ieee.org/document/4531148.
[19] Ouadrhiri A E, Abdelhadi A M. Differential privacy for deep and federated learning: a survey[EB/OL]. IEEE Access, 2022, 10: 22359-22380. https://ieeexplore.ieee.org/document/9714350.
[20] Chu X J. Research on High-Dimensional Data Publishing Method Meeting Local Differential Privacy[D]. Guizhou University, 2022. DOI: 10.27047/d.cnki.ggudu.2022.002172.
[21] Amaratunga D, Cabrera J, Shkedy Z. Exploration and Analysis of DNA Microarray and Other High-Dimensional Data (2nd edition)[EB/OL]. https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781118364505. fmatter.
[22] Sweeney L. k-Anonymity: A model for protecting privacy[J]. Int. J. Uncertain. Fuzziness Knowl. Based Syst., 2002, 10: 557-570. https://doi.org/10.1142/S0218488502001648.
[23] Cai M G, Shen G H, Huang Z Q, et al. High-dimensional data publishing method under local differential privacy[J]. Journal of Computer Science, 2024, 51(2): 322-332.
[24] Zhang X, Chen H. A review of high-dimensional data publishing with differential privacy[J]. Journal of Intelligent Systems, 2021, 16(6): 989-998.
[25] Warner S L. Randomized response: a survey technique for eliminating evasive answer bias[J]. Journal of the American Statistical Association, 1965, 60(309): 63-6.. Warner-Randomized-Response-2283137.pdf (ncsu.edu).
[26] McSherry F, Talwar K. Mechanism Design via Differential Privacy[C]. 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07), 2007. https://doi.org/10.1109/FOCS.2007.41.
[27] Kairouz P, Oh S, Viswanath P. Extremal Mechanisms for Local Differential Privacy[C]. J. Mach. Learn. Res., 2016. https://jmlr.org/papers/v17/15-135.html.
[28] Kairouz P, Bonawitz K A, Ramage D. Discrete Distribution Estimation under Local Privacy[C]. International Conference on Machine Learning. 2016. https://proceedings.mlr.press/v48/kairouz16.html.
[29] Ma X, Liu H, Guan S. Improving the Effect of Frequent Itemset Mining with Hadamard Response under Local Differential Privacy[C]. 2021 IEEE 20th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), 2021. https://doi.org/10.1109/TrustCom53373.2021.00072.
[30] Kikuchi H. Castell: Scalable Joint Probability Estimation of Multi-dimensional Data Randomized with Local Differential Privacy[EB/OL]. https://arxiv.org/abs/2212.01627.
[31] Murakami T, Kawamoto Y. Utility-Optimized Local Differential Privacy Mechanisms for Distribution Estimation[C]. USENIX Security Symposium. 2019. https://www.usenix.org/conference/usenixsecurity19/presentation/murakami.
[32] Duchi J C, Wainwright M J, Jordan M I. Minimax optimal procedures for locally private estimation[J]. Journal of the American Statistical Association, 2016, 113: 182 - 201. https://arxiv.org/abs/1604.02390.
[33] Wang N, Xiao X, Yang Y D, et al. Collecting and Analyzing Multidimensional Data with Local Differential Privacy[C]. 2019 IEEE 35th International Conference on Data Engineering (ICDE), 2019. https://ieeexplore.ieee.org/document/8731512/.
[34] Li W, Zhang X, Li X, et al. PPDP-PCAO: An efficient high-dimensional data releasing method with differential privacy protection[J]. IEEE Access, 2019, 7: 176429-176437. https://ieeexplore.ieee.org/document/8924645.
[35] Chaudhuri K, Sarwate A D, Sinha K. A near-optimal algorithm for differentially-private principal components[J]. Journal of Machine Learnning Research, 2013, 14: 2905-2943. https://dl.acm.org/doi/10.5555/2567709.2567754.
[36] Jiang X, Ji Z, Wang S, et al. Differential-Private Data Publishing Through Component Analysis[J]. Transactions on data privacy, 2013, 6(1): 19-34. https://www.tdp.cat/issues11/abs.a109a12.php.
[37] Yang J, Li Y. Differentially private feature selection[C]// 2014 International Joint Conference on Neural Networks (IJCNN), 2014. https://www.sciencedirect.com/science/article/pii/S1877050914010412?via%3Dihub.
[38] Zhang J, Cormode, Procopiuc C M, et al. PrivBayes: private data release via bayesian networks[C]. Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 2014. https://dl.acm.org/doi/10.1145/2588555.2588573.
[39] Li M, Ma X. Bayesian Networks-Based Data Publishing Method Using Smooth Sensitivity[C]. 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/ BDCloud/SocialCom/SustainCom), 2018. https://ieeexplore.ieee.org/xdocument/8672292.
[40] Cheng X, Tang P, Su S, et al. Multi-Party High-Dimensional Data Publishing Under Differential Privacy[J]. IEEE Transactions on Knowledge and Data Engineering, 2020, 32: 1557-1571. https://ieeexplore.ieee.org/document/8673599/.
[41] Lu X, Piao C, Han J. Differential Privacy High-dimensional Data Publishing Method Based on Bayesian Network[C]. 2022 International Conference on Computer Engineering and Artificial Intelligence (ICCEAI), 2022. https://ieeexplore.ieee.org/document/9853392.
[42] Wei F., Zhang W, Chen Y, et al. Differentially Private High- Dimensional Data Publication via Markov Network[C]. Security and Privacy in Communication Networks. 2018. https://link.springer.com/chapter/10.1007/978-3-030-01701-9_8.
[43] Ren X, Yu C, Yu W, et al. LoPub: High-Dimensional Crowdsourced Data Publication With Local Differential Privacy[J]. IEEE Transactions on Information Forensics and Security, 2018, 13(9): 2151-2166. https://ieeexplore.ieee.org/document/8306916.
[44] Liu G, Tang P, Hu C, et al. Multi-Dimensional Data Publishing With Local Differential Privacy[C]. International Conference on Extending Database Technology. 2023. https://openproceedings.org/2023/conf/edbt/paper-210.pdf.
[45] Ray P, Reddy S S, Banerjee T S. Various dimension reduction techniques for high dimensional data analysis: a review[J]. Artificial Intelligence Review, 2021, 54: 3473-3515. https://link.springer.com/article/10.1007/s10462-020-09928-0.
[46] Chen Y. Research on health and medical data sharing and personal information protection[J]. Journal of Intelligence, 2023, 42(5): 192-199.
[47] Bai W T, Chen L X. A health medical data protection scheme based on differential privacy[J]. Computer Applications and Software, 2022, 39 (8): 304-311.
[48] Zhang S, Li X. Differential privacy medical data publishing method based on attribute correlation[J]. Scientific Reports, 2022, 12. https://www.nature.com/articles/s41598-022-19544-3.pdf.
[49] Rong J. An electronic medical record data security risk monitoring system based on differential privacy protection[J]. Automation Technology and Applications, 2022, 41(12): 169-172.
[50] Tan L. Theory and Application of Dimensionality Reduction for High- dimensional Data[D]. National University of Defense Technology, 2005.
[51] Yuan K, Cheng Y. Data risks of financial technology and its prevention and control strategies[J]. Journal of Beijing University of Aeronautics and Astronautics (Social Sciences Edition), 2023, 36(2): 46-58.
[52] Zhu Z W, Zhang X. A brief analysis of privacy computing applications in the financial field[J]. Research on Financial Development, 2023(3): 90-92.
[53] Deng W, Chen X T, Zhang Q H, et al. Differential privacy protection algorithm based on Tree Models[J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2020, 32(5): 848-856.
[54] Byrd D, Polychroniadou A. Differentially private secure multi-party computation for federated learning in financial applications[C]. Proceedings of the First ACM International Conference on AI in Finance. 2020. https://dl.acm.org/doi/10.1145/3383455.3422562.
文章导航

/