VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
LẠI TRUNG MINH ĐỨC
PRESERVING PRIVACY FOR
PUBLISHING TIME-SERIES DATA
WITH DIFFERENTIAL PRIVACY
Major: COMPUTER SCIENCE
Major code: 8480101
MASTER’S THESIS
HO CHI MINH CITY, July 2023
THIS THESIS IS COMPLETED AT
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM
Supervisor 1: Assoc. Prof. DANG TRAN KHANH
Supervisor 2: Dr. LE LAM SON
Examiner 1: Assoc. Prof. TRAN TUAN DANG
Examiner 2: Dr. DANG TRAN TRI
This master's thesis was defended at Ho Chi Minh City University of Technology,
VNU-HCM, on 10 July 2023.
Master’s Thesis Committee:
1. Chairman - Assoc. Prof. TRAN MINH QUANG
2. Secretary - Dr. NGUYEN THI AI THAO
3. Examiner 1 - Assoc. Prof. TRAN TUAN DANG
4. Examiner 2 - Dr. DANG TRAN TRI
5. Commissioner - Dr. TRUONG THI AI MINH
Approval of the Chairman of the Master's Thesis Committee and the Dean of the
Faculty of Computer Science and Engineering after the thesis has been corrected (if any).
CHAIRMAN OF THESIS COMMITTEE
HEAD OF FACULTY OF
COMPUTER SCIENCE AND ENGINEERING
VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom - Happiness
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
THE TASK SHEET OF MASTER’S THESIS
Full name: LẠI TRUNG MINH ĐỨC                 Student ID: 2070686
Date of birth: 24 May 1996                    Place of birth: Ho Chi Minh City
Major: Computer Science                       Major ID: 8480101
I. THESIS TITLE:
Preserving Privacy for Publishing Time-series Data with Differential Privacy
(Bảo vệ quyền riêng tư cho việc công bố dữ liệu chuỗi thời gian với quyền riêng tư
khác biệt)
II. TASKS AND CONTENTS:
W1 to W2 (2 weeks):
- Conduct the literature review and define the methodology for the study.
- Define the scope of work for the main research of the thesis.
- Write up the report.

W3 to W4 (2 weeks):
- Research related works/projects on Differential Privacy and time-series privacy.
- Write up the report (cont.).

W5 to W14 (10 weeks):
- Implement the Differential Privacy algorithms on time-series data.
- Compare those algorithms using data utility metrics.
- Identify the data characteristics for choosing the best algorithms.
- Write up the report (cont.).

W15 to W16 (2 weeks):
- Finalize the solution package.
- Finalize the document.
- Prepare the presentation.
Thesis Submission – June 2023
III. THESIS START DATE: 05 September 2022
(According to the decision on assignment of the Master's thesis)
IV. THESIS COMPLETION DATE: 12 June 2023
(According to the decision on assignment of the Master's thesis)
V. SUPERVISORS (full name and academic rank):
Assoc. Prof. DANG TRAN KHANH – PGS. TS. ĐẶNG TRẦN KHÁNH
Dr. LE LAM SON – TS. LÊ LAM SƠN
Ho Chi Minh City, 09 June 2023
SUPERVISOR 1
(Full name and signature)
Assoc. Prof. DANG TRAN KHANH
SUPERVISOR 2
(Full name and signature)
Dr. LE LAM SON
CHAIR OF PROGRAM COMMITTEE
(Full name and signature)
DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING
(Full name and signature)
Note: The student must pin this task sheet as the first page of the Master's Thesis
booklet.
ACKNOWLEDGEMENT
I would like to express my profound gratitude to all those who have supported me
throughout the journey of completing this master's thesis.
First and foremost, I extend my heartfelt thanks to my supervisors, Assoc. Prof.
DANG TRAN KHANH and Dr. LE LAM SON, for your invaluable guidance,
patience, and expertise. Working under your supervision has been an honor, and your
insightful feedback and encouragement have been instrumental in shaping the
outcome of this research.
I would also like to extend my sincere appreciation to the team at Unilever Vietnam,
particularly Mr. ERIC FRANCIS CHEN – Head of UniOps and Data & Analytics
SEA&I, Mr. IAN LOW - my line manager, and the awesome Unilever Data &
Analytics Vietnam team. Your willingness to assist and your insightful discussions have
significantly contributed to the successful completion of this thesis.
Furthermore, I want to acknowledge my friends, classmates, doctors, and
psychologists for your mental support and contributions. Your motivation,
discussions, and treatment, offered with diverse perspectives, have helped me live
positively. Special appreciation goes to Ms. HOA NINH, Ms. LINH PHAM, Ms.
XUAN NGUYEN, Mr. RYAN NGUYEN for your huge support and encouragement
throughout this journey.
Lastly, I am indebted to my family: my mom, THAI LINH PHAM, and my sister,
MARY PHUC LAI, for your unconditional love and constant encouragement. Your sacrifices and
understanding have been the bedrock of my achievements. I am profoundly grateful
for your presence and support.
To all those mentioned above, as well as those who have contributed in immeasurable
ways, I offer my sincerest thanks. Your efforts, belief, and contributions have made
this thesis possible.
ABSTRACT
This thesis explores the crucial domain of data privacy, encompassing the rights of
individuals to maintain control over the collection, usage, sharing, and storage of their
personal data. Within the realm of personal data, time-series data poses distinct
challenges and sensitivities when it comes to privacy protection. Time-series data
comprises information with temporal attributes that can unveil patterns, trends, and
behaviors of individuals or groups over time, and carries inherent risks in terms of
privacy breaches.
The primary objectives of this thesis are as follows: first, to review traditional
methods of privacy-preserving data publishing, with a specific focus on efforts made
for protecting time-series data; second, to gain a comprehensive understanding of the
theories and principles of Differential Privacy, a promising approach for privacy
preservation; third, to explore notable mechanisms within Differential Privacy that
are applicable to time-series data; fourth, to investigate and address privacy
challenges in data partnerships through the integration of Differential Privacy and
other relevant techniques; and finally, to develop a process for the application of
privacy techniques within the context of business collaborations.
The contribution of this thesis is twofold. Firstly, it aims to make the concept of
Differential Privacy more accessible and comprehensible, particularly for non-academic and corporate audiences who may not have a deep technical background.
By presenting Differential Privacy in a clear and straightforward manner, this
research facilitates its adoption and implementation in real-world scenarios.
Secondly, this thesis proposes a guideline for the practical application and evaluation
of Differential Privacy on time-series data, specifically within the context of data
collaboration among multiple parties. The guideline serves as a valuable resource for
organizations seeking to protect privacy while engaging in collaborative data
initiatives.
TÓM TẮT LUẬN VĂN
Luận văn này nghiên cứu về lĩnh vực quan trọng của quyền riêng tư dữ liệu, bao gồm
quyền của cá nhân để duy trì sự kiểm soát về việc thu thập, sử dụng, chia sẻ và lưu
trữ dữ liệu cá nhân của họ. Trong lĩnh vực dữ liệu cá nhân, dữ liệu chuỗi thời gian
đặt ra những thách thức và nhạy cảm riêng biệt khi đến việc bảo vệ quyền riêng tư.
Dữ liệu chuỗi thời gian bao gồm thông tin có thuộc tính thời gian có thể tiết lộ các
mẫu, xu hướng và hành vi của cá nhân hoặc nhóm qua thời gian, và mang theo các
rủi ro về việc vi phạm quyền riêng tư.
Các mục tiêu chính của luận văn này như sau: thứ nhất, xem xét các phương pháp
truyền thống về việc xuất bản dữ liệu bảo vệ quyền riêng tư, với tập trung đặc biệt
vào các nỗ lực để bảo vệ dữ liệu chuỗi thời gian; thứ hai, để hiểu rõ về các lý thuyết
và nguyên tắc của Sự khác biệt về Quyền riêng tư, một phương pháp hứa hẹn để bảo
vệ quyền riêng tư; thứ ba, khám phá các cơ chế đáng chú ý trong Sự khác biệt về
Quyền riêng tư mà có thể áp dụng cho dữ liệu chuỗi thời gian; thứ tư, điều tra và đối
mặt với những thách thức về quyền riêng tư trong các đối tác dữ liệu thông qua việc
tích hợp Sự khác biệt về Quyền riêng tư và các kỹ thuật liên quan khác; và cuối cùng,
phát triển quy trình để áp dụng các kỹ thuật bảo vệ quyền riêng tư trong bối cảnh hợp
tác kinh doanh.
Đóng góp của luận văn này là hai phần. Thứ nhất, nó nhằm mục tiêu làm cho khái
niệm về Sự khác biệt về Quyền riêng tư trở nên dễ tiếp cận và dễ hiểu, đặc biệt là đối
với các đối tượng không học thuật và doanh nghiệp có thể không có nền tảng kỹ thuật
sâu. Bằng cách trình bày Sự khác biệt về Quyền riêng tư một cách rõ ràng và dễ hiểu,
nghiên cứu này hỗ trợ việc áp dụng và thực hiện nó trong các tình huống thực tế. Thứ
hai, luận văn này đề xuất một hướng dẫn cho việc áp dụng thực tiễn và đánh giá Sự
khác biệt về Quyền riêng tư trong dữ liệu chuỗi thời gian, đặc biệt trong bối cảnh hợp
tác dữ liệu giữa nhiều bên.
THE COMMITMENT OF THE THESIS AUTHOR
I hereby declare that this master's thesis is my own original work and has not been
submitted to any institution for assessment purposes before.
Furthermore, I have acknowledged all sources used and have cited them in the
reference section.
……………………………..
LAI TRUNG MINH DUC
………………………
Date
TABLE OF CONTENTS
CHAPTER 1: OVERVIEW OF THE THESIS........................................................... 1
1. Background and Context................................................................................... 1
2. Data Publishing and Privacy Preserving Data Publishing ................................ 2
3. Challenges of Privacy Preserving Data Publishing (PPDP) for Time-series data ........ 3
4. Differential Privacy as a powerful player ......................................................... 4
5. Thesis objectives ............................................................................................... 5
6. Thesis contributions .......................................................................................... 5
7. Thesis structure ................................................................................................. 5
CHAPTER 2: PRIVACY MODELS RESEARCH ...................................................... 7
1. Attack models and notable privacy models ...................................................... 7
1.1. Record linkage attack and k-Anonymity privacy model ........................... 7
1.2. Attribute linkage attack and l-diversity and t-closeness privacy model .... 8
1.3. Table linkage and δ-presence privacy model ........................................... 10
1.4. Probabilistic linkage and Differential Privacy model .............................. 11
2. Summary ......................................................................................................... 12
CHAPTER 3: THE INVESTIGATION ON DIFFERENTIAL PRIVACY ............. 14
1. The need for the Differential Privacy principle ............................................... 14
1.1. No need to model the attack model in detail ............................................ 14
1.2. Quantifiable privacy loss .......................................................................... 15
1.3. Multiple mechanisms composition ........................................................... 16
2. The promise (and not promised) of Differential Privacy ................................ 17
2.1. The promise .............................................................................................. 17
2.2. The not promise ........................................................................................ 18
2.3. Conclusion ................................................................................................ 18
3. Formal definition of Differential Privacy ....................................................... 18
3.1. Terms and notations .................................................................................. 19
3.2. Randomized algorithm .............................................................................. 20
3.3. ε-differential privacy ................................................................................ 20
3.4. (ε, δ)-differential privacy ......................................................................... 21
4. Important concepts of Differential Privacy ..................................................... 22
4.1. The sensitivity ........................................................................................... 23
4.2. Privacy composition ................................................................................. 24
4.3. Post processing ......................................................................................... 26
5. Foundation mechanisms of Differential Privacy ............................................. 26
5.1. Local Differential Privacy and Global Differential Privacy ..................... 26
5.2. Laplace mechanism ................................................................................... 27
5.3. Exponential mechanism ............................................................................ 29
6. Notable mechanisms for Time-series data ...................................................... 30
6.1. Laplace mechanism (LPA – Laplace Perturbation Algorithm) ................ 30
6.2. Discrete Fourier Transform (DFT) with Laplace mechanism (FPA – Fourier Perturbation Algorithm) ..................................................................... 31
6.3. Temporal perturbation mechanism ........................................................... 32
6.4. STL-DP – Perturbed time-series by applying DFT with Laplace mechanism on trends and seasonality ............................................................... 37
CHAPTER 4: EXPERIMENT DESIGNS ................................................................ 38
1. Experiment designs ......................................................................................... 38
1.1. Case description ....................................................................................... 38
1.2. Data structure aligns with data provider .................................................. 39
1.3. System alignments .................................................................................... 40
1.4. Concerns and constraints .......................................................................... 41
2. Problem analysis ............................................................................................. 41
2.1. Revisit the GDPR-related terms for data sharing ..................................... 41
2.2. Potential attack models and countermeasures .......................................... 44
2.3. Define scope of work ................................................................................ 45
3. Evaluation methodology ................................................................................. 46
3.1. Data utility ................................................................................................ 46
3.2. Privacy metrics ......................................................................................... 46
3.3. Evaluation process .................................................................................... 47
4. Privacy protection proposal ............................................................................ 47
CHAPTER 5: EXPERIMENT IMPLEMENTATIONS ........................................... 49
1. Experiment preparation ................................................................................... 49
2. Exploratory data analysis (EDA) ..................................................................... 49
2.1. Data overview ........................................................................................... 49
2.2. Descriptive Analysis ................................................................................. 50
2.3. Maximum data domain estimation ........................................................... 52
3. Differential Privacy mechanisms implementation .......................................... 54
4. Data perturbed evaluation ................................................................................ 61
4.1. RFM Analysis for dataset ......................................................................... 61
4.2. Forecasting trendline at categories, consumer-groups, and store level .... 65
4.3. Privacy evaluation .................................................................................... 73
4.4. Recommendation for using Differential Privacy in data partnership use-cases ........ 77
CHAPTER 6: CONCLUSION AND FUTURE WORKS ........................................ 78
REFERENCES .......................................................................................................... 80
APPENDIX ............................................................................................................... 83
Table: Descriptive Statistics table of the data perturbation output ........................... 83
Table: Accuracy result of RFM analysis between data perturbations and original data ........ 85
Table: Accuracy (RMSE) result from the forecast of data perturbations and original data ........ 88
Table: Accuracy (RMSE) result of the adjusted forecast version ............................. 90
TABLE OF TABLES
Table 1. Original Shopping data table structure ....................................................... 39
Table 2. Shopping Data table structure for the use-case ........................................... 40
Table 3. Data structure of the synthetic dataset ........................................................ 50
Table 4. Maximum domain value estimation for each Category .............................. 53
Table 5. Descriptive Statistics table of the data perturbation output (detail table in the Appendix) ........ 60
Table 6. Accuracy result of RFM analysis between data perturbations and original data (detail in Appendix) ........ 64
Table 7. Accuracy (RMSE) result from the forecast of data perturbations and original data (detail in Appendix) ........ 70
Table 8. Accuracy (RMSE) result of the adjusted forecast version (detail in Appendix) ........ 72
Table 9. Data Utility and Data Privacy improving ratio ........................................... 74
TABLE OF FIGURES
Figure 1. A fictitious database with hospital data, taken from [14] ............................ 8
Figure 2. Visualize how Sequential and Parallel Composition works - taken from [25] ........ 25
Figure 3. Visualize how Local Privacy and Global Privacy works - taken from [25] ........ 27
Figure 4. The Laplace mechanism with multiple scales ........................................... 28
Figure 5. The visualization process of LPA and DFT (or FPA) - taken from [19] .... 32
Figure 6. The visualization of Backward Perturbation mechanism (BPA) - taken from [4] ........ 34
Figure 7. The visualization of Forward Perturbation mechanism - taken from [4] .... 35
Figure 8. The cost analysis table for BPA and FPA method - taken from [4] ............ 36
Figure 9. The pseudo-code for Threshold Mechanism - taken from [4] ................... 36
Figure 10. The process of how STL-DP mechanism works - taken from [27] ........... 37
Figure 11. EDA - Fetch the synthetic dataset and display it in Python ..................... 49
Figure 12. EDA - Code block and display for dataset descriptive statistics ............. 50
Figure 13. EDA - Code block for dataset aggregation to find the TOP 5 high-value consumers ........ 51
Figure 14. EDA - Code block to visualize the purchase pattern over the year by week of the dataset (aggregation) ........ 52
Figure 15. Box-plot to visualize the quantity range - support for estimating the maximum data domain ........ 54
Figure 16. Code-block - Import related libraries and initialize Ray multiprocessing ........ 55
Figure 17. Code-block - Implementation of Laplace Perturbation Mechanism on time-series data ........ 56
Figure 18. Code-block - Implementation of Fourier Perturbation Mechanism on time-series data ........ 56
Figure 19. Code-block - Implementation of STL-DP Mechanism on time-series data ........ 57
Figure 20. Code-block - Setup the Epsilon and Maximum data domain values ........ 58
Figure 21. Code-block function to run the data perturbation process ....................... 59
Figure 22. Code-block for multi-processing run of the whole process ..................... 60
Figure 23. Code-block to implement RFM Analysis ................................................ 63
Figure 24. Code-block to measure accuracy of RFM clusters for data perturbation methods and the clusters of original data ........ 64
Figure 25. EDA - Code-block to calculate number of timeseries/groups to conduct the forecast in the next section ........ 66
Figure 26. EDA - Code-block to calculate number of timeseries/groups to conduct the forecast in the next section - Select only value > 300 days/year ........ 67
Figure 27. Code-block for building data lookup dictionary for RFM bins ............... 67
Figure 28. Code-block for running Simple Linear Regression to forecast, combined with the parameter M lookup table ........ 68
Figure 29. Code-block for RMSE calculation for the forecast output ...................... 69
Figure 30. Adjusted version code-block to run Simple Linear Regression on the data perturbations and original data ........ 71
Figure 31. Data Utility and Privacy trade-off (RMSE normal scale) ........................ 73
Figure 32. Data Utility and Privacy trade-off (RMSE Log10 scale) ......................... 74
Figure 33. EDA - Check the TOP 5 high-value consumers in multiple methods (Original data, FPA, tFPA) ........ 75
Figure 34. EDA - Plot Line chart to compare 3 methods and original volume of USER 944 ........ 76
Figure 35. EDA - Plot Line chart to compare 3 methods and original volume of USER 944 (by category) ........ 77
CHAPTER 1: OVERVIEW OF THE THESIS
1. Background and Context
Data privacy, also referred to as information privacy or data protection, encompasses
the fundamental right of individuals to exert control over the collection, usage,
sharing, and storage of their personal data by others. Personal data includes any
information that identifies or relates to an individual, including but not limited to their
name, address, phone number, email, health records, financial transactions, online
activities, preferences, and opinions.
Within the realm of personal data, time-series data poses unique challenges and
sensitivities in terms of privacy protection. Time-series data contains temporal
information that can unveil patterns, trends, and behaviors of individuals or groups
over time. Examples of personal time-series data include weblogs, social media posts,
GPS traces, health records, among others. While time-series data serves various
purposes, such as forecasting, anomaly detection, classification, clustering, and
summarization, it also carries inherent privacy risks.
For instance, an attacker could potentially deduce the identity of a user by matching
their GPS traces or weblogs with publicly available information. Similarly, the
analysis of temporal patterns or correlations within sensor readings or social media
posts could lead to the identification of a user. Moreover, the observation of changes
or trends over time in health records can reveal an individual's preferences, habits,
and activities. These scenarios underscore the significance of safeguarding privacy
when dealing with time-series data to mitigate the potential risks associated with
unauthorized access or misuse of personal information.
However, in the world of collaboration, there is a real need for data sharing among
researchers from multiple organizations to conduct analyses that benefit
humankind (e.g., the COVID-19 mobility data shared by Google and Apple). At that
stage, the data is published and shared between the parties, and even if
anonymization techniques are applied before publishing, there remains a significant
risk of re-identification. For instance:
• In 2006, AOL released a dataset of 20 million search queries from 650,000 users
over a three-month period [1]. The dataset was intended for research purposes and
was anonymized by replacing user IDs with random numbers. However, researchers
and journalists were able to re-identify some users by analyzing their search queries
and linking them with other public information, such as news articles, blogs, or
social media profiles. Some users were exposed to embarrassment, harassment, or
legal trouble because their search histories were revealed.
• In 2007, Netflix released a dataset of 100 million movie ratings from 480,000
users over a six-year period. The dataset was intended for a contest to improve
Netflix's recommendation system and was anonymized by removing usernames and
other identifying information. However, researchers [2] were able to re-identify
some users by comparing their movie ratings with those on another movie website,
IMDb. Some users were exposed to privacy breaches, as their movie preferences
could reveal their political views, sexual orientation, or health conditions.
2. Data Publishing and Privacy Preserving Data Publishing
Data publishing is the process of making data available to the public or other parties
for analysis, research, or decision-making. However, data often contains sensitive
information about individuals, such as their identities, preferences, health conditions,
or financial records. If such data is published without proper protection, it may lead
to privacy breaches and harm the individuals' interests or rights.
Privacy-preserving data publishing (PPDP) is a field of study that aims to protect
individuals' privacy while maintaining the data's utility for legitimate purposes. PPDP
involves applying various techniques to transform or anonymize the data before
publishing it, such that the sensitive information is hidden or distorted, but the general
patterns or statistics are preserved.
There are different types of data that can be published, such as tabular data (e.g.,
census records, medical records) or graph data (e.g., social networks, web graphs).
Depending on the data type and the privacy requirements, different PPDP
techniques can be used. Some of the common PPDP techniques, adapted from
[7] and [16], are listed below; a small illustrative sketch follows the list:
• Generalization: This technique replaces specific values in the data with more
general ones, such as replacing ages with age ranges or ZIP codes with regions.
Generalization reduces the granularity of the data and makes it harder to identify
individuals based on their attributes.
• Suppression: This technique removes some values or records from the data
entirely, such as deleting names or phone numbers or omitting outliers.
Suppression reduces the amount of information in the data and makes it harder to
link individuals across different sources.
• Perturbation: This technique adds noise or randomness to the data, such as
swapping values among records, adding or subtracting a small amount from
numerical values, or flipping bits in binary values. Perturbation alters the original
values in the data and makes it harder to infer the true values based on statistical
analysis.
• Encryption: This technique transforms the data using cryptographic methods,
such as hashing, encryption, or homomorphic encryption. Encryption protects the
confidentiality of the data and makes it harder to access the original values without
a secret key.
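To make these techniques concrete, the following minimal Python sketch (the toy records and helper names are invented for illustration, not drawn from [7] or [16]) applies perturbation, suppression, and generalization in turn to a small record set:

```python
import random

# Toy records: (name, age, zip_code, purchase_amount)
records = [
    ("Alice", 34, "70001", 120.0),
    ("Bob",   37, "70003",  80.0),
    ("Carol", 52, "70112", 200.0),
]

def perturb(rec, scale=10.0):
    """Perturbation: add small random noise to the numeric value."""
    name, age, zip_code, amount = rec
    return (name, age, zip_code, amount + random.uniform(-scale, scale))

def suppress(rec):
    """Suppression: remove the direct identifier (the name) entirely."""
    _, age, zip_code, amount = rec
    return ("*", age, zip_code, amount)

def generalize(rec):
    """Generalization: replace the exact age with a 10-year range and
    keep only the first three digits of the ZIP code."""
    name, age, zip_code, amount = rec
    decade = (age // 10) * 10
    return (name, f"{decade}-{decade + 9}", zip_code[:3] + "**", amount)

for rec in records:
    print(generalize(suppress(perturb(rec))))
```

In practice these transformations are not applied ad hoc; they are tuned to satisfy a formal privacy model such as those discussed in Chapter 2.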
3. Challenges of Privacy Preserving Data Publishing (PPDP) for Time-series data
As noted above, PPDP is a research area that aims to protect the privacy of
individuals or organizations while allowing valuable data analysis on published
data. Time-series data is a type of data that records the values of certain variables
over time, such as sensor readings, stock prices, or health records. PPDP for
time-series data poses several challenges, such as (adapted from [7] and [16]):
• Correlation: Time-series data often exhibits strong temporal or spatial
correlation, which means that some attributes' values depend on the values of
other attributes or of previous time points. This can lead to privacy breaches if an
adversary can exploit this correlation to infer sensitive information from the
published data.
• Dynamics: Time-series data is often updated or streamed continuously, which
requires efficient and adaptive PPDP methods that can handle dynamic changes
and updates without compromising privacy or utility.
4. Differential Privacy as a powerful player
To address the challenges above, there is a principle that is widely accepted and
explored by researchers today, called Differential Privacy, introduced in 2006 by
Professor Cynthia Dwork (adapted from [7], [11], [16], [21], [22]).
Differential privacy is a commitment made by a data holder or curator to a data
subject: no matter what other studies, datasets, or information sources are
available, allowing your data to be used in any study or analysis will have no effect
on you, whether positive or negative. At their best, differentially private database
systems can make sensitive data available for effective data analysis without the
need for clean data rooms, data usage agreements, or restricted views.
Differential privacy guarantees that the same conclusions, such as "smoking causes
cancer", will be reached regardless of whether a person opts in or out of the data
collection. In particular, it assures that each output sequence (the responses to
queries) is "essentially" equally likely to occur, regardless of the presence or
absence of any individual. Here, the probabilities are taken over the random
choices made by the privacy mechanism (something controlled by the data curator),
and the word "essentially" is quantified by the privacy parameter ε.
Differential privacy can be achieved by adding carefully calibrated noise to the
original data or to the output of a data analysis function. Differential privacy has
been applied to time-series data in various scenarios, such as smart metering,
location-based services, or health monitoring.
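As a minimal sketch of such noise calibration (illustrative only; the mechanism and its formal guarantees are treated in Chapter 3), the following Python snippet answers a counting query with the Laplace mechanism. A count changes by at most 1 when one individual is added or removed, so the noise scale is 1/ε:

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Answer a counting query with epsilon-differential privacy.

    A count has sensitivity 1 (it changes by at most 1 when one
    individual joins or leaves), so the Laplace scale is 1/epsilon.
    """
    true_count = sum(1 for x in data if predicate(x))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [34, 37, 52, 29, 61, 45]
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: {laplace_count(ages, lambda a: a >= 40, eps):.2f}")
```

Smaller values of ε give stronger privacy but noisier answers, which is exactly the utility trade-off examined experimentally in Chapter 5.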
In this thesis, we explore Differential Privacy, from its theory to some proposed
solutions for protecting the privacy of time-series data.
5. Thesis objectives
• To review the traditional Privacy Preserving Data Publishing methods and the
efforts made for time-series data.
• To understand the theories and principles of Differential Privacy.
• To explore notable mechanisms in Differential Privacy for time-series data.
• To explore and solve the privacy use-case in the data partnership with
Differential Privacy and other techniques, and then build the process to apply
privacy techniques to business collaboration.
6. Thesis contributions
• An effort to make Differential Privacy easier to understand, especially for
non-academic and corporate audiences.
• A suggested guideline to apply and evaluate Differential Privacy on time-series
data in the use-case of data collaboration between multiple parties.
7. Thesis structure
Chapter 1: Background and context (this chapter) – summarizes the purpose of data
publishing, the purpose of privacy protection for released data, the challenges posed
by time-series data, and the Differential Privacy principle in brief.
Chapter 2: Literature review – summarizes and compares privacy attack models with
notable algorithms, briefly analyzes their weaknesses, and draws the connection to
Differential Privacy.
Chapter 3: Investigation on Differential Privacy – explores the mathematical concepts,
theories, foundational mechanisms, and related techniques for time-series data.
Chapter 4: Experiment designs – based on a synthetic use-case drawn from a real-world
problem, analyzes and proposes a process and solution to protect the privacy of
individuals in a data partnership.
Chapter 5: Experiment implementations – based on the proposal in Chapter 4, applies
Differential Privacy techniques and data analytics to implement the related
requirements.
Chapter 6: Conclusion and future works.
CHAPTER 2: PRIVACY MODELS RESEARCH
1. Attack models and notable privacy models
1.1. Record linkage attack and k-Anonymity privacy model
Record linkage attacks (adapted from [7], [9], [16]) are privacy attacks that exploit
the linkage of records across different data sources to reveal sensitive information
about individuals. These attacks typically involve identifying records in different
datasets that correspond to the same individual and then linking the records together
to obtain additional information. Usually, the linkage attack is based on
quasi-identifiers (QIDs), such as age, gender, or ZIP code, which are used to link
records that correspond to the same individual across different datasets.
According to Sweeney's research [9], ZIP code, date of birth, and gender alone give
an 87% chance of uniquely identifying an individual in the US, which is a very high
probability.
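The mechanics of such an attack can be sketched in a few lines of Python (the two toy tables and their column names are invented for illustration): joining an "anonymized" release with a public table on the shared quasi-identifiers re-attaches identities.

```python
import pandas as pd

# "Anonymized" release: names removed, quasi-identifiers kept.
medical = pd.DataFrame({
    "zip":       ["70001", "70003", "70112"],
    "birthdate": ["1989-05-24", "1986-01-02", "1971-11-30"],
    "sex":       ["F", "M", "F"],
    "diagnosis": ["flu", "cancer", "diabetes"],
})

# Public table with explicit identifiers (e.g., a voter list).
public = pd.DataFrame({
    "name":      ["Alice", "Bob", "Carol"],
    "zip":       ["70001", "70003", "70112"],
    "birthdate": ["1989-05-24", "1986-01-02", "1971-11-30"],
    "sex":       ["F", "M", "F"],
})

# Record linkage: an inner join on the quasi-identifiers
# re-attaches names to the "anonymized" diagnoses.
linked = medical.merge(public, on=["zip", "birthdate", "sex"])
print(linked[["name", "diagnosis"]])
```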
To counter this attack, Sweeney [9] proposed the k-anonymity model: if one record
in the table has some QID value, then at least k-1 other records also have that QID
value, which means each record is indistinguishable from at least k-1 other records
with respect to the QID. Therefore, the probability of linking a victim to a specific
record through the QID is at most 1/k.
k-anonymity uses data generalization and suppression techniques to ensure that
every combination of values of the quasi-identifiers is indistinguishably matched to
at least k individuals in one equivalence class.
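Under this definition, a table can be checked for k-anonymity by grouping on the QID columns and verifying that every equivalence class contains at least k records. A minimal sketch, assuming a pandas DataFrame with illustrative column names:

```python
import pandas as pd

def is_k_anonymous(df, qid_columns, k):
    """True iff every combination of quasi-identifier values
    appears in at least k records (every equivalence class >= k)."""
    return df.groupby(qid_columns).size().min() >= k

df = pd.DataFrame({
    "age_range": ["30-39", "30-39", "50-59", "50-59"],
    "zip":       ["700**", "700**", "701**", "701**"],
    "diagnosis": ["flu", "cancer", "diabetes", "flu"],
})
print(is_k_anonymous(df, ["age_range", "zip"], k=2))  # True
```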
While k-anonymity can prevent identity disclosure, so that no individual's exact
record can be identified in the database, it does not prevent attribute disclosure.
Attribute disclosure occurs when it is possible, even without finding the exact
record of an individual, to reveal their sensitive attributes.
Figure 1. A fictitious database with hospital data, taken from [14]
1.2. Attribute linkage attack and l-diversity and t-closeness privacy model
The attribute linkage attack (adapted from [7], [14], [16]) enables an attacker to infer
the sensitive values of a target victim from the published data, using the set of
sensitive values associated with the group the victim belongs to, even when the
attacker cannot accurately identify the victim's record. k-anonymized databases
suffer from attribute disclosure through two attack forms: the homogeneity attack and
the background knowledge attack. The homogeneity attack is possible when the
sensitive attributes lack diversity, while the background knowledge attack relies on
partial knowledge of an individual or of the distribution of sensitive and
non-sensitive attributes in a population.
In 2006, Machanavajjhala [14] proposed the l-diversity method to address the
attribute linkage attack. The core principle of l-diversity is to prevent homogeneity
of sensitive attributes within equivalence classes. Specifically, a database is
considered l-diverse if there exist at least l distinct values of the sensitive attribute
in every equivalence class. To break l-diversity, an adversary seeking to breach
the privacy of an individual within the database would need to possess l-1 pieces of
background knowledge.
However, a drawback of l-diversity is that it can be too difficult or unnecessary to
achieve. Imagine a scenario with 10,000 data records and a sensitive test-result
attribute whose value is A (negative) for 99% of the individuals and B (positive) for
the remaining 1%. The two possible values have different sensitivities: one would
not mind being known to have tested negative, so in an equivalence class that
contains only A-entries for the sensitive attribute, l-diversity is unnecessary. But to
achieve 2-diversity, every equivalence class must contain at least one of the
10,000 × 1% = 100 B-records, so at most 100 equivalence classes could exist, which
would lead to a large information loss.
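Analogously, (distinct) l-diversity can be checked by counting the distinct sensitive values per equivalence class; a minimal sketch using the same assumed toy layout as before:

```python
import pandas as pd

def is_l_diverse(df, qid_columns, sensitive, l):
    """True iff every equivalence class contains at least
    l distinct values of the sensitive attribute."""
    return df.groupby(qid_columns)[sensitive].nunique().min() >= l

df = pd.DataFrame({
    "age_range": ["30-39", "30-39", "50-59", "50-59"],
    "zip":       ["700**", "700**", "701**", "701**"],
    "diagnosis": ["flu", "cancer", "diabetes", "diabetes"],
})
# The second class holds only "diabetes", so 2-diversity fails.
print(is_l_diverse(df, ["age_range", "zip"], "diagnosis", l=2))  # False
```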
Both l-diversity and k-anonymity are insufficient to prevent attribute disclosure, as
two further attacks are conceivable: the skewness attack and the similarity attack. If
the overall distribution of sensitive attributes is skewed, as in the 99%-to-1%
example, a skewness attack is possible. Imagine that one equivalence class contained
the same number of positive and negative records. The database would still be
2-diverse, but the risk of anyone in that class being considered positive would
increase from 1% to 50%, which could compromise the privacy of individuals.
Similarity attacks are possible if the sensitive attribute values within an equivalence
class are distinct but semantically similar. Consider once more the hospital example
and an equivalence class with the sensitive attribute values pulmonary cancer, breast
cancer, and brain cancer. These are distinct values, but they all represent a form of
cancer; therefore, an attacker learns that any person in this equivalence class has
cancer.
To overcome the similarity and skewness attacks on l-diversity, Li et al. [15]
introduced t-closeness, a privacy model that is more stringent than l-diversity. It
requires that the distribution of the sensitive attribute values in each equivalence
class be t-close to the distribution of the sensitive attribute values in the entire
dataset. This ensures that an attacker cannot infer the value of a sensitive attribute
for an individual record simply by looking at the distribution of the sensitive
attribute values in the equivalence class that the record belongs to.
In other words, t-closeness ensures that the distribution of the sensitive attribute
values in each equivalence class is not significantly different from the distribution
of the sensitive attribute values in the entire dataset, which makes it more difficult
for an attacker to infer the value of a sensitive attribute for an individual record.
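The "t-close" condition can be illustrated by measuring the distance between each class's sensitive-value distribution and the overall distribution. Li et al. [15] use the Earth Mover's Distance; the sketch below substitutes the simpler total-variation distance for a categorical attribute, so it conveys the idea rather than reproducing the paper's exact metric:

```python
import pandas as pd

def max_class_distance(df, qid_columns, sensitive):
    """Largest total-variation distance between any equivalence class's
    sensitive-value distribution and the overall distribution;
    t-closeness requires this maximum to be at most t."""
    overall = df[sensitive].value_counts(normalize=True)
    worst = 0.0
    for _, group in df.groupby(qid_columns):
        local = group[sensitive].value_counts(normalize=True)
        # Align the class distribution with the global one, then
        # compute the total-variation distance.
        diff = local.reindex(overall.index, fill_value=0.0) - overall
        worst = max(worst, diff.abs().sum() / 2.0)
    return worst

df = pd.DataFrame({
    "age_range": ["30-39", "30-39", "50-59", "50-59"],
    "diagnosis": ["flu", "cancer", "diabetes", "diabetes"],
})
print(max_class_distance(df, ["age_range"], "diagnosis"))  # 0.5
```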
Although t-closeness offers better protection than l-diversity, it has significant
drawbacks: it is computationally expensive and incurs a large loss of data utility.
Therefore, in the era of big data, l-diversity and t-closeness are rarely cited as best
practices.
1.3. Table linkage and δ-presence privacy model
The concepts of record linkage and attribute linkage typically assume that the
attacker is already aware of the presence of the victim's record in the published
table. However, this may not always be the case. In some situations, the mere
existence or absence of the victim's record in the dataset can reveal sensitive
information. For instance, consider a scenario where a medical facility releases a
data table concerning a specific illness. The mere presence of the victim's record in
the database can have detrimental consequences. Table linkage occurs when an
attacker can reliably deduce either the presence or absence of the victim's record in
the published table.
In 2007, Nergiz (adapted from [7], [16]) proposed a solution to the table linkage
problem called the delta-presence (δ-presence) approach. The fundamental idea is to
constrain, or "bound", the probability of inferring the presence of any potential
victim's data within a designated range referred to as δ. The δ-presence approach
offers the advantage of indirectly preventing record and attribute linkages: if an
attacker has a confidence level of at most δ% regarding the presence of the victim's
record in the released table, then the likelihood of successfully linking to the
victim's record and sensitive attributes is also limited to at most δ%. Therefore,
δ-presence can indirectly mitigate the risks associated with record and attribute
linkage.
While the δ-presence approach provides a "safe" privacy model, it relies on the
assumption that the data publisher has access to the same external table as the
attacker, which may not be a realistic assumption in practice.
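Under a simplified reading of the model (the toy tables and the helper name below are invented for illustration, not taken from [16]), the quantity that δ-presence bounds can be computed directly: the attacker's inferred probability that an individual from the external table appears in the private table is the ratio of matching published records to matching external records.

```python
import pandas as pd

def presence_probability(public_row, public_df, published_df, qid_columns):
    """Attacker's inferred probability that public_row's individual is in
    the (generalized) published table: matching published records divided
    by matching external records."""
    match_pub = (public_df[qid_columns] == public_row[qid_columns]).all(axis=1)
    match_rel = (published_df[qid_columns] == public_row[qid_columns]).all(axis=1)
    return match_rel.sum() / match_pub.sum()

# External table known to the attacker (and, by assumption, the publisher).
public = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "zip":  ["700**", "700**", "701**", "701**"],
})
# Generalized published table about a specific illness.
published = pd.DataFrame({"zip": ["700**", "701**"]})

for _, row in public.iterrows():
    print(row["name"], presence_probability(row, public, published, ["zip"]))
```

Here every individual's presence is inferred with probability 0.5; δ-presence requires this value to stay within the chosen (δmin, δmax) range for every row of the external table.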