
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

NHAM LAP HANH

PRIVACY PRESERVING IN SHARING DATA

Major: COMPUTER SCIENCE
Code: 8480101

MASTER’S THESIS

HO CHI MINH CITY, July 2023


THIS THESIS IS COMPLETED AT
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM
Supervisor : Dr. TRUONG TUAN ANH
Examiner 1: Assoc. Prof. Dr. NGUYEN TUAN DANG
Examiner 2: Assoc. Prof. Dr. NGUYEN VAN VU

This master’s thesis is defended at HCM City University of Technology,
VNU - HCM City on July 11, 2023.
Master’s Thesis Committee:
1. Chairman: Assoc. Prof. Dr. LE HONG TRANG
2. Secretary: Dr. PHAN TRONG NHAN
3. Examiner 1: Assoc. Prof. Dr. NGUYEN TUAN DANG
4. Examiner 2: Assoc. Prof. Dr. NGUYEN VAN VU


5. Commissioner: Dr. TRUONG TUAN ANH

Approval of the Chairman of Master’s Thesis Committee and Dean of Faculty of
Computer Science and Engineering after the thesis has been corrected (if any).

CHAIRMAN OF THESIS
COMMITTEE

Assoc. Prof. Dr. LE HONG TRANG

HEAD OF FACULTY OF
COMPUTER SCIENCE AND
ENGINEERING




VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom - Happiness

THE TASK SHEET OF MASTER’S THESIS
Full name: NHAM LAP HANH                  Student ID: 1991020
Date of birth: November 23, 1991          Place of birth: HCM City
Major: Computer Science                   Major ID: 8480101

I. THESIS TITLE: PRIVACY PRESERVING IN SHARING DATA / BẢO VỆ QUYỀN RIÊNG TƯ TRONG DỮ LIỆU CHIA SẺ

II. TASKS AND CONTENTS:

- Researched data anonymization.
- Researched and analyzed the strengths and weaknesses of existing data anonymization solutions.
- Based on this, proposed an improved algorithm for data anonymization.
- Implemented and evaluated the improved algorithm.

III. THESIS START DAY: February 01, 2023

IV. THESIS COMPLETION DAY: May 31, 2023

V. SUPERVISOR: Dr. TRUONG TUAN ANH
Ho Chi Minh City, June 09, 2023
SUPERVISOR

CHAIR OF PROGRAM COMMITTEE

Dr. TRUONG TUAN ANH
DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING




ACKNOWLEDGEMENTS
I am deeply grateful to Dr. Truong Tuan Anh for his wholehearted help and
guidance during my thesis work. He provided me with valuable insights and feedback,
and his support was essential to the completion of my thesis.
I would also like to thank Assoc. Prof. Dr. Nguyen Tuan Dang and Assoc. Prof.
Dr. Nguyen Van Vu for their useful comments on how to complete my thesis before
defending it. Their advice was invaluable, and I am very grateful for their help.
Besides, on behalf of all students, I would like to express my gratitude to the
teachers of Ho Chi Minh City University of Technology, especially those in the
Faculty of Computer Science and Engineering. Your enthusiastic guidance and
support have been invaluable to us, both during our time at the university and in our

later career paths.
Finally, I would like to thank my family and friends for their help, support, and
encouragement throughout this process. They were always there for me, and I could
not have done it without them.




ABSTRACT
Currently, the amount of data generated is growing exponentially every year.
This data is collected and used to discover valuable information using data mining
techniques. However, this data can contain sensitive information, so it is important to
analyze it in a way that protects privacy. Many privacy protection models have been
proposed to address this situation, but k-anonymity is the most widely used because it is
practical and easy to implement. Many researchers are also working on ways to
protect data privacy while maximizing data utility. However, these methods are too
general and do not focus on a specific data mining technique. This thesis proposes an
algorithm that protects data privacy while maximizing its usefulness for a specific data
mining technique.




TÓM TẮT LUẬN VĂN
Hiện nay, lượng dữ liệu được tạo ra đang tăng theo cấp số nhân hàng năm. Dữ
liệu này được thu thập và sử dụng để khám phá thông tin giá trị bằng cách sử dụng kỹ
thuật khai thác dữ liệu. Tuy nhiên, dữ liệu này có thể chứa các thông tin nhạy cảm, vì
vậy cần phải phân tích nó theo cách bảo vệ quyền riêng tư. Nhiều mô hình bảo vệ
quyền riêng tư đã được đề xuất để áp dụng cho tình huống này, nhưng k-anonymity là
mô hình được sử dụng rộng rãi nhất vì nó thực tế và dễ thực hiện trong đa số các
trường hợp. Nhiều nhà nghiên cứu cũng đang nghiên cứu cách bảo vệ quyền riêng tư
của dữ liệu đồng thời tối đa hóa tính hữu ích của dữ liệu. Tuy nhiên, các phương pháp
này rất chung chung và không tập trung vào một kỹ thuật khai thác dữ liệu cụ thể.
Luận văn này đề xuất một thuật toán bảo vệ quyền riêng tư của dữ liệu đồng thời tối
đa hóa tính hữu ích của nó cho một kỹ thuật khai thác dữ liệu cụ thể.




THE COMMITMENT OF THE THESIS’ AUTHOR:
I hereby declare that this is my own research work.
The data and results presented in the thesis are honest and have never been
published in any previous works.
Student
NHAM LAP HANH




Table of contents
THE TASK SHEET OF MASTER’S THESIS............................................................ ⅰ
ACKNOWLEDGEMENTS........................................................................................ ⅱ
ABSTRACT............................................................................................................... ⅲ
TÓM TẮT LUẬN VĂN............................................................................................ ⅳ
THE COMMITMENT OF THE THESIS’ AUTHOR:...............................................ⅴ
CHAPTER 1: INTRODUCTION............................................................................ 1
1.1. General introduction....................................................................................... 1
1.2. Research problem........................................................................................... 2
1.3. Objectives of the topic.................................................................................... 4
1.4. Research significances.................................................................................... 4

CHAPTER 2: OVERVIEW..................................................................................... 5
2.1. Information Anonymization........................................................................... 5
2.2. Privacy Preserving Data Mining................................................................... 11
2.3. Related works............................................................................................... 14
2.3.1. One-Pass K-Means Algorithm............................................................. 17
2.3.2. K-Anonymity Algorithm Based On Improved Clustering................... 18
2.3.3. A clustering-based anonymization approach for privacy-preserving in
the healthcare cloud....................................................................................... 19
2.3.4. Mondrian.............................................................................................. 19
2.3.5. NonHomogenous Anonymization Approach Using Association Rule
Mining For Preserving Privacy...................................................................... 20
2.3.6. M3AR Algorithm................................................................................. 21
CHAPTER 3: THE PROPOSED ALGORITHM................................................ 27
3.1. Definitions.................................................................................................... 27
3.2. Budgets Calculation...................................................................................... 28
3.3. The impact of member migration on association rules.................................32
3.4. The proposed algorithm................................................................................ 34
CHAPTER 4: EVALUATION................................................................................43
4.1. Implementation............................................................................................. 43
4.2. Dataset.......................................................................................................... 43
4.3. Metrics.......................................................................................................... 45
4.4. Evaluation..................................................................................................... 47
CHAPTER 5: CONCLUSIONS............................................................ 55
REFERENCES........................................................................................................ 57

SUPERVISOR: Dr. TRUONG TUAN ANH

STUDENT: NHAM LAP HANH





Table of Figures
Figure 1.1: Overview of Privacy-Preserving Data Publishing.................................... 1
Figure 1.2. Illustrate the process of Anonymize and mine......................................... 3
Figure 2.1: An example of domain and value generalization hierarchies................... 8
Figure 2.2: An example of generating Association Rule using Apriori algorithm... 14
Figure 2.3: An example of attacks on k-anonymity.................................................. 15
Figure 2.4: A classification tree example for a category attribute............................ 16
Figure 2.5: Core process of the Mondrian algorithm................................................ 20
Figure 2.6: The Block Diagram Of Nonhomogenous Anonymization Approach.... 21
Figure 2.7: The M3AR algorithm pseudocode......................................................... 24
Figure 2.8: Disperse function of M3AR algorithm................................................... 25
Figure 4.1: Results on metric Lost Rules Percentage............................................... 49
Figure 4.2: Results on metric New Rules Percentage............................................... 50
Figure 4.3: Results on metric Different Rules Percentage........................................ 51
Figure 4.4: Results on metric Average Group Size................................................... 52
Figure 4.5: Results on metric Running Time............................................................ 53





Table of Tables
Table 2.1. k-anonymity, types of attribute in dataset ................................................. 6
Table 2.2: Microdata table of criminal records........................................................... 9
Table 2.3: A 3-anonymous version of Table 2.2....................................................... 10

Table 2.4 : Example of Association Rule Mining..................................................... 12
Table 2.5: Example of 3-anonymity on tuples member migration technique........... 22
Table 4.1: The metrics function description.............................................................. 45


CHAPTER 1: INTRODUCTION
1.1. General introduction
Currently, the volume of data generated increases exponentially every year.

This data has brought many benefits to many organizations, such as storing, sharing,
and exploiting data using data mining techniques. Through data mining, some
valuable information can be discovered from that shared data. Massive data
generated from various sources can be processed and analyzed to support
decision-making. More and more personal information is contained in this data,
which leads to serious privacy concerns. Therefore, analyzing data in a
privacy-preserving manner becomes very important.
Besides, governments have published data privacy rules, such as HIPAA
(Health Insurance Portability and Accountability Act) in the US and the Data
Protection Regulation GDPR in Europe in order to control the use and sharing of
data to protect user privacy. Any organization that is found to be disclosing user
information will be subject to severe fines.


Figure 1.1: Overview of Privacy-Preserving Data Publishing. [1]
As a result, Privacy-Preserving Data Publishing (PPDP) [19] has become an
area of interest to researchers and practitioners. A typical scenario of PPDP is
depicted in Figure 1.1, showing the different stages of data processing. One key
assumption of the PPDP model is that attackers can be found among data recipients


who intend to discover sensitive information about individuals. Therefore, the goal
of PPDP techniques is to modify data by making it less specific in a way that
protects the privacy of individuals; while aiming to maintain the usefulness of
anonymous data. The essence of PPDP is to create datasets that have good utility for
various tasks since, typically, all potential use scenarios for the data are unknown at
the time of publication. For example, under open data initiatives, it is not possible to
identify all data recipients. Therefore, any data controller involved in sharing
personal data should apply privacy protection mechanisms.
1.2. Research problem
Many PP models have been proposed to address this situation. These models
have been developed to consider different attack scenarios against data, for
example, assuming an attacker has varying degrees of background knowledge that
could lead to information disclosure. Examples of well-known models are k-anonymity
[2], l-diversity [3], t-closeness [4] and differential privacy [17]. Among these

models, k-anonymity is focused on and used widely by researchers and
organizations because this model is realistic and can be easily achieved in most
cases. Furthermore, although it has been shown that k-anonymity is vulnerable to
specific attacks, it allows for general-purpose data to be published with reasonable
utility. This is in contrast to more robust models (e.g., differential privacy) which
might hamper data quality in order to preserve a stricter guarantee of privacy [18].
These characteristics may make k-anonymity attractive to practitioners, who can
adopt it within their organization for an anonymity strategy with a formal guarantee
of privacy.
However, in reality, the mining process is executed by different organizations or
individuals through data mining techniques (association rule mining, for example).
Privacy-Preserving Data Mining (PPDM) [23] methodologies are designed to
guarantee a certain level of privacy while maximizing the utility of the data, such
that data mining can still be performed efficiently on the transformed data.


Some benefits of information technologies are only possible through the
collection and analysis of (sometimes sensitive) data. However, transforming the
data may also reduce its utility, resulting in inaccurate or even infeasible
knowledge extraction through data mining.


Figure 1.2. Illustrate the process of Anonymize and mine
A simple example of data sharing is shown in Figure 1.2: the owner of a
dataset D containing sensitive data, quasi-identifying (approximate) data, and
identifiers. Some other data miners require the data for their purposes (to analyze
and mine it using association rule techniques to extract valuable information). To
protect the privacy of users with sensitive data, the data owner must use PP
algorithms (e.g., k-anonymity) to anonymize the original dataset D into a dataset D'
for sharing with these data miners. The data miner will then use a mining technique
(e.g., association rules) to mine dataset D' to obtain valuable information.
As a result, the key objective here is to anonymize the data while still retaining
all association rules of dataset D (especially those with high support and high
confidence) in dataset D' after anonymization, so that data mining on dataset D'
gives the same results as on D.


1.3. Objectives of the topic
As mentioned, many studies have been conducted on anonymizing data

before releasing it to third parties. Maximizing the utility of data and minimizing
privacy risk are two opposing goals. As a result, the goal of this thesis is to
investigate the state of algorithms that preserve data privacy while ensuring good

data mining results. Finally, this thesis will propose an efficient algorithm for
achieving k-anonymity that outperforms current state-of-the-art algorithms.
Additionally, the proposed algorithm preserves significant association rules in
k-anonymity data so that the data mining process based on association rule mining
can preserve valuable information as in the original data.
1.4. Research significances
In recent years, there has been a growing trend of data collection and

information extraction. This has led to an increased risk of sensitive information
being leaked. As a result, Privacy-Preserving Data Publishing (PPDP) has become an
important area of research. PPDP is the process of publishing data in a way that
protects the privacy of individuals. This can be done by anonymizing the data,
which means removing or obscuring personal information. There are a number of
different anonymization techniques, each with its own strengths and weaknesses.
One common anonymization technique is k-anonymity. k-anonymity ensures that
each individual in a dataset cannot be identified by looking at only a few pieces of
information. This is done by grouping together records that are similar in certain
ways. However, many k-anonymity methods use generalization and suppression
operations, and these two operations have some weaknesses. Therefore, adopting the
tuple member migration technique, which has advantages over these two operations,
for k-anonymity will improve the data quality of the anonymization process. This
will also be useful in practice for organizations or individuals that need to share
their data for their clients to mine.


CHAPTER 2: OVERVIEW
2.1. Information Anonymization
Information anonymization [15] is the process of removing personally
identifiable information from data sets, so that the people whom the data describe
remain anonymous. It allows information to be exchanged across a boundary, such as
between two departments within a center or between two agencies, while reducing
the risk of unintended disclosure, and in certain circumstances in a way that permits
evaluation and analysis after anonymization. This technique is used in efforts to
increase the security of the information while still enabling it to be analyzed and
used. It transforms the data that will be used or published to prevent the
identification of key information. There are a number of different information
anonymization methods, each with its own advantages and disadvantages. Some of
the most common methods include:
● k-anonymity [2]: This method ensures that no individual in the dataset can be
uniquely identified by a combination of quasi-identifier values. For example, if a
dataset contains the ages, genders, and ZIP codes of individuals, k-anonymity
would ensure that every combination of age, gender, and ZIP code appearing in
the dataset is shared by at least k individuals.
● l-diversity [3]: This method ensures that each group of records sharing the same
quasi-identifier values contains at least l different values for the sensitive
attribute. For example, if the sensitive attribute is disease, l-diversity would
ensure that each such group contains at least l different diseases.
● t-closeness [4]: This method ensures that the distribution of the sensitive
attribute within each group of records is close to its distribution in the overall
dataset. For example, if a dataset contains the ages of individuals, t-closeness
would ensure that the proportion of individuals in each age group within a group
of records is close to the proportion in the whole dataset.



K-Anonymity
The k-anonymity [2] model is an approach to protect data from individual
identification. It works by ensuring that each record of a table is identical to at least
k−1 other records with respect to a set of privacy-related attributes, called
quasi-identifiers (Table 2.1 shows 3 different types of attribute), that could be
potentially used to identify individuals by linking these attributes to external data
sets.
Table 2.1. k-anonymity, types of attributes in a dataset.

Type of Attributes          Description                                      Example
Explicit identifier (ID)    Attributes that explicitly identify              ID, NAME
                            people or objects.
Quasi attributes (QIDs)     A collection of characteristics that can be      MARITAL STATUS, AGE,
                            used to identify people or objects.              SEX, ZIP CODE
Sensitive attributes (SA)   Sensitive information to be concealed.           CRIME, SALARY, DISEASE
This anonymization can be done by Generalization as well as Suppression

techniques:


● Generalization is the process of converting a value into a less specific, more
general term. Generalization is based on a domain generalization hierarchy and
a corresponding value generalization hierarchy on the values in the domains.
Typically, the domain generalization hierarchy is a total order and the
corresponding value generalization hierarchy a tree, where the parent/child
relationship represents the direct generalization/specialization relationship.
Figure 2.1 illustrates an example of possible domain and value generalization
hierarchies for the quasi-identifying attributes of our example.
○ Attribute (AG): Generalization is performed at the column level; all the
values in the column are generalized in a single generalization step.
○ Cell (CG): Generalization can also be performed on a single cell; the
resulting generalized table may then contain, for a given column, values at
different levels of generalization.
● Suppression consists in protecting sensitive information by removing it.
Suppression can be applied at the level of a single cell, an entire tuple, or an
entire column, allowing the amount of generalization required to achieve
k-anonymity to be reduced.
○ Tuple (TS): Suppression is performed at the row level; the suppression
operation removes an entire tuple.
○ Attribute (AS): Suppression is performed at the column level; the
suppression operation hides all the values of a column.
○ Cell (CS): Suppression is performed at the single-cell level; the final
k-anonymized table may eliminate only certain cells of a given
tuple/attribute.
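As an illustrative sketch (not the implementation used in this thesis; the helper names and the 5-year interval width are assumptions), the attribute-level generalizations of the running example can be written as:

```python
# A minimal sketch of attribute-level generalization: ages are replaced by
# 5-year half-open intervals and the last digit of the ZIP code by "*",
# mirroring the example generalization hierarchies of Section 2.1.

def generalize_age(age, width=5):
    """Replace an exact age with an interval, e.g. 29 -> "[25-30)"."""
    lo = (age // width) * width
    return f"[{lo}-{lo + width})"

def generalize_zip(zip_code):
    """Suppress the last digit of a ZIP code, e.g. "32042" -> "3204*"."""
    return zip_code[:-1] + "*"

def generalize_record(record):
    # Assumption: every marital status in the example generalizes one
    # hierarchy step up to "Not Married"; the sensitive attribute is kept.
    return {
        "Marital Status": "Not Married",
        "Age": generalize_age(record["Age"]),
        "ZIP Code": generalize_zip(record["ZIP Code"]),
        "Crime": record["Crime"],
    }

joe = {"Marital Status": "Separated", "Age": 29, "ZIP Code": "32042", "Crime": "Murder"}
print(generalize_record(joe))
# {'Marital Status': 'Not Married', 'Age': '[25-30)', 'ZIP Code': '3204*', 'Crime': 'Murder'}
```

Applying the same two helpers to every record reproduces the attribute-level generalization used in Table 2.3 below.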


Figure 2.1: An example of domain and value generalization hierarchies [1]
For example, consider the criminal records in Table 2.2. Among the
attributes, name is the identifier (ID); marital status, age, and ZIP code are the
quasi-identifiers (QIDs) (their combination may link an exact person, with name, to
a crime); and crime is the sensitive attribute (SA) (i.e., a user's private
information that we do not want to disclose). For instance, the set of values
{Separated, 29, 32042} of {marital status, age, ZIP code} identifies the tuple with
ID=1 (because only this tuple has such data). The attacker can then learn the Crime
of tuple 1 (the sensitive data of user one is disclosed, and privacy is violated).


Table 2.2: Microdata table of criminal records.

          ID     QIDs                               SA
Tuple#    Name   Marital Status   Age   ZIP Code   Crime
1         Joe    Separated        29    32042      Murder
2         Jill   Single           20    32021      Theft
3         Sue    Widowed          24    32024      Traffic
4         Abe    Separated        28    32046      Assault
5         Bob    Widowed          25    32045      Piracy
6         Amy    Single           23    32027      Indecency

After k-anonymization, Table 2.3 shows a 3-anonymous version of Table 2.2,
where anonymity is achieved via generalization at the attribute level [3], i.e., if
two records contain the same value for a QID, they are generalized to the same
value for that QID as well, which means that each tuple shares the same QID values
with at least two other tuples. To achieve anonymity, the ID has been removed and
the QIDs have been generalized:
● the marital status has been replaced with a less specific but semantically
consistent description
● the age values have been replaced with ranges of values
● and the last digit of the ZIP code has been replaced by a “*”
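The k-anonymity property itself is easy to verify mechanically. The following sketch (illustrative only; the function and column names are assumptions, not code from this thesis) groups records by their quasi-identifier values and checks that every equivalence class has at least k members:

```python
from collections import Counter

def satisfies_k_anonymity(records, qids, k):
    """True iff every combination of quasi-identifier values occurs in at
    least k records, i.e. every equivalence class has size >= k."""
    groups = Counter(tuple(r[a] for a in qids) for r in records)
    return all(count >= k for count in groups.values())

# The 3-anonymous Table 2.3 (sensitive attribute Crime kept unchanged)
table_2_3 = [
    {"Marital Status": "Not Married", "Age": "[25-30)", "ZIP Code": "3204*", "Crime": "Murder"},
    {"Marital Status": "Not Married", "Age": "[25-30)", "ZIP Code": "3204*", "Crime": "Assault"},
    {"Marital Status": "Not Married", "Age": "[25-30)", "ZIP Code": "3204*", "Crime": "Piracy"},
    {"Marital Status": "Not Married", "Age": "[20-25)", "ZIP Code": "3202*", "Crime": "Theft"},
    {"Marital Status": "Not Married", "Age": "[20-25)", "ZIP Code": "3202*", "Crime": "Traffic"},
    {"Marital Status": "Not Married", "Age": "[20-25)", "ZIP Code": "3202*", "Crime": "Indecency"},
]
qids = ["Marital Status", "Age", "ZIP Code"]
print(satisfies_k_anonymity(table_2_3, qids, k=3))  # True: both classes have 3 members
```

The same check with k=4 fails, since both equivalence classes contain exactly three tuples.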


Table 2.3: A 3-anonymous version of Table 2.2

               QIDs                                SA
Tuple#   EQ    Marital Status   Age       ZIP Code   Crime
1        1     Not Married      [25-30)   3204*      Murder
4        1     Not Married      [25-30)   3204*      Assault
5        1     Not Married      [25-30)   3204*      Piracy
2        2     Not Married      [20-25)   3202*      Theft
3        2     Not Married      [20-25)   3202*      Traffic
6        2     Not Married      [20-25)   3202*      Indecency

Assume that the data in Table 2.2 will be shared with organizations (called
the collectors). Clearly, if the original data in Table 2.2 are shared with the
collectors (one of whom may be an attacker), the attacker can learn the sensitive
data of a person. Thus, to preserve privacy, we need to perform k-anonymization on
the dataset D in Table 2.2 to hide the private data. Because ID is the explicit
identifier, it is hidden after the anonymization. Crime is the sensitive attribute,
and intuitively, we could hide the crime data in the output Table 2.3 so that the
attacker cannot know the crime of any person (privacy is protected). However, in
many real cases, the data in sensitive attributes contain valuable information for
the data collector. Therefore, they must be kept as in the original data when
sharing the data with collectors (because sensitive attributes are valuable data,
hiding or deleting them would make the shared data useless to the collectors). Now,
based on Table 2.3, the attacker has difficulty determining the exact crime
information of any person.


2.2. Privacy Preserving Data Mining

With the large amount of data collected, organizations can greatly benefit
from knowledge extraction from this available information. The process of
extracting patterns (knowledge) from large data sets, which can then be represented
and interpreted, is called Data Mining. Three of the most common approaches are
association rule mining, classification, and clustering, where classification is a
supervised learning technique, while association rule mining and clustering are
unsupervised.
Among these techniques, association rule mining [19] is one of the most
popular in data mining. An association rule mining algorithm discovers associations
between items. For instance, it can find potential connections between diseases and
patients' habits and gender from patient records, which can help doctors discover
and anticipate a patient's likelihood of getting sick and prevent it in time.
The algorithm uses a support-confidence framework to evaluate the value of
an association rule r: {A, B} → {C}:
● The support: the frequency with which an item or itemset appears in the
dataset D:

    support(r) = frq(A, B, C) / N

Intuitively, this metric reflects the usefulness of the rule r.
● The confidence: the likelihood that the rule is correct, given the
occurrences of the antecedent and consequent in the dataset D:

    confidence(r) = frq(A, B, C) / frq(A, B)


Table 2.4: Example of Association Rule Mining

Age   Gender   Zip Code   SmokingHabit   Disease
34    F        110092     Smoker         Flu
58    M        110032     Smoker         Cancer
46    F        110156     Nonsmoker      Cancer
26    F        113398     Smoker         AIDS
18    F        113301     Nonsmoker      None
42    M        110045     Smoker         Flu
50    M        110087     Nonsmoker      Asthma
52    F        110076     Smoker         Cancer
38    F        118970     Nonsmoker      Flu
45    M        110045     Smoker         Cancer


For example: given Table 2.4, we find association rules based on the Age and
SmokingHabit columns. Let us assume the minimum support threshold
min_sup = 30% and the minimum confidence threshold min_conf = 30%. A rule is
considered significant if its support value is at least 30% and its confidence value
is at least 30%.
We find the following rule: if a person is above 40 and is a smoker, then he
or she is more likely to have cancer.

    r1 = {Age ≥ 40, SmokingHabit = Smoker} → {Disease = Cancer}
The calculation of support and confidence of rule r1:

    support(r1) = 3/10 = 0.3                                        (2.1)

    confidence(r1) = 3/4 = 0.75                                     (2.2)

As support ≥ 0.3 and confidence ≥ 0.3, the above rule is considered
a significant rule.
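These numbers can be checked directly. A small sketch (hypothetical helper code, not part of the thesis) that recomputes support(r1) and confidence(r1) over the rows of Table 2.4:

```python
# Rule r1: {Age >= 40, SmokingHabit = Smoker} -> {Disease = Cancer}
rows = [  # (Age, Gender, ZipCode, SmokingHabit, Disease) from Table 2.4
    (34, "F", "110092", "Smoker", "Flu"),
    (58, "M", "110032", "Smoker", "Cancer"),
    (46, "F", "110156", "Nonsmoker", "Cancer"),
    (26, "F", "113398", "Smoker", "AIDS"),
    (18, "F", "113301", "Nonsmoker", "None"),
    (42, "M", "110045", "Smoker", "Flu"),
    (50, "M", "110087", "Nonsmoker", "Asthma"),
    (52, "F", "110076", "Smoker", "Cancer"),
    (38, "F", "118970", "Nonsmoker", "Flu"),
    (45, "M", "110045", "Smoker", "Cancer"),
]

antecedent = [r for r in rows if r[0] >= 40 and r[3] == "Smoker"]  # frq(A, B) = 4
both = [r for r in antecedent if r[4] == "Cancer"]                 # frq(A, B, C) = 3

support = len(both) / len(rows)           # frq(A, B, C) / N
confidence = len(both) / len(antecedent)  # frq(A, B, C) / frq(A, B)
print(support, confidence)                # 0.3 0.75

min_sup, min_conf = 0.3, 0.3
print(support >= min_sup and confidence >= min_conf)  # True -> r1 is significant
```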
In fact, not all rules are interesting to mine; association rule mining
algorithms only mine significant rules (rules that meet the minimum support and
minimum confidence thresholds). The goal of PPDM is therefore to uphold these
important rules, so that the set of association rules R' of dataset D' after
anonymization is similar or identical to the set of rules R of the original dataset
D. Several algorithms are used in association rule mining, each with its own
strengths and weaknesses.
● Apriori algorithm: One of the most popular association rule mining
algorithms is the Apriori algorithm. The Apriori algorithm is based on the
concept of frequent itemsets, which are sets of items that occur together
frequently in a dataset. The algorithm works by first identifying all the
frequent itemsets in a dataset, and then generating association rules from
those itemsets.
● FP-Growth algorithm: In large datasets, FP-growth is a popular method for
mining frequent item sets. It generates frequent itemsets efficiently without
generating candidate itemsets using a tree-based data structure called the
FP-tree. As a result, it is faster and more memory efficient than the Apriori
algorithm when dealing with large datasets.
● Eclat algorithm: Equivalence Class Transformation, or Eclat is another
popular algorithm for Association Rule Mining. Compared to Apriori, Eclat
is designed to be more efficient at mining frequent itemsets. There are a few
key differences between the Eclat algorithm and the Apriori algorithm. To
mine the frequent itemsets, Eclat uses a depth-first search strategy instead of
candidate generation. Eclat is also designed to use less memory than the

Apriori algorithm, which can be important when working with large datasets.
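To make the levelwise idea of Apriori concrete, here is a minimal, self-contained sketch (an illustrative toy implementation with made-up transactions, not the algorithm as used later in this thesis): frequent k-itemsets are joined into (k+1)-candidates, candidates with an infrequent subset are pruned, and the survivors are counted against the data:

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_sup):
    """Minimal Apriori sketch: levelwise discovery of frequent itemsets.
    An itemset is frequent if it appears in >= min_sup fraction of transactions."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({i for t in transactions for i in t})
    frequent = {}
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        # join step: combine frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        current = [c for c in candidates
                   if all(frozenset(s) in frequent for s in combinations(c, k))
                   and support(c) >= min_sup]
        k += 1
    return frequent

transactions = [frozenset(t) for t in
                [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
                 {"bread", "milk", "butter"}, {"bread", "milk"}]]
freq = apriori_frequent_itemsets(transactions, min_sup=0.6)
print(sorted(tuple(sorted(s)) for s in freq))
# [('bread',), ('bread', 'milk'), ('butter',), ('milk',)]
```

With min_sup = 0.6, only {bread, milk} survives among the pairs (support 3/5); association rules such as {bread} → {milk} would then be generated from these frequent itemsets.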


Figure 2.2: An example of generating Association Rule using Apriori algorithm

2.3. Related works
Many PP models have been proposed to address this situation. Examples of
well-known models are k-anonymity [2], l-diversity [3], t-closeness [4] and
differential privacy [17].
Among these models, k-anonymity is focused on and used widely by
researchers and organizations because this model is realistic and can be easily
achieved in most cases. Although it has been shown that k-anonymity is vulnerable
to specific attacks such as the Homogeneity Attack and the Background Knowledge
Attack (Figure 2.3), it allows for general-purpose data to be published with reasonable
utility. This is in contrast to more robust models (e.g., differential privacy) which
might hamper data quality in order to preserve a stricter guarantee of privacy [18].
These characteristics may make k-anonymity attractive to practitioners, who can
adopt it within their organization for an anonymity strategy with a formal guarantee
of privacy.


Figure 2.3: An example of attacks on k-anonymity.
Several k-anonymity algorithms apply clustering over attribute hierarchical
structures to ensure anonymization, including the works of Jiuyong Li, Raymond
Chi-Wing Wong, Ada Wai-Chee Fu, Jian Pei, A. Abbasi, W. Zheng, A. Almohaimeed,
Y. Yan, and K. Arava [5, 6, 7, 21, 22, 26, 27]. Clustering divides the data into
several clusters based on the attribute similarity of the original dataset, so that
members of the same cluster possess similar attribute values, while the attribute
values of members in different clusters differ. Tiancheng Li, Ninghui Li, Jian
Zhang, and Ian Molloy [4] proposed a novel technique called slicing, which
partitions the data both horizontally and vertically. Swati Goel [13] uses
non-homogeneous anonymization, but also considers a specific data mining
technique, association rule mining; it reduces the data disclosure risk of the
anonymized data. T. Dang and J. Küng [10] do not use the two popular techniques
(generalization and suppression), because data anonymized by these techniques may
not be significant to the data mining processes. The authors proposed a technique
called the Migrate Member technique to anonymize the database. A. Truong and
T. Dang [11, 12] applied the Migrate Member technique to protect location data.
The approach also considers the result of data mining