Data Mining and Knowledge Discovery Handbook, 2nd Edition

620 Maria Halkidi and Michalis Vazirgiannis
31.2.3 Interestingness Measures of Classification Rules
The number of classification patterns generated can be very large, and different approaches
may produce different sets of patterns. The patterns extracted during the classification process
can be represented in the form of rules, known as classification rules. It is important to
evaluate the discovered patterns, identifying those that are valid and provide new knowledge.
Techniques that aim at this goal are broadly referred to as interestingness measures. The
interestingness of the patterns discovered by a classification approach can also be considered
as another quality criterion. Some representative measures (Hilderman and Hamilton, 1999)
for ranking the usefulness and utility of discovered classification patterns (classification rules)
are:
• Rule-Interest Function. Piatetsky-Shapiro introduced the rule-interest function (Piatetsky-
Shapiro, 1991), which is used to quantify the correlation between the attributes in a
classification rule. It is suitable only for single classification rules, i.e. rules in which both
the left- and right-hand sides correspond to a single attribute.
• Smyth and Goodman’s J-Measure. The J-measure (Smyth and Goodman, 1991) is a mea-
sure for probabilistic classification rules and is used to find the best rules relating discrete-
valued attributes. A probabilistic classification rule is a logical implication, X →Y, satis-
fied with some probability p. The left- and right-hand sides of this implication correspond
to a single attribute. The right-hand side is restricted to simple single-valued assignment
expression while the left-hand-side may be a conjunction of simple expressions.
• General Impressions. In (Liu et al., 1997) general impression is proposed as an approach
for evaluating the importance of classification rules. It compares discovered rules to an
approximate or vague description of what is considered to be interesting. Thus a general
impression can be considered as a kind of specification language.
• Gago and Bento’s Distance Metric. The distance metric (Gago and Bento, 1998) measures
the distance between classification rules and is used to determine the rules that pro-
vide the highest coverage for the given data. The rules with the highest average distance
to the other rules are considered to be most interesting.
For additional discussion regarding interestingness measures please refer to Chapter 29.5
in this volume.


31.3 Association Rules
Mining rules is one of the main tasks in the Data Mining process. It has attracted considerable
interest because the rule provides a concise statement of potentially useful information that is
easily understood by the end-users.
There has been a lot of research in the field of association rule extraction, resulting in a variety
of algorithms that efficiently analyze data and extract rules from them. The extracted rules
have to satisfy user-defined thresholds related to association rule measures (such as
support, confidence, leverage, lift).
These measures give an indication of the association rules’ importance and confidence.
They may represent the predictive advantage of a rule and help to identify interesting patterns
of knowledge in data and make decisions. Below we shall briefly summarize these measures.
31 Quality Assessment Approaches in Data Mining 621
31.3.1 Association Rules Interestingness Measures
Let LHS → RHS be an association rule. Further we refer to the left hand side and the right
hand side of the rule as LHS and RHS respectively. Below, some of the best-known measures
of rule interestingness are presented (Han and Kamber, 2001, Berry and Linoff, 1996).
Coverage
The coverage of an association rule is the proportion of cases in the data that have the attribute
values or items specified on the Left Hand Side of the rule:
Coverage = n(LHS) / N    (31.5)
or
Coverage = P(LHS)
where N is the total number of cases under consideration and n(LHS) denotes the number of
cases covered by the Left Hand Side. Coverage takes values in [0,1]. An association rule with
coverage value near 1 can be considered as an interesting association rule.
Support
The support of an association rule is the proportion of all cases in the data set that satisfy a
rule, i.e. both LHS and RHS of the rule. More specifically, support is defined as
Support = n(LHS ∩ RHS) / N    (31.6)
or
Support = P(LHS ∩ RHS)
where N is the total number of cases under consideration and n(LHS ∩ RHS) denotes the
number of cases covered by both the Left Hand Side and the Right Hand Side.
Support can be considered as an indication of how often a rule occurs in a data set and, as a
consequence, of how significant the rule is.
Confidence
The confidence of an association rule is the proportion of the cases covered by LHS of the rule
that are also covered by RHS:
Confidence = n(LHS ∩ RHS) / n(LHS)    (31.7)
or
Confidence = P(LHS ∩ RHS) / P(LHS)
where n(LHS) denotes the number of cases covered by the Left Hand Side. Confidence takes
values in [0, 1]. A confidence value near 1 is an indication of an important association rule.
The interestingness measures discussed above, support and confidence, are widely used
in the association rule extraction process and are also known as Agrawal and Srikant’s itemset
measures. From their definitions, we could say that confidence corresponds to the strength of
a rule, while support corresponds to its statistical significance.
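As a minimal sketch, the three measures above can be computed directly from a list of transactions. The toy transactions and the rule {bread} → {butter} below are illustrative assumptions, not data from the text:

```python
# Coverage, support and confidence of a rule LHS -> RHS over a toy
# transaction list; each transaction is a set of items.
def rule_measures(transactions, lhs, rhs):
    n = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs <= t)         # cases covered by LHS
    n_both = sum(1 for t in transactions if lhs | rhs <= t)  # cases covered by LHS and RHS
    coverage = n_lhs / n                                     # P(LHS), Eq. 31.5
    support = n_both / n                                     # P(LHS and RHS), Eq. 31.6
    confidence = n_both / n_lhs if n_lhs else 0.0            # Eq. 31.7
    return coverage, support, confidence

transactions = [{"bread", "butter"}, {"bread"},
                {"bread", "butter", "milk"}, {"milk"}]
cov, sup, conf = rule_measures(transactions, {"bread"}, {"butter"})
# cov = 3/4, sup = 2/4, conf = 2/3 for this toy data
```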

Leverage
The leverage (MagnumOpus) of an association rule is the proportion of additional cases covered
by both the LHS and RHS above those expected if the LHS and RHS were independent of each
other. This is a measure of the importance of the association that includes both the confidence
and the coverage of the rule. More specifically, it is defined as
Leverage = P(LHS ∩ RHS) − P(LHS) · P(RHS)    (31.8)
Leverage takes values in [−1, 1]. Values of leverage equal to or below 0 indicate
independence between LHS and RHS. On the other hand, values of leverage near 1 are an
indication of an important association rule.
Lift
The lift of an association rule is the confidence divided by the proportion of all cases that
are covered by the RHS. This is a measure of the importance of the association and it is
independent of coverage.
Lift = P(LHS ∩ RHS) / (P(LHS) · P(RHS))    (31.9)
or
Lift = Confidence / P(RHS)
It takes values in R+ (the space of positive real numbers). Based on the values of lift
we get the following inferences about rule interestingness:
1. lift → 1 means that RHS and LHS are independent, which indicates that the rule is not
interesting.
2. lift values close to +∞. Here, we have the following sub-cases:
• RHS ⊆ LHS or LHS ⊆ RHS. If either of these cases is satisfied, we may conclude that
the rule is not interesting.
• P(RHS) is close to 0 or P(RHS|LHS) is close to 1. The first case indicates that the
rule is not important. On the other hand, the second case is a good indication that the
rule is an interesting one.
3. lift = 0 means that P(RHS|LHS) = 0 ⇔ P(RHS ∩ LHS) = 0, which indicates that the
rule is not important.
Further discussion on interestingness measures
Based on the definition of the association rules and the related measures it is obvious that
support is an indication of the rule’s importance based on the amount of data that support it.
For instance, assuming the rule A → B, a high support of the rule is an indication that a high
number of tuples contain both the left-hand side and the right-hand side of this rule, and thus
it can be considered a representative rule of our data set. Moreover, confidence expresses our
confidence, based on the available data, that when the left-hand side of the rule occurs, the
right-hand side also occurs.
Though support and confidence are useful for mining association rules in many applications,
they can mislead us in some cases. Based on the support-confidence framework a rule can be
identified as interesting even though the occurrence of A does not imply the occurrence of
B. In such cases lift and leverage can be considered as alternative interestingness measures,
also giving an indication of the correlation of LHS and RHS.
Lift is another measure that may give an indication of rule significance, i.e. of how
interesting the rule is. Lift represents the predictive advantage a rule offers over simply guessing
based on the frequency of the rule’s consequent (RHS). Thus, lift may indicate
whether a rule can be considered representative of the data and used in the process
of decision-making (Roberto et al., 1999). For instance, consider a rule A + B → G with confidence
85% and support(G) = 90%. Due to the high confidence of the rule we may conclude that it
is a significant rule. On the other hand, the right-hand side of the rule represents 90% of the
studied data, that is, a high proportion of the data contains G. Then the rule may not be very
interesting, since there is a high probability that the right-hand side of the rule (G) is satisfied
by our data anyway. More specifically, the rule may be satisfied by a high percentage of the data
under consideration, but at the same time the consequent of the rule (RHS) has high support.
As a consequence, this rule may not be useful for making decisions or extracting a general rule
about the behavior of the data. Finally, leverage expresses the over-representation of the
rule relative to its expected representation in the data set if there were no interaction between
LHS and RHS.
A similar measure, conviction, was proposed by Brin et al. (Brin et al., 1997). It is formally
defined as
Conviction = (1 − P(RHS)) / (1 − Confidence)    (31.10)
Both lift and conviction are monotone in the confidence.
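Under the same kind of toy-transaction assumption as before, leverage, lift and conviction can be sketched as follows; the transactions and the rule are illustrative, and `float("inf")` stands in for the conviction of an exact rule:

```python
# Leverage, lift and conviction of a rule LHS -> RHS over a toy
# transaction list; each transaction is a set of items.
def rule_strength(transactions, lhs, rhs):
    n = len(transactions)
    p_lhs = sum(1 for t in transactions if lhs <= t) / n
    p_rhs = sum(1 for t in transactions if rhs <= t) / n
    p_both = sum(1 for t in transactions if lhs | rhs <= t) / n
    confidence = p_both / p_lhs
    leverage = p_both - p_lhs * p_rhs              # Eq. 31.8
    lift = p_both / (p_lhs * p_rhs)                # Eq. 31.9
    conviction = (float("inf") if confidence == 1  # Eq. 31.10
                  else (1 - p_rhs) / (1 - confidence))
    return leverage, lift, conviction

transactions = [{"bread", "butter"}, {"bread"},
                {"bread", "butter", "milk"}, {"milk"}]
lev, lif, conv = rule_strength(transactions, {"bread"}, {"butter"})
```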
31.3.2 Other approaches for evaluating association rules
Some other well-known approaches and measures for evaluating association rules are:
• Rule templates are used to describe a pattern for those attributes that can appear in the
left- or right-hand side of an association rule. A rule template may be either inclusive or
restrictive. An inclusive rule template specifies desirable rules that are considered to be
interesting. On the other hand a restrictive rule template specifies undesirable rules that are
considered to be uninteresting. Rule pruning can be done by setting support, confidence
and rule size thresholds.
• Dong and Li’s interestingness measure (Dong and Li, 1998) is used to evaluate the impor-
tance of an association rule by considering its unexpectedness in terms of other association
rules in its neighborhood. The neighborhood of an association rule consists of association
rules within a given distance.
• Gray and Orlowska’s Interestingness (Gray and Orlowska, 1998) is used to evaluate the
confidence of associations between sets of items in the extracted association rules. Though
support and confidence have been shown to be useful for characterizing association rules,
interestingness contains a discriminator component that gives an indication of the inde-
pendence of the antecedent and consequent.
• Peculiarity (Zhong et al., 1999) is a distance-based measure of rule interestingness. It

is used to determine the extent to which one data object differs from other similar data
objects.
• Closed Association Rules Mining. It is widely recognized that the larger the set of frequent
itemsets, the more association rules are presented to the user, many of which turn out to be
redundant. However it is not necessary to mine all frequent itemsets to guarantee that all
non-redundant association rules will be found. It is sufficient to consider only the closed
frequent itemsets (Zaki and Hsiao, 2002, Pasquier et al., 1999, Pei et al., 2000). The set
of closed frequent itemsets can guarantee completeness even in dense domains and all
non-redundant association rules can be defined on it. CHARM is an efficient algorithm
for closed association rules mining.
31.4 Cluster Validity
Clustering is a major task in the Data Mining process for discovering groups and identifying
interesting distributions and patterns in the underlying data (Fayyad et al., 1996). Thus, the
main problem in the clustering process is to reveal the organization of patterns into ‘sensible’
groups, which allow us to discover similarities and differences, as well as to derive useful
inferences about them (Guha et al., 1999).
In the literature a wide variety of algorithms have been proposed for different applications
and sizes of data sets (Han and Kamber, 2001), (Jain et al., 1999). Assuming that a data set
exhibits a clustering tendency, the application of a clustering algorithm to it aims at discovering
the data’s inherent partitions. However, the clustering process is perceived as an unsupervised
process, since there are no predefined classes and no examples that would show what kind of
desirable relations should be valid among the data (Berry and Linoff, 1996). Then, the various
clustering algorithms are based on some assumptions in order to define a partitioning of a data
set. As a consequence, they may behave in a different way depending on: i) the features of the
data set (geometry and density distribution of clusters) and ii) the input parameter values.
A problem that we face in clustering is to decide the optimal number of clusters into which
our data can be partitioned. In most algorithms’ experimental evaluations 2D-data sets are used
in order that the reader is able to visually verify the validity of the results (i.e. how well the
clustering algorithm discovered the clusters of the data set). It is clear that visualization of the
data set is a crucial verification of the clustering results. In the case of large multidimensional
data sets (e.g. more than three dimensions) effective visualization of the data set would be
difficult. Moreover the perception of clusters using available visualization tools is a difficult
task for humans that are not accustomed to higher dimensional spaces.
As a consequence, if the clustering algorithm parameters are assigned an improper value,
the clustering method may result in a partitioning scheme that is not optimal for the specific
data set leading to wrong decisions. The problems of deciding the number of clusters (i.e.
partitioning) better fitting a data set as well as the evaluation of the clustering results has been
the subject of several research efforts (Dave, 1996, Gath and Geva, 1989, Theodoridis and
Koutroubas, 1999, Xie and Beni, 1991).
We shall now discuss the fundamental concepts of clustering validity. Furthermore, we
present the external and internal criteria in the context of clustering validity assessment while
the relative criteria will be discussed in Section 31.4.4.
31.4.1 Fundamental Concepts of Cluster Validity
The procedure of evaluating the results of a clustering algorithm is known under the term
cluster validity. In general terms, there are three approaches to investigating cluster validity
(Theodoridis and Koutroubas, 1999). The first is based on external criteria. This implies that
we evaluate the results of a clustering algorithm based on a pre-specified structure, which is
imposed on a data set and reflects our intuition about the clustering structure of the data set.
The second approach is based on internal criteria. The results of a clustering algorithm are
Fig. 31.1. Confidence interval for (a) a two-tailed index, (b) a right-tailed index, (c) a left-tailed
index, where q_ρ is the ρ proportion of q under hypothesis Ho
evaluated in terms of quantities that involve the data themselves (e.g. proximity matrix). The
third approach of clustering validity is based on relative criteria. Here the basic idea is the
evaluation of a clustering structure by comparing it to other clustering schemes, produced by
the same algorithm but with different parameter values.
The first two approaches are based on statistical tests and their major drawback is their
high computational cost. Moreover, the indices related to these approaches aim at measuring
the degree to which a data set confirms an a priori specified scheme. On the other hand, the
third approach aims at finding the best clustering scheme that a clustering algorithm can define
under certain assumptions and parameters.
External and Internal Validity Indices
In this section, we discuss methods suitable for the quantitative evaluation of the clustering
results, known as cluster validity methods. However, these methods give an indication of the
quality of the resulting partitioning and thus they can only be considered as a tool at the
disposal of the experts in order to evaluate the clustering results.
The cluster validity approaches based on external and internal criteria rely on statistical
hypothesis testing. In the following section, an introduction to the fundamental concepts of
hypothesis testing in cluster validity is presented.
Hypothesis Testing in Cluster Validity
In cluster validity the basic idea is to test whether the points of a data set are randomly
structured or not. This analysis is based on the Null Hypothesis, denoted as Ho, expressed
as a statement of random structure of a data set X. To test this hypothesis we use statistical
tests, which lead to a computationally complex procedure. Monte Carlo techniques, discussed
below, are used as a solution to this high computational cost (Theodoridis and Koutroubas,
1999).
How Monte Carlo is used in Cluster Validity
The goal of using Monte Carlo techniques is the computation of the probability density
function (pdf) of the validity indices. They rely on simulating the process of estimating the
pdf of a validity index using a sufficient number of computer-generated data sets. First, a large
number of synthetic data sets is generated by a normal distribution. For each one of these
synthetic data sets, called X_i, the value of the defined index, denoted q_i, is computed. Then,
based on the respective values of q_i for each of the data sets X_i, we create a scatter-plot. This
scatter-plot is an approximation of the probability density function of the index. Figure 31.1
depicts the three possible shapes of the probability density function of an index q. There
are three different possible shapes depending on the critical interval D̄_ρ, corresponding to
significance level ρ. The probability density function of a statistic index q,
under Ho, has a single maximum and the critical region is either a half line, or a union of two half
lines (Theodoridis and Koutroubas, 1999). Assuming that the scatter-plot has been generated
using r values of the index q, called q_i, in order to accept or reject the Null Hypothesis Ho we
examine the following conditions:
if the shape is right-tailed (Figure 31.1b):
    if q's value for our data set is greater than (1 − ρ)·r of the q_i values, then Reject Ho; else Accept Ho.
else if the shape is left-tailed (Figure 31.1c):
    if q's value for our data set is smaller than ρ·r of the q_i values, then Reject Ho; else Accept Ho.
else if the shape is two-tailed (Figure 31.1a):
    if q's value is greater than (ρ/2)·r of the q_i values and smaller than (1 − ρ/2)·r of the q_i values, then Accept Ho.
31.4.2 External Criteria
Based on external criteria we can work in two different ways. Firstly, we can evaluate the
resulting clustering structure C, by comparing it to an independent partition of the data P
built according to our intuition about the clustering structure of the data set. Secondly, we can
compare the proximity matrix P to the partition P.
Comparison of C with partition P (non- hierarchical clustering)
Let C = {C_1, ..., C_m} be a clustering structure of a data set X and P = {P_1, ..., P_s} be a defined
partition of the data. We refer to a pair of points (x_v, x_u) from the data set using the following
terms:
• SS: if both points belong to the same cluster of the clustering structure C and to the same
group of partition P.
• SD: if points belong to the same cluster of C and to different groups of P.
• DS: if points belong to different clusters of C and to the same group of P.
• DD: if both points belong to different clusters of C and to different groups of P.
Assuming now that a, b, c and d are the numbers of SS, SD, DS and DD pairs respectively,
then a + b + c + d = M, which is the maximum number of all pairs in the data set (that is,
M = N·(N − 1)/2, where N is the total number of points in the data set).
Now we can define the following indices to measure the degree of similarity between C
and P:
1. Rand Statistic: R = (a + d)/M
2. Jaccard Coefficient: J = a/(a + b + c)
The above two indices range between 0 and 1, and are maximized when m = s. Another
index is the:
3. Folkes and Mallows index:
FM = a / sqrt(m_1 · m_2) = sqrt( (a/(a + b)) · (a/(a + c)) )    (31.11)
where m_1 = a + b and m_2 = a + c. For the previous three indices it has been shown that
high values indicate great similarity between C and P: the higher the values of these
indices, the more similar C and P are. Other indices are:
4. Hubert’s Γ statistic:
Γ = (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} X(i, j) · Y(i, j)    (31.12)
High values of this index indicate a strong similarity between the matrices X and Y.
5. Normalized Γ statistic:
Γ̂ = [ (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} (X(i, j) − μ_X)(Y(i, j) − μ_Y) ] / (σ_X · σ_Y)    (31.13)
where X(i, j) and Y(i, j) are the (i, j) elements of the matrices X and Y respectively that we
wish to compare, and μ_X, μ_Y, σ_X, σ_Y are the respective means and standard deviations
of the X and Y matrices. This index takes values between −1 and 1.
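A sketch of the pair-counting indices above, assuming two flat label vectors over the same points stand in for the clustering C and the predefined partition P:

```python
from itertools import combinations
from math import sqrt

# Count the SS, SD, DS, DD pairs for two labelings of the same points, then
# compute the Rand, Jaccard and Folkes and Mallows indices. Assumes at
# least one same-cluster pair exists in each labeling (a + b > 0, a + c > 0).
def external_indices(c_labels, p_labels):
    a = b = c = d = 0
    for i, j in combinations(range(len(c_labels)), 2):
        same_c = c_labels[i] == c_labels[j]
        same_p = p_labels[i] == p_labels[j]
        if same_c and same_p:
            a += 1          # SS pair
        elif same_c:
            b += 1          # SD pair
        elif same_p:
            c += 1          # DS pair
        else:
            d += 1          # DD pair
    m = a + b + c + d       # M = N(N-1)/2
    rand = (a + d) / m
    jaccard = a / (a + b + c)
    fm = a / sqrt((a + b) * (a + c))   # Eq. 31.11
    return rand, jaccard, fm
```

Identical labelings give all three indices the value 1; disagreement pushes them toward 0.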
All these statistics have right-tailed probability density functions, under the random hy-
pothesis. In order to use these indices in statistical tests we must know their respective prob-
ability density function under the Null Hypothesis, Ho, which is the hypothesis of random
structure of our data set. This means that using statistical tests, if we accept the Null Hy-
pothesis then our data are randomly distributed. However, the computation of the probability
density function of these indices is computationally expensive. A solution to this problem is
to use Monte Carlo techniques. The procedure is as follows:
Algorithm 1: Monte Carlo Algorithm
1: for i = 1 to r do
2:   Generate a data set X_i with N vectors (points) in the area of X (i.e. having the same dimension as those of the data set X).
3:   Assign each vector y_{j,i} of X_i to the group that x_j ∈ X belongs to, according to the partition P.
4:   Run the same clustering algorithm used to produce structure C for each X_i, and let C_i be the resulting clustering structure.
5:   Compute the value q(C_i) of the defined index q for P and C_i.
6: end for
7: Create a scatter-plot of the r validity index values q(C_i) (computed in the for loop).
After having plotted the approximation of the probability density function of the defined
statistic index, its value, denoted by q, is compared to the q(C_i) values, further referred to as
q_i. The indices R, J, FM and Γ defined previously are used as the q index mentioned in the
above procedure.
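Algorithm 1 can be sketched as below; `cluster` and `index_q` are placeholders (assumptions) for the clustering algorithm that produced C and for one of the indices R, J, FM, Γ, and uniform sampling over the bounding box of X stands in for "N vectors in the area of X":

```python
import random

# Estimate the distribution of an external validity index q under the null
# hypothesis of random structure (Algorithm 1). Each synthetic point keeps
# the partition-P group of the original point at the same position, so the
# partition labels carry over unchanged (step 3).
def monte_carlo_index_values(data, partition, cluster, index_q, r=100, seed=0):
    rng = random.Random(seed)
    lo = [min(col) for col in zip(*data)]   # bounding box of the data set X
    hi = [max(col) for col in zip(*data)]
    values = []
    for _ in range(r):
        # step 2: generate X_i with N random vectors in the area of X
        x_i = [[rng.uniform(l, h) for l, h in zip(lo, hi)] for _ in data]
        c_i = cluster(x_i)                     # step 4: same algorithm as for C
        values.append(index_q(c_i, partition))  # step 5: q(C_i) against P
    return values                           # step 7: scatter-plot these values
```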
Comparison of P (proximity matrix) with partition P
Let P be the proximity matrix of a data set X and P be its partitioning. Partition P can be
considered as a mapping
g : X → {1, ..., n_c}
where n_c is the number of clusters.
Assuming the matrix Y is defined as:
Y(i, j) = 1 if g(x_i) = g(x_j), 0 otherwise
The Γ (or normalized Γ) statistic index can then be computed using the proximity matrix P
and the matrix Y. Based on the index value, we may have an indication of the two matrices’
similarity.
To proceed with the evaluation procedure we use the Monte Carlo techniques mentioned
above. In the ‘Generate’ step of the procedure the corresponding mapping g_i is generated for
every generated data set X_i. Then, in the ‘Compute’ step, the matrix Y_i is computed for each X_i
in order to find the corresponding Γ_i statistic index.
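A sketch of Hubert's Γ statistic between a proximity matrix and the matrix Y built from a partition; the 3-point proximity matrix in the usage below is a toy assumption, and Y follows the g(x_i) = g(x_j) convention used in this subsection:

```python
# Build Y from partition labels (Y(i, j) = 1 when both points fall in the
# same group) and compute Hubert's Gamma statistic of Eq. 31.12 between it
# and a symmetric proximity matrix, summing over the upper triangle.
def y_from_labels(labels):
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

def hubert_gamma(x, y):
    n = len(x)
    m = n * (n - 1) // 2                       # M = N(N-1)/2
    total = sum(x[i][j] * y[i][j]
                for i in range(n - 1) for j in range(i + 1, n))
    return total / m
```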
31.4.3 Internal Criteria
Using this approach of cluster validity the goal is to evaluate the clustering result of an
algorithm using only quantities and features inherited from the data set. There are two cases
in which we apply internal criteria of cluster validity depending on the clustering structure: a)
hierarchy of clustering schemes, and b) single clustering scheme.
Validating hierarchy of clustering schemes
A matrix called the cophenetic matrix, P_c, can represent the hierarchy diagram that is produced
by a hierarchical algorithm. The element P_c(i, j) of the cophenetic matrix represents the
proximity level at which the two vectors x_i and x_j are found in the same cluster for the first
time. We may define a statistical index to measure the degree of similarity between the P_c and
P (proximity) matrices. This index is called the Cophenetic Correlation Coefficient and is
defined as:
CPCC = [ (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} d_ij · c_ij − μ_P · μ_C ] /
       sqrt( [ (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} (d_ij² − μ_P²) ] · [ (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} (c_ij² − μ_C²) ] )    (31.14)
where M = N·(N − 1)/2 and N is the number of points in the data set. Also, μ_P and μ_C are the
means of the matrices P and P_c respectively, defined as follows:
μ_P = (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} P(i, j),   μ_C = (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} P_c(i, j)    (31.15)
Moreover, d_ij and c_ij are the (i, j) elements of the P and P_c matrices respectively. CPCC takes
values between −1 and 1. A value of the index close to 1 is an indication of a significant similarity
between the two matrices. The procedure of the Monte Carlo techniques described above is
also used in this case of validation.
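The CPCC computation of Eqs. 31.14 and 31.15 can be sketched over the upper triangles of the two matrices; the matrices used in the usage below are toy assumptions:

```python
from math import sqrt

# Cophenetic Correlation Coefficient between a proximity matrix p and a
# cophenetic matrix pc (Eq. 31.14); mu_p and mu_c follow Eq. 31.15. Both
# inputs are square symmetric matrices over the same N points.
def cpcc(p, pc):
    n = len(p)
    pairs = [(i, j) for i in range(n - 1) for j in range(i + 1, n)]
    m = len(pairs)                          # M = N(N-1)/2
    d = [p[i][j] for i, j in pairs]         # d_ij
    c = [pc[i][j] for i, j in pairs]        # c_ij
    mu_p = sum(d) / m
    mu_c = sum(c) / m
    num = sum(di * ci for di, ci in zip(d, c)) / m - mu_p * mu_c
    var_p = sum(di * di for di in d) / m - mu_p ** 2
    var_c = sum(ci * ci for ci in c) / m - mu_c ** 2
    return num / sqrt(var_p * var_c)
```

When the cophenetic distances are a positive multiple of the proximities, the coefficient is exactly 1.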

Validating a single clustering scheme
The goal here is to find the degree of match between a given clustering scheme C, consisting
of n_c clusters, and the proximity matrix P. The defined index for this approach is Hubert’s Γ
statistic (or normalized Γ statistic). An additional matrix Y is used for the computation of the
index, defined as
Y(i, j) = 1 if x_i and x_j belong to different clusters, 0 otherwise
where i, j = 1, ..., N.
The application of Monte Carlo techniques is also the way to test the random hypothesis
in a given data set.
31.4.4 Relative Criteria
The basis of the above described validation methods is statistical testing. Thus, the major draw-
back of techniques based on internal or external criteria is their high computational demands.
A different validation approach is discussed in this section. It is based on relative criteria and
does not involve statistical tests. The fundamental idea of this approach is to choose the best
clustering scheme of a set of defined schemes according to a pre-specified criterion. More
specifically, the problem can be stated as follows:
Let P_alg be the set of parameters associated with a specific clustering algorithm (e.g. the
number of clusters n_c). Among the clustering schemes C_i, i = 1, ..., n_c, defined by a specific
algorithm for different values of the parameters in P_alg, choose the one that best fits the data
set.
Then, we can consider the following cases of the problem:
1. P_alg does not contain the number of clusters, n_c, as a parameter. In this case, the
choice of the optimal parameter values is described as follows: the algorithm runs for a
wide range of its parameters’ values and the largest range for which n_c remains constant
is selected (usually n_c << N, the number of tuples). Then the values that correspond to the
middle of this range are chosen as appropriate values of the P_alg parameters. This
procedure also identifies the number of clusters that underlie our data set.
2. P_alg contains n_c as a parameter. The procedure of identifying the best clustering scheme
is based on a validity index. Selecting a suitable performance index q, we proceed with
the following steps:
• The clustering algorithm runs for all values of n_c between a minimum nc_min and a
maximum nc_max. The minimum and maximum values have been defined a priori by
the user.
• For each value of n_c, the algorithm runs r times, using different sets of values
for the other parameters of the algorithm (e.g. different initial conditions).
• The best value of the index q obtained for each n_c is plotted as a function of n_c.
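The steps above can be sketched as follows; `cluster` and `index_q` are placeholders (assumptions) for the clustering algorithm and the chosen performance index, and "best" is taken here to mean the maximum value:

```python
import random

# Run the clustering algorithm for every n_c in [nc_min, nc_max], r times
# with different initial conditions, and keep the best index value per n_c;
# plotting best[n_c] against n_c is then used to pick the best scheme.
def best_index_per_nc(data, cluster, index_q, nc_min, nc_max, r=10, seed=0):
    rng = random.Random(seed)
    best = {}
    for nc in range(nc_min, nc_max + 1):
        # each run gets a different random initial condition
        runs = [index_q(cluster(data, nc, rng.random())) for _ in range(r)]
        best[nc] = max(runs)
    return best
```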
Based on this plot we may identify the best clustering scheme. We have to stress that
there are two approaches for defining the best clustering, depending on the behavior of q with
respect to n_c. Thus, if the validity index does not exhibit an increasing or decreasing trend
as n_c increases, we seek the maximum (minimum) of the plot. On the other hand, for indices