1.3 Benefits of data mining

Table 1.1 Illustrative Data Mining Best Practices Drawn from Media Reports
Profitability and risk reduction
Profitability and risk reduction efforts use data mining to identify the attributes of the best
customers—to characterize customers through time so as to target the
appropriate customer with the appropriate product at the appropriate time. Risk
reduction approaches match the discovery of poor risk characteristics against cus-
tomer loan applications. This may suggest that some risk management procedures are
not necessary with certain customers—a profit maximization move. It may also sug-
gest which customers require special processing.
As can be expected, financial companies are heavy users of data mining to improve
profitability and reduce risk. Home Savings of America FSB, Irwindale, CA, the
nation’s largest savings and loan company, analyzes mortgage delinquencies, foreclo-
sures, sales activity, and even geological trends over five years to drive risk pricing.
According to Susan Osterfeldt, senior vice president of strategic technologies at
NationsBank Services Co., “We’ve been able to use a neural network to build models
that reduce the time it takes to process loan approvals. The neural networks speed
processing. A human has to do almost nothing to approve it once it goes through the
model.”
Loyalty management and cross-selling
Cross-selling relies on identifying new prospects based on a match of their character-
istics with known characteristics of existing customers who have been and still are sat-
isfied with a given product. Reader’s Digest does analysis of cross-selling opportunities
to see if a promotional activity in one area is likely to respond to needs in another
area so as to meet as many customer needs as possible.
This is a cross-sell application that involves assessing the profile of likely purchasers
of a product and matching that profile to other products to find similarities in the
portfolio. Cross-selling and customer relationship management are treated exten-
sively in Mastering Data Mining (Berry and Linoff, 2000) and Building Data Mining
Applications for CRM (Berson, Smith, and Thearling).


Operational analysis and optimization
Operational analysis encompasses the ability to merge corporate purchasing systems
to review and manage global expenditures and to detect spending anomalies. It also
includes the ability to capture and analyze operational patterns in successful branch
locations, so as to compare and apply lessons learned to other branches.
American Express is using a data warehouse and data mining techniques to reduce
unnecessary spending, leverage its global purchasing power, and standardize equip-
ment and services in its offices worldwide. In the late 1990s, American Express began
merging its worldwide purchasing system, corporate purchasing card, and corporate
card databases into a single Microsoft SQL Server database. The system allows Amer-
ican Express to pinpoint, for example, employees who purchase computers or other
capital equipment with corporate credit cards meant for travel and entertainment. It
also eliminates what American Express calls “contract bypass”—purchases from ven-
dors other than those the company has negotiated with for discounts in return for
guaranteed purchase levels.
American Express uses Quest, from New York–based Information Builders, to score
the best suppliers according to 24 criteria, allowing managers to perform best-fit
analyses and trade-off analyses that balance competing requirements. By monitoring
purchases and vendor performance, American Express can address quality, reliability,
and other issues with IBM, Eastman Kodak Co., and various worldwide vendors.
According to an American Express senior vice president, “Many of the paybacks from
data mining, even at this early stage, will result from our increased buying power,
fewer uncontrolled expenses, and improved supplier responsiveness.”

Relationship marketing
Relationship marketing includes the ability to consolidate customer data records so as
to form a high-level composite view of the customer. This enables the production of
individualized newsletters. This is sometimes called “relationship billing.”
American Express has invested in a massively parallel processor, which allows it to
vastly expand the profile of every customer. The company can now store every trans-
action. Seventy workstations at the American Express Decision Sciences Center in
Phoenix, AZ, look at data about millions of AmEx card members—the stores they
shop in, the places they travel to, the restaurants they’ve eaten in, and even economic
conditions and weather in the areas where they live. Every month, AmEx uses that
information to send out precisely aimed offers. AmEx has seen an increase of 15 per-
cent to 20 percent in year over year card member spending in its test market and
attributes much of the increase to this approach.
Customer attrition and churn reduction
Churn reduction aims to reduce the attrition of valuable customers. It also aims to
reduce the attraction and subsequent loss of customers through low-cost, low-margin
recruitment campaigns, which, over the life cycle of the affected customer, may cost
more to manage than the income produced by the customer.
Mellon Bank of Pittsburgh is using Intelligent Miner to analyze data on the bank’s
existing credit card customers to characterize their behavior and predict, for example,
which customers are most likely to take their business elsewhere. “We decided it was
important for us to generate and manage our own attrition models,” said Peter
Johnson, vice president of the Advanced Technology Group at Mellon Bank.
Fraud detection
Fraud detection is the analysis of fraudulent transactions in order to identify the sig-
nificant characteristics that distinguish a potentially fraudulent activity from a normal
activity.
Another strategic benefit of Capital One’s data mining capabilities is fraud detection.
In 1995, for instance, Visa and MasterCard’s U.S. losses from fraud totaled $702
million. Although Capital One will not discuss its fraud detection efforts specifically,
it noted that its losses from fraud declined more than 50 percent last year, in part due

to its proprietary data mining tools and San Diego–based HNC Software Inc.’s Fal-
con, a neural network–based credit card fraud detection system.
Campaign management
IBM's DecisionEdge campaign management module is designed to help businesses
personalize marketing messages and pass them to clients through direct mail, tele-
marketing, and face to face interactions. The product works with IBM’s Intelligent
Miner for Relationship Marketing.
Among the software’s features is a load-management tool, which lets companies give
more lucrative campaigns priority status. “If I can only put out so many calls from
my call center today, I want to make sure I make the most profitable ones,” said
David Raab at the analyst firm Raab Associates. “This feature isn’t present in many
competing products,” he said.
IBM’s DecisionEdge campaign management module is designed to help businesses
personalize marketing messages and pass them to clients through direct mail, tele-
marketing, and face to face interactions. The product works with IBM’s Intelligent
Miner for Relationship Marketing.
Among the software’s features is a load-management tool, which lets companies give
more lucrative campaigns priority status. “If I can only put out so many calls from
my call center today, I want to make sure I make the most profitable ones,” said
David Raab at the analyst firm Raab Associates. “This feature isn’t present in many
competing products,” he said.
Business-to-business/channel, inventory, and supply chain management
The Zurich Insurance Group, a global, Swiss-based insurer, uses data mining to ana-
lyze broker performance in order to increase the efficiency and effectiveness of its
business-to-business channel. Its primary utility is to look at broker performance rel-
ative to past performance and to predict future performance.

Supply chains and inventory management are expensive operational overheads. In
terms of sales and sales forecasting, price is only one differentiator. Others include
product range and image, as well as the ability to identify trends and patterns ahead
of the competition. A large European retailer, using a data warehouse and data min-
ing tools, spotted an unexpected downturn in sales of computer games just before
Christmas. The retailer canceled a large order and watched the competition stockpile
unsold computer games.
Superbrugsen, a leading Danish supermarket chain, uses data mining to optimize
every single product area, and product managers must therefore have as much rele-
vant information as possible to assist them when negotiating with suppliers to obtain
the best prices.
Marks and Spencer uses customer profiling to determine what messages to send to
certain customers. In the financial services area, for example, data mining is used to
determine the characteristics of customers who are most likely to respond to a credit
offer.
Market research, product conceptualization
Blue Cross/Blue Shield is one of the largest health care providers in the United States.
The organization provides analysts with financial, enrollment, market penetration, and
provider network information. This yields enrollment, new product development,
sales, market segment, and group size estimates for marketing and sales support.
Located in Dallas, TX, Rapp Collins is the second largest market research organiza-
tion in the United States. It provides a wide range of marketing-related services. One
involves applications that measure the effectiveness of reward incentive programs.
Data mining is a core technology used to identify the many factors that influence
attraction to incentives.
J. D. Power and Associates, located in Agoura Hills, CA, produces a monthly forecast

of car and truck sales for about 300 different vehicles. Their specialty is polling the
customer after the sale regarding the purchase experience and the product itself. Fore-
casts are driven by sales data, economic data, and data about the industry. Data min-
ing is used to sort through these various classes of data to produce effective
forecasting models.
Product development, engineering and quality control
Quality management is a significant application area for data mining. In the manu-
facturing area, the closer a defect is detected to its source, the eas-
ier—and less costly—it is to fix. So there is a strong emphasis on measuring progress
through the various steps of manufacturing in order to find problems sooner rather
than later. Of course, this means huge amounts of data are generated on many, many
measurement points. This is an ideal area for data mining:
 Hewlett-Packard has used data mining to sort out a perplexing problem with a
color printer that periodically produced fuzzy images. It turned out the problem
was in the alignment of the lenses that blended the three primary colors to pro-
duce the output. The problem was caused by variability in the glue curing process
that affected only one of the lenses. Data mining was used to find which lens, under
what curing circumstances, produced the fuzzy printing resolution.
 R. R. Donnelley and Sons is the largest printing company in the United States.
Their printing presses include rollers that weigh several tons and spit out results at
the rate of 1,000 feet per minute. The plant experienced an occasional problem
with the print quality, caused by a collection of ink on the rollers called “band-
ing.” A task force was struck to find the cause of the problem. One of the task
force members, Bob Evans, used data mining to sort through thousands of fields
of data related to press performance in order to find a small subset of variables
that, in combination, could be used to predict the banding problem. His work is
published in the February 1994 issue of IEEE Expert and the April 1997 issue of
Database Programming & Design.

1.4 Microsoft’s entry into data mining
Obviously, data mining is not just a back-room, scientific type of activity
anymore. Just as document preparation software and row/column–oriented
workbooks make publishers and business planners of us all, so too are we
sitting on the threshold of a movement that will bring data mining—inte-
grated with OLAP—to the desktop. What is the Microsoft strategy to
achieve this?
Microsoft is setting out to solve three perceived problems:
1. Data mining tools are too expensive.
2. Data mining tools are not integrated with the underlying data-
base.
3. Data mining algorithms, in general, reflect their scientific roots
and, while they work well with small collections of data, do not
scale well with the large gigabyte- and terabyte-size databases of
today’s business environment.
Microsoft’s strategy to address these problems revolves around three
thrusts:
1. Accessibility. Make complex data operations accessible and avail-
able to nonprofessionals, by generalizing the accessibility and low-
ering the cost.
2. Seamless reporting. Promote access and usability by providing a
common data reporting paradigm through simple to complex
business queries.
3. Scalability. To ensure access to data operations across increasingly
large collections of data, provide an integration layer between the
data mining algorithms and the underlying database.
Integration with the database engine occurs in three ways:
1. Perform preprocessing functionality in the database, thus provid-
ing native database access to sophisticated and heretofore special-
ized data cleaning, transforming, and preparation facilities.
2. Provide a core set of data mining algorithms directly in the data-
base and provide a broadly accessible application programming
interface (API) to ensure easy integration of external data mining
algorithms.
3. Provide a deployment mechanism to ensure that modeling results
can be readily built into other applications—both on the server
and on the desktop—and to break down business process barriers
to effective data mining results utilization.
Figure 1.3 shows the development of the current Microsoft architectural
approach to data mining, as Microsoft migrated from the SQL Server 7
release to the SQL Server 2000 release.
One message from this figure is that data mining, as with OLAP and ad
hoc reports before it, is just another query function—albeit a rather super
query. Whereas in the past an end user might ask for a sales by region
report, in the Microsoft world of data mining the query now becomes:
Show me the main factors that were driving my sales results last period. In
this way, one query can trigger millions—even trillions—of pattern match-
ing and search operations to find the optimal results. Often many results
will be produced for the reader to view. However, before long, many reader
models of the world will be solicited and presented—all in template style—
so that more and more preprocessing will take place to ensure that the
appropriate results are presented for display (and to cut down on the
amount of pattern searching and time required to respond to a query).
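
To make the idea of the "super query" concrete, the following is a minimal, illustrative sketch of what such a request might reduce to under the hood: rank the candidate factors by how much they explain the sales outcome. The column names and the use of a scikit-learn decision tree as a stand-in for the server-side algorithms are assumptions, not Microsoft's implementation.

    # Illustrative sketch only (not Microsoft's implementation): given a table of
    # last period's results, rank the factors that best explain sales. Column
    # names such as "region", "promo_spend", and "sales" are hypothetical.
    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

    def main_sales_factors(history: pd.DataFrame, target: str = "sales") -> pd.Series:
        """Fit a shallow decision tree and return factors ranked by importance."""
        X = pd.get_dummies(history.drop(columns=[target]))  # encode categorical factors
        y = history[target]
        model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
        return pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

    # Usage (assuming last_period is a DataFrame of sales records):
    # print(main_sales_factors(last_period).head())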
1.5 Concept of operations
Figure 1.3 SQL Server development path for data mining. Elements shown: SQL Server 7; MDX for OLAP; OLE DB for DM (data mining) with DMX data mining expressions; segmentation (clustering); prediction; cross-sell; SQL Server 2000; Commerce Server; Analysis Services.

As can be seen in Figure 1.3, the data mining component belongs to the DB query engine (DMX expressions). With the growth—depth and breadth—of data sources, it is clear that data mining algorithmic work belongs on the
server (shown in the figure as Commerce Server). We can also see that the
core data mining algorithms include segmentation capabilities and associ-
ated description and prediction facilities and cross-selling components. This
particular thrust has a decidedly e-commerce orientation, since cross-sell,
prediction, and segmentation are important e-commerce customer relation-
ship management functions.
Whatever algorithms are not provided on board will be provided
through a common API, which extends the OLE DB for data access con-
vention to include data mining extensions.
The Socrates project, formed to develop the Microsoft approach to data
mining, is a successor to the Plato Group (the group that built the
Microsoft OLAP services SQL Server 7 functionality). Together with the
Database Research Group, they are working on data mining concepts for

the future. Current projects this group is looking at include the following:
 It is normal to view the database or data warehouse as a data snap-
shot, frozen in time (the last quarter, last reporting period, and so
on). Data change through time, however, and this change requires the
mining algorithms to look at sequential data and patterns.
 Most of the world’s data are not contained as structured data but as
relatively unstructured text. In order to harvest the knowledge con-
tained in this source of data, text mining is required.
 There are many alternative ways of producing segmentations. One of
the most popular is K-means clustering. Microsoft is also exploring
other methods—based on expectation maximization—that will pro-
vide more reliable clusters than the popular K-means algorithm. (A
minimal sketch contrasting the two appears after this list.)
 The problem of scaling algorithms to apply data mining to large data-
bases is a continuing effort. One area—sufficient statistics—seeks to
find optimal ways of computing the necessary pattern-matching rules
so that the rules that are discovered are reliable across the entire large
collection of data.
 Research is underway on a general data mining query language
(DMQL), the aim being to devise general methods within the DBMS
query language to form data mining queries. Current development
efforts focus on the SQL operators Unipivot and DataCube.
 There are continuing efforts to refine OLAP in the direction of data
mining and to further the integration of OLAP and data mining.
 A promising area of data mining is to define methods and procedures
that automate more and more of the searching that is undertaken.
This area of metarule-guided mining is a continuing effort in the
Socrates project.
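
As a companion to the clustering bullet above, here is a minimal sketch contrasting K-means with an expectation-maximization (Gaussian mixture) segmentation on synthetic data. The scikit-learn classes and the synthetic segments are assumptions used purely for illustration; nothing here reflects Microsoft's own algorithms.

    # Contrast hard K-means assignments with soft EM (Gaussian mixture) ones.
    # The data are synthetic; in practice the rows would be customer attributes.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Two overlapping "customer segments" with different spreads
    X = np.vstack([
        rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
        rng.normal(loc=[2.0, 1.0], scale=1.5, size=(200, 2)),
    ])

    kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
    gmm_labels = gmm.predict(X)       # hard assignments
    gmm_probs = gmm.predict_proba(X)  # soft (probabilistic) memberships

    # K-means assigns every point to its nearest centroid; the EM-based mixture
    # also reports how confident each assignment is, which is one reason it can
    # yield more reliable clusters when segments overlap.
    print("K-means cluster sizes:", np.bincount(kmeans_labels))
    print("points the mixture is unsure about:", int((gmm_probs.max(axis=1) < 0.9).sum()))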

2
The Data Mining Process
We are drowning in information but starving for knowledge.
—John Naisbitt
In the area of data mining, we could say we are drowning in algorithms but
too often lack the ability to use them to their full potential. This is an
understandable situation, given the recent introduction of data mining into
the broader marketplace (also bearing in mind the underlying complexity of
data mining processes and associated algorithms). But how do we manage
all this complexity in order to reap the benefits of facilitated extraction of
patterns, trends, and relationships in data? In the modern enterprise, the job
of managing complexity and identifying, documenting, preserving, and
deploying expertise is addressed in the discipline of knowledge manage-
ment. The area of knowledge management is addressed in greater detail in
Chapter 7. The goal of this chapter is to present both the scientific and the
practical, profit-driven sides of data mining so as to form a general picture
of the knowledge management issues regarding data mining, a picture that can bridge
and synergize these two key components of the overall data mining project
delivery framework.
In the context of data mining, knowledge management is the collection,
organization, and utilization of various methods, processes, and procedures
that are useful in turning data mining technology into business, social, and
economic value. Data miners began to recognize a role for knowledge man-
agement in data mining as early as 1995, when, at a conference in Mon-
treal, they coined the term Knowledge Discovery in Databases (KDD) to
describe the process of providing guidance, methods, and procedures to
extract information and knowledge from data. This development provides

us with an understanding of an important distinction: the distinction
between data mining—the specific algorithms and algorithmic approaches
that are used to detect trends, patterns, and relationships in data—and
Knowledge Discovery in Databases (KDD)—the set of skills, techniques,
approaches, processes, and procedures (best practices) that provides the pro-
cess management context for the data mining engagement.
Knowledge discovery methods are often very general and include proc-
esses and procedures that apply regardless of the specific form of the data
and regardless of the particular algorithm that is applied in the data mining
engagement. Data mining tools, techniques, and approaches are much
more specific in nature and are often related to specific algorithms, forms of
data, and data validation techniques. Both approaches are necessary for a
successful data mining engagement.
2.1 Best practices in knowledge discovery
in databases
Since its conception in 1995, KDD has continued to serve as a conduit for
the identification and dissemination of best practices in the adaptation and
deployment of algorithms and approaches to data mining tasks. KDD is
thought of as a scientific discipline, and the KDD conferences themselves
are thought of as academic and scientific exchanges. So access to much of
what the KDD has to offer assumes a knowledge and understanding of aca-
demic and scientific methods, and this, of course, is not always present in
business settings (e.g., scientific progress depends on the free, objective, and
open sharing of knowledge—the antithesis of competitive advantage in
business). On the other hand, business realities are often missing in aca-
demic gatherings. So a full understanding of the knowledge management
context surrounding data mining requires an appreciation of the scientific
methods that data mining and knowledge discovery are based on, as well as
an understanding of the applied characteristics of data mining engagements
in the competitive marketplace.

As a knowledge management discipline, what does KDD consist of?
KDD is strongly rooted in the scientific tradition and incorporates state-of-
the-art knowledge developed through a series of KDD conferences and
industry working groups that have been wrestling with the knowledge man-
agement issues in this area over the last decade. KDD conference partici-
pants, as well as KDD vendors, propose similar knowledge management
approaches to describe the KDD process. Two of the most widely known
(and well-documented) KDD processes are the SEMMA process, developed
and promoted by the SAS Institute, and the CRISP-DM
process, developed and promoted by a consortium of data mining consum-
ers and vendors that includes such well-known companies as Mercedes-
Benz and NCR Corporation.

At this point Microsoft does not appear to have developed a KDD proc-
ess, nor have they endorsed a given approach. Much of their thinking on
this is reflected in the various position papers contained on their data min-
ing Web site. In addition, quite a bit of
thinking is also captured in the commentary surrounding the OLE DB for
data mining standards. All approaches to data mining depend heavily on a
knowledge of the scientific method, which embodies one of the oldest, best-
documented, and most useful practices available today. An understanding of the
scientific method, particularly the concepts of sampling, measurement, the-
ories, hypotheses and paradigms, and, most certainly, statistics, is implied in
all data mining and knowledge discovery methodologies. A general, high-
level treatment of the scientific method, as a data mining best practice, fol-
lows. A discussion of the specific statistical techniques that are most popu-
larly used in data mining applications is taken up in later chapters, which
review the application of these methods to business problem solving.
2.2 The scientific method and the paradigms that
come with it

I’d wager that very few people who are undertaking a data mining engage-
ment for the first time think of themselves as scientists approaching a scien-
tific study. It is useful, possibly essential, to bring a scientific approach into
data mining, however, since whenever we look at data and how data can be
used to reflect and model real-world events we are implicitly adopting a sci-
entific approach (with an associated rule book that we are well advised to
become familiar with).
The scientific method contains a wide, well-developed, and well-docu-
mented system of best practices, which have played a central role in the
evolution of the current scientific and technological civilization as we know
it. While we may take much of science and engineering for granted, we
know either explicitly or intuitively that these developments would not have
been possible without a scientific discipline to drive the process. This disci-
pline, which reserves a central place for the role of data in order to measure,
test, and promote an understanding of real-world events, operates under the
covers of any data mining and KDD system. The scientific method plays
such a central role—either explicitly or implicitly—that it needs to be recog-
nized and understood in order to fully appreciate the development of KDD
and data mining solutions.
An excellent introduction to the scientific method is given in Abraham
Kaplan’s The Conduct of Enquiry. In a world where paradigm shift has
entered the popular lexicon, it is also certainly worth noting the work of
Thomas Kuhn, author of The Structure of Scientific Revolutions. Kaplan
describes the concept of theory advancement through tests of hypotheses.
He shows that you never really prove a hypothesis, you just build ever-
increasing evidence and detailed associated explanations, which provide
support for questions or hypotheses and which, eventually, provide an over-
all theory.
The lesson for data miners is this: we never actually “prove” anything in

a data mining engagement—all we do is build evidence for a prevailing view
of the world and how it operates and this evidence is constrained to the
view of the world that we maintain for purposes of engaging in the data
mining task. This means that facts gain certainty over time, since they show
themselves to be resistant to disproof. So you need to build a knowledge
store of facts and you need to take them out and exercise them with new
bits of data from time to time in order to improve their fitness. Data min-
ing—like science itself—is fundamentally cumulative and iterative, so store
and document your results.
There is another lesson: Facts, or evidence, have relevance only within
the context of the view of the world—or business model—in which they are
contained. This leads us to Thomas Kuhn.
Kuhn is the originator of the term paradigm shift. As Kaplan indicates, a
set of hypotheses, when constructed together, forms a theory. Kuhn suggests
that this theory, as well as associated hypotheses, is based on a particular
model, which serves as a descriptive or explanatory paradigm for the theory.
When the paradigm changes, so too does everything else: hypotheses, the-
ory, and associated evidence. For example, a mechanistic and deterministic
description of the universe gave way to a new relativistic, quantum concept
of the universe when Einstein introduced a new paradigm to account for
Newton’s descriptions of the operations of the universe. In a Newtonian
world, a falling object is taken as evidence for the operation of gravity. In
Einstein’s world, there is no gravity—only relative motion in a universe that
is bent back upon itself. Just as Newton’s paradigm gave way to Einstein’s, so
too did Keplar’s paradigm (the sun as the center of the universe) give way to
Newton’s.
What does this have to do with data mining and knowledge discovery?
Today we are moving into a new business paradigm. In the old paradigm,

business was organized into functional areas—marketing, finance, engineer-
ing—and a command and control system moved parts or services for manu-
facture through the various functional areas in order to produce an output
for distribution or consumption. This paradigm has changed to a new, cus-
tomer-centric paradigm. Here the customer is the center of the business,
and the business processes to service customer needs are woven seamlessly
around the customer to perceive and respond to needs in a coordinated,
multidisciplinary and timely manner with a network of process feedback
and control mechanisms. The data mining models need to reflect this busi-
ness paradigm in order to provide value. So just as experimental methods
and associated explanatory and descriptive models changed in the scientific
world to support Einstein’s view of the universe, so too do knowledge dis-
covery methods and associated explanatory and descriptive models need to
change to support a customer-centric view of business processes.
2.2.1 Creating an operational model of the paradigm
in the data mining engagement
At the start of the data mining engagement it is important to be clear about
the business process that is being modeled as well as the underlying para-
digm. Our paradigm will serve as a world view, or business model. How
does this work?
Say we have a hunch, or hypothesis, that customers develop loyalty over
time. We may not know what the factors are that create loyalty, but, as firm
believers of the scientific method, we intuitively understand that if we can
get the data in place we can construct a scientific experiment and use data
mining techniques to find the important factors. We might draw inspira-
tion from some early work conducted by pioneers in the field of science—
for example, tests to verify and validate what was, at the time, a somewhat
revolutionary concept: that air has mass (i.e., it is not weightless). To test
the concept that air has mass we form a hypothesis.
Figure 2.1 illustrates the process of testing a hypothesis. This hypothesis

is based on an “air has mass” paradigm. So, if air has mass, then it has
weight, which, at sea level, would press down on a column of mercury (liq-
uid poured into a glass tube with a bottom on it). If air has mass, then, as I
move away from sea level by walking up a mountain, for example, the
weight of air should be less and less. I can test this hypothesis, empirically,
by using data points (measurements of the height of mercury as I move up
the mountain). Of course, at the end of this experiment, having measured
the height of the column of mercury as I walk up the mountain, I will have
collected evidence to support my theory. If this were applied science (and
try to see how it is), then I would have a report, support for the theory, and
an action plan ready for executive approval based on the findings that are
supported by data and a solid experimental process based on the scientific
method.
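
A small sketch of the barometer experiment just described may help: regress the height of the mercury column on altitude and treat a clearly negative slope as evidence for the paradigm. The readings below are invented for illustration, and scipy's linear regression stands in for the assessment step.

    # Hedged sketch of the "air has mass" test with invented measurements.
    from scipy import stats

    altitude_m = [0, 250, 500, 750, 1000, 1250, 1500]   # stops walking up the mountain
    mercury_mm = [760, 738, 716, 695, 674, 654, 634]     # column height at each stop

    fit = stats.linregress(altitude_m, mercury_mm)
    print(f"slope = {fit.slope:.3f} mm per metre, p-value = {fit.pvalue:.2g}")

    # A clearly negative slope with a small p-value is evidence (not proof) that
    # the weight of the air column decreases as we ascend.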
In the case of customer loyalty, my paradigm is based on the concept
that customers interact with the business provider. Over time, some interac-
tions lead to loyalty while other interactions lead to the opposite—call it
disloyalty. Call this an interaction-based paradigm for customer relationship
modeling.
So what is the associated hypothesis? Well, if I am right—and the data
can confirm this—then long-time customers will behave differently from
short-term customers. A “poor” customer interaction, which will lead short-
term customers to defect, will not have the same outcome with a long-term
customer.
How do I test this hypothesis? As with the mountain climbing measure-
ments for the air mass experiment, I need to begin with a model. We might
begin with a napkin drawing, as illustrated in Figure 2.2. From the model I
will form hypotheses—some, such as customer recruitment will lead to cus-
tomer interaction, are trivial. Others, such as customer interactions may
lead to loyalty or defections and that this outcome may depend on the time

of the interaction, are rich in possibilities. To test a hypothesis I need data.
Figure 2.1 The scientific method—testing a hypothesis. Elements shown: hypothesis (air has mass), measurement (barometer; mercury column plotted against height/altitude on the trip up the mountain), assessment, action.

In this case I will need to assemble a data set that has customer time of
service measurements (tenure, or length of time, as a customer). I will also
need some indicator of interactions (e.g., number of calls at the call center,
type of call, service requests, overdue payments, and so on). I will also need
to construct an indicator of defection on the data set. This means that I
need customer observations through time and I need an indicator of defec-
tion. Once I do this, I will have an indicator of tenure, an interaction indi-
cator, and a defection indicator on my data set.
The test of my hypothesis is simple: All things considered, I expect that
a complaint indicator for newer customers will be associated with more
defections than would be the case with long-term customers. The business
advice in this simple explanation is correspondingly simple: Pay more atten-

tion to particular kinds of customer complaints if the customer is a new-
comer! As with the scientific experiment discussed previously, we are now in
a position to file a report with a recommendation that is based on fact, as
illustrated through empirical evidence collected from data and exploited
using peerless techniques based on the scientific method. Not bad. In the
process we have used the idea of forming a paradigm and associated hypoth-
eses and tests as a way to provide guidance on what kind of data we need,
how we need to reformat or manipulate the data, and even how we need to
guide the data mining engine in its search for relevant trends and patterns.
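
A minimal sketch of this test is shown below. The table layout is an assumption (hypothetical columns tenure_months, complaints, and defected), and the chi-square contingency test is just one simple way to compare defection rates for newer versus long-term customers who complained.

    # Sketch of the loyalty hypothesis test, assuming a hypothetical customer table.
    import pandas as pd
    from scipy.stats import chi2_contingency

    def complaint_defection_test(customers: pd.DataFrame, tenure_cutoff: int = 12):
        """Among customers who complained, do newcomers defect more than long-timers?"""
        complained = customers[customers["complaints"] > 0].copy()
        complained["newcomer"] = complained["tenure_months"] < tenure_cutoff
        table = pd.crosstab(complained["newcomer"], complained["defected"])
        chi2, p_value, _, _ = chi2_contingency(table)
        rates = complained.groupby("newcomer")["defected"].mean()
        return rates, p_value

    # Usage: rates, p = complaint_defection_test(customer_df)
    # A markedly higher defection rate for newcomers, with a small p-value,
    # supports the interaction-based paradigm sketched in Figure 2.2.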
Figure 2.2 A model as hypothesis—customer loyalty. Elements shown: recruit customer, interact with customer, loyalty, defection, plotted against time and value.
All this adds up to a considerable amount of time saved in carrying out the
knowledge discovery mission and lends considerable credibility to the
reporting and execution of the associated results. These are benefits that are
well worth the effort. (See Figure 2.2.)
2.3 How to develop your paradigm
All scientific, engineering, and business disciplines promote and propose
conceptual models that describe the operation of phenomena from a given
point of view. Nowadays, in addition to the napkin, it seems that the uni-

versal tool for visualizing these models is the whiteboard or a PowerPoint
slide. But what are the universal mechanisms for collecting the key concep-
tual drivers that form the model in the first place?
A number of interesting and promising model development techniques
have emerged out of the discipline of the Balanced Scorecard (Robert S.
Kaplan and David P. Norton). Other techniques, originally inspired by W.
Edwards Deming, have emerged from the field of quality management.
The search for quality, initially in manufacturing processes and now in
business processes in general, has led to the development of a number of
effective, scientifically based, and time-saving techniques, which are excep-
tionally useful for the data mining and knowledge discovery practitioner.
Many best practices have been developed in the area of quality management
to help people—and teams of people—to better conceptualize the problem
space they are working in. It is interesting to note that W. Edwards Deming
is universally acknowledged as the father of quality management. Deming
was a statistician who, after World War II, transformed manufacturing pro-
cesses forever through the introduction of the scientific method and associ-
ated statistical testing procedures in the service of improving the
manufacturing process. Quality management best practices are discussed in
many sources. One useful discussion and summary is found in Management
for Quality Improvement: The 7 New QC Tools by Mizuno.
One such best practice is a team brainstorming practice, which results in
the development of an issues and drivers relations diagram, as illustrated in
Figure 2.3. This diagram is a facilitating mechanism, useful to tap the group
memory—and any available documented evidence—in order to develop a
preliminary concept of all the relevant factors that could drive the under-
standing and explanation of a particular data mining solution.
The issues and drivers diagram shows which drivers are likely to be asso-
ciated with a given issue and—importantly—it shows, in a preliminary
manner, what the relationships between the drivers and issues appear to be.
The arrows and connections show not only relationships but the direction
of the presumed relationships.
The issues and drivers diagram is an important tool, especially when it is
used to tap into the group memory and problem-solving ability. It provides
the knowledge that is relevant in constructing the conceptual model or par-
adigm, which will later serve to drive the selection of data for the data min-
ing solution as well as the search for relationships in data to characterize
data mining trends, patterns, and statistical associations displayed in the
construction of the data mining solution.
One other useful diagram, once again drawn from the area of quality
assurance, is the “fish bone,” or Ishikawa diagram (so named in honor of its
original developer, Kaoru Ishikawa, a Japanese disciple of W. Edwards
Deming) (see Figure 2.4).
Figure 2.3 Developing the paradigm—issues and drivers relationship diagram.

Figure 2.4 An example of the Ishikawa diagram. Elements shown: problems 1–3, numbered factors 1–7, and branches for customer, account, household, and purchase behavior.

In outlining the issues and drivers diagram it will usually become apparent that many of the drivers can be classified together in line with some
conceptual similarity. In the examination of customer purchasing behavior,
for example, we may find that purchases depend upon such issues as dis-
count rate, timing, frequency, and channel of the offer, credit instrument,
and so on. Drivers may include such considerations as customer status (e.g.,
new, elite), customer attributes (e.g., gender, occupation, home owner),
purchase behavior (e.g., quantity and frequency of purchase), and so on.
The Ishikawa diagram is very useful in grouping these drivers together—as
a common class or driving factor—as one of the unique branches (or main
“bone,” if using the fish bone metaphor) drawn at an angle from the main
“spine” of the diagram.
2.3.1 Operationalization—turning concepts
into measurements
All empirical techniques begin with decidedly nonempirical components:
The analysis will end up sitting in an area with a numerical (empirical)
basis, but it will be guided by a theoretical model or paradigm of some kind.
Darwin’s Theory of Natural Selection, for example, serves as an orienting
framework to understand the adaptation of various new forms of life to var-
ious ecosystems. Economists would have similar theories to understand cur-

rency shifts and product success and failure. The trick is to convert
thoughts or concepts—analytical constructs—into operational measures
that can be manipulated symbolically by the data mining engine embedded
in software.
Figure 2.5 Relationship between concepts (analytical constructs) and data (operational measures). Elements shown: analytical constructs mapped to operational measures.

Analytical constructs become empirically rooted when a specific database measurement of the construct is adopted. For example, dollars spent, as measured on the data set, can serve as an operational measure of the
“price” construct in our economic model. (See Figure 2.5.) Our biological
model could use “number of offspring” as a measure of fitness for survival.
Within the theoretical framework we need to narrow in on a manageable
research question to serve as a specific case in point within the overall the-

ory. Some questions will be amenable to empirical testing and some won’t
(this depends heavily on what data are available and how well the data can
serve as an empirical measurement of the question under consideration—
for example, does dollars spent serve as a good measure of price?). The final
analytical model will be that set of testable hypotheses set up for empirical
verification that can be examined in order to move the exploration of the
research question forward.
To take Darwin’s “survival of the fittest” paradigm, as shown in Figure
2.6, we would need to begin with the conceptual pieces of the paradigm,
such as “gene pool candidates,” “stressors,” and “mating and species propa-
gation.” These processes, taken as components of a generalized process of
“natural selection,” would show that the “survival of the fittest” paradigm
produces a better “adapted, improved species.” The terms in quotes repre-
sent objects or actions. We can put them together to construct the concep-
tual description of the operation of the paradigm. In examining the
interoperation of these constructs, through a process of analysis, we can
generate and test hypotheses that relate to the presumed operation of the
paradigm. These “analytical constructs” then become central to our ability
to test and refine our paradigm so as to shed more light on the operation of
the process (in this case natural selection).
Figure 2.6 Example paradigm. Elements shown: gene pool candidates, stressors, and mating and species propagation feeding natural selection, yielding an adapted, "improved" species.

Since modern science is an empirical science, we need real, objectively
verifiable measurements to represent our analytical constructs, and, eventu-
ally, when we form a hypothesis, we will test the hypothesis, using scientific
tests of significance, on the data points, or measurements, that we have
taken to represent the analytical constructs.
So, for “gene pool candidates” and “adapted, improved species” we will
have such relevant measurements as average lifetime, running speed, resis-
tance to disease, and, for example, adaptability to various diets. We will
have stressors, such as attacks by predators, feast/famine cycles, weather
variations, and so on. For mating and propagation, the measures will include
mating episodes and number of offspring.
A simple hypothesis would be that the greater the stress, the more the
mating and offspring episodes improve the baseline life indicators
over successive generations. We expect, for example, that successor genera-
tions will run faster than earlier generations (assuming, in this simple exam-
ple, that fleeing from a predator was a stressor response). If we can confirm
this finding—in this case by reference to empirical measurements of run-
ning speed—then we take this as evidence in support of our paradigm.
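
As a small illustration of turning this into an empirical check, the sketch below rank-correlates an invented series of running-speed measurements with generation number. The data and the choice of a Spearman correlation are assumptions made purely for illustration of how the operational measures would be exercised.

    # Hedged sketch: do later generations run faster? Measurements are invented.
    from scipy import stats

    generation = [1, 2, 3, 4, 5, 6, 7, 8]                    # successive generations
    running_speed = [6.1, 6.3, 6.2, 6.6, 6.8, 6.7, 7.0, 7.2]  # mean speed (m/s)

    rho, p_value = stats.spearmanr(generation, running_speed)
    print(f"Spearman rho = {rho:.2f}, p-value = {p_value:.3f}")

    # A strong positive rank correlation with a small p-value is taken as
    # empirical evidence in support of the paradigm, not as proof of it.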
2.3.2 Beyond hypotheses—developing the
analytical model
Now we have seen how a general conceptual paradigm can help in the
development of the data mining question, or hypothesis. The examination
of the flow and direction of the relationships taken from the issues and driv-
ers diagram can help us look at the direction, sequence, and timing of the
drivers and classes of drivers, or factors, in the issues and drivers analysis.
The sequence of events is critically important in building an analytical
model of behavior. Not all relationships in data make sense from a sequen-
tial point of view. For example, in the data mining analysis you may want to
explore the relationship between age and amount purchased. You will prob-
ably find that as age increases so too does the purchase amount (if for no

other reason than, in general, as age increases so too do earning power and
disposable income). So the sequence of the analysis would be age → income
→ purchase amount. Unless your analytical model explicitly supports this
type of sequential view, you may find yourself in the uncomfortable posi-
tion of building a model like this: purchase amount → income → age or
even, income → purchase amount → age. We can see how increases in age
lead to increases in income and this, in turn, leads to increases in purchases.
But we will never be successful in showing how increases in purchase
amounts lead to increases in income, which, in turn, lead to increases in
age. So the examination of sequence is central to the construction of a good
analytical model to reflect the business behavior that you are looking at and
is essential to identifying the correct questions or hypotheses to explore.
To consider a more complex example, examine the construction of a
model that predicts precipitation. As shown in Figure 2.7, rainfall on the west-
ern coast depends upon a variety of factors: from evaporation rate over the
ocean, to prevailing winds, to the land mass on the coast (particularly varia-
tions caused by mountain ranges, for example), and such local factors as
particle concentration in the atmosphere. It only makes sense to look at the
relationship between evaporation rate and precipitation if you consider, in
appropriate sequence, the intervening effects of prevailing wind, land mass,
and particle concentration. The examination of intervening effects (factors
that appear between two related elements in a model) and predisposing
effects (prior factors that act on the related elements) is essential in identify-
ing the appropriate relationships to examine in an analysis.
The sequence A → intervening effect → B is a legitimate analysis path.
Beware, however, of looking at the relationship between A and B without
looking at the intervening effects. This could lead to a misspecified model
that asserts, for example, that low evaporation leads to precipitation on the

coast by failing to take into consideration the lagged effects of the interven-
ing variables.
Looking at a relationship without examining intervening and predispos-
ing effects can lead to the identification of spurious relationships—that is,
Figure 2.7 An example model showing the direction of effects. Elements shown: evaporation, prevailing winds, land mass, particle concentration.

the relationships may be strong but incorrect (you may find a strong relationship between low evaporation and high precipitation but only because
this ignores the intervening operation—through time—of prevailing winds
that contain the water-logged atmosphere that actually fueled the precipita-
tion at the time the measurements were taken).
The sequence A (predisposing effect) → B → C provides an example of
the operation of a spurious relationship. You may observe a relationship
between elevation and precipitation. Simplistically, you may assert that ele-
vation “causes” precipitation, as if water leaked out of rocks as air mass

decreases. This is a tenable hypothesis, but the empirical validation is flawed
because you have neglected to include the predisposing (earlier sequence)
effect of evaporation rate over the true source of water (the ocean).
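
The following sketch illustrates the point with simulated data in which age influences purchase amount only through income: the raw correlation between age and purchases looks strong, but the age coefficient collapses once the intervening variable enters the regression. All numbers are invented.

    # Why intervening effects matter: age -> income -> purchase amount.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000
    age = rng.uniform(20, 70, n)
    income = 1_000 * age + rng.normal(0, 8_000, n)     # age drives income
    purchase = 0.02 * income + rng.normal(0, 300, n)   # income drives purchases

    print("corr(age, purchase):", round(np.corrcoef(age, purchase)[0, 1], 2))

    # Regress purchase on age and income together: the age coefficient shrinks
    # toward zero once the intervening variable (income) is in the model.
    X = np.column_stack([np.ones(n), age, income])
    coef, *_ = np.linalg.lstsq(X, purchase, rcond=None)
    print("age coefficient:", round(coef[1], 3), "income coefficient:", round(coef[2], 4))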
The Ishikawa diagram, originally shown in Figure 2.4, can help us in
defining sequences of relationships. Say, for example, that the issue or out-
come we are looking at is a determinant of customer purchases. Our issues
and drivers diagram may have suggested an Ishikawa diagram. Here, drivers
tend to group in customer, household, account, and behavior groupings.
Customer groupings include such data as age, gender, marital status, and,
typically, when other data sources can be tapped, indicators of income, and
even educational or occupational attributes. Household indicators may
include such measurements as type of household, own or rent status, and,
potentially, number of children in the household and even kind and num-
ber of appliances (electronic or otherwise). Account status may include type
of account, number of accounts, account payment indicators, activation
date, and length of service. Behavior may include such indicators as date of
purchase, purchase time or location, and such information as quantity of
purchase, price, and, possibly, discount rate.
Figure 2.8 Example Ishikawa diagram illustrating the flow of effect and components of the descriptive model of purchase behavior. Branches: user characteristics (gender, age, peer group, life cycle), behavioral predispositions (purchase type, stated preference), user affinities (customer type, tenure), and buyer–seller relationship (purchase style, browsing patterns, buying metrics).

In developing the Ishikawa diagram it is useful to arrange the general concepts and specific data points that belong to the factors in order, moving from left to right on the diagram. This results in another diagram, shown in
Figure 2.8. Here you can readily see why these diagrams are sometimes
referred to as “fish bone diagrams.”
2.4 The data mining process methodology
A number of best practice methodologies have emerged to provide guidance
on carrying out a data mining undertaking. Two of the most widely known
methodologies are the CRISP-DM methodology and the SEMMA meth-

odology. CRISP-DM stands for Cross-Industry Standard Process for Data
Mining. It has been promoted as an open, cross-industry standard and has
been developed and promoted by a variety of interests and data mining ven-
dors, including Mercedes-Benz and NCR Corporation. SEMMA is a pro-
prietary methodology, which was developed by the SAS Institute. SEMMA
stands for Sample, Explore, Modify, Model, and Assess. There are many
other methodologies that have been proposed and promoted; many of them
are contained in the various books that have been written about data min-
ing. (See, for example, Data Mining Techniques [Berry and Linoff].) Most
can be seen to be themes and variations of the CRISP-DM methodology.
2.4.1 CRISP-DM Process
The CRISP-DM methodology is a multinational, standards-based ap-
proach to describe, document, and continuously improve data mining (and
associated data warehousing, business intelligence) processes. The CRISP-
DM framework identifies six steps in the data mining process, as shown in
the following list. I have added a seventh step—performance measure-
ment—to capture the closed-loop characteristic of a virtuous cycle of con-
tinuous process improvement through successive plan, analyze, implement,
measure iterations of data mining projects.
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment
7. Performance measurement
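
A schematic sketch of the closed-loop character of these phases follows. It is not the CRISP-DM specification itself; the phase functions and the state dictionary are placeholders for whatever artifacts each step actually produces.

    # Schematic sketch (not the official CRISP-DM specification) of successive
    # plan-analyze-implement-measure iterations, with measurement feeding back
    # into the next pass. Phase functions are hypothetical placeholders.
    from typing import Callable, Dict, List

    Phase = Callable[[Dict], Dict]

    def run_crisp_dm_cycle(phases: List[Phase], state: Dict, iterations: int = 3) -> Dict:
        """Run the closed loop: each phase refines a shared project state."""
        for i in range(iterations):
            for phase in phases:
                state = phase(state)          # each phase returns an updated state
            state["iteration"] = i + 1        # the measured outcome seeds the next pass
        return state

    # Usage (with hypothetical functions for each of the seven phases):
    # phases = [business_understanding, data_understanding, data_preparation,
    #           modeling, evaluation, deployment, performance_measurement]
    # final_state = run_crisp_dm_cycle(phases, state={"goal": "reduce churn"})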
The process is illustrated in Figure 2.9, which highlights a number of
important characteristics of data mining:
1. It is, at its core, a top-down, goal-driven process: Everything

hinges on the definition of the business goal to be accomplished.
2. It is a closed-loop process: everything flows into an assessment
step, which, in turn, flows back into the redefinition (and reexe-
cution) of the goal-setting phase. This closed-loop nature of the
process has been dubbed “the virtuous cycle” by many observers
(prominently, Michael Berry and Gordon Linoff in their popular
treatment of data mining: Data Mining Techniques). The closed-
loop cycle tells us that there is no virtue in a one-off, “quick hits”
approach to data mining. True value comes over time with succes-
sive refinements to the data mining goal execution task.
3. The methodology is not a linear process: There are many feed-
back loops where successive, top-down refinements are interwo-
ven in the successful closed-loop engagement.
Figure 2.9 An example of the CRISP-DM framework. Elements shown: business understanding, data understanding, data preparation, modeling, evaluation, deployment, and performance measurement, arranged around the database in a closed loop.