Data Mining
Hanu
Education and Research Dept
Infosys Technologies Limited,
Mangalore
Overview of Data Mining
Problem
• You are a marketing manager for a cellular telephone company
• Problem: Churn is too high
• Turnover (after contract expires) is 40%
• Customers receive a free phone (cost: $125) with contract
• You pay a sales commission of $250 per contract
• Giving a new phone to everyone whose contract is expiring is very expensive (as well as wasteful)
• Bringing back a customer after quitting is both difficult and expensive
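The figures above can be turned into a rough cost comparison. The sketch below is only illustrative: the count of 1,000 expiring contracts is an assumption made for the arithmetic, and it assumes that replacing a lost customer costs one free phone plus one sales commission.

```python
# Figures taken from the slide.
CHURN_RATE = 0.40        # share of customers who leave when the contract expires
PHONE_COST = 125         # free phone given with each contract, in dollars
SALES_COMMISSION = 250   # commission paid per contract, in dollars

# Assumption for illustration only: 1,000 contracts expire in the period.
expiring_customers = 1000

# Cost of signing one new contract (phone plus commission).
cost_per_new_contract = PHONE_COST + SALES_COMMISSION

# Cost of replacing every customer who leaves with a newly signed one.
churners = round(expiring_customers * CHURN_RATE)
replacement_cost = churners * cost_per_new_contract

# Blanket retention: a new phone for everyone, including those who would stay anyway.
blanket_offer_cost = expiring_customers * PHONE_COST
wasted_on_loyal = (expiring_customers - churners) * PHONE_COST

print(cost_per_new_contract, replacement_cost, blanket_offer_cost, wasted_on_loyal)
```

Under these assumptions, most of the blanket offer goes to customers who were not going to leave, which is the waste the slide refers to.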
Solution
• Three months before a contract expires, predict which customers will leave
• If you want to keep a customer who is predicted to churn, offer them a new phone
• The ones that are not predicted to churn need no attention
• If you don’t want to keep the customer, do nothing
How can you predict future behavior?
• Parrot Cards?
• Magic Ball?
• Data Mining?
Definition
• Data Mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories and by using pattern recognition technologies as well as statistical and mathematical techniques
- The Gartner Group
• In simple words: the automated extraction of predictive information from large databases
Goal of Data Mining
• Simplification and automation of the overall statistical process
• Replace statistician
Evolution of Data Usage Technology
• 1960s - Data Collection: "What was my total revenue in the last five years?"
• 1980s - Relational Database: "What were unit sales in California last March?"
• 1990s - OLAP: "What were unit sales in California last March? Drill down to San Jose"
Data Mining
• Emerging Today: "What’s likely to happen to San Jose unit sales next month? Why?"
Functionality Chart
Architecture of Data Mining
Data Mining is
• Neural Networks
• Rule Induction
• Nearest Neighbor
• Genetic Algorithms
Data Mining is NOT
• Ad Hoc Query / Reporting
• Online Analytical Processing (OLAP)
• Data Visualization
• Software Agents
Now it is affordable
• Massive computational power of new computers
• Huge databases
• Proven statistical models
More than anything: business compulsion
Technology is just one element
• Collect Data
• Organize Data
• Turn model into action DM
• Integration Usability
Target Application
• Marketing / Finance
• Direct mail
• Customer acquisition / attrition
• Fraud detection / Credit scoring
Technology sometimes depends on the vertical sector
• Credit Scoring: Neural Networks
• Direct Mail: CHAID
Typical analysis
• Classification / Segmentation
  • Binary (Yes/No)
  • Multiple category (Large/Medium/Small)
• Forecasting (how much)
• Association rule extraction (market basket analysis)
• Sequence detection (balance increase - missed payment - default)
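As an illustration of association rule extraction, the sketch below computes the two measures usually quoted for a market basket rule, support and confidence. The baskets and item names are invented for this example and do not come from the slides.

```python
# Hypothetical market baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions with the antecedent, the fraction that also hold the consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)

# Rule under test: "if bread then milk".
rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
print(rule_support, rule_confidence)
```

A rule miner would enumerate many candidate itemsets and keep only those above chosen support and confidence thresholds; this sketch scores a single hand-picked rule.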
Modeling
Nearest neighbor method
A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset.
Sometimes called the k-nearest neighbor technique.
Very easy to implement.
More difficult to use in production.
Disadvantage: Huge Models
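A minimal sketch of the k-nearest neighbor idea, assuming Euclidean distance as the similarity measure and a simple majority vote as the way of combining classes. The records, field meanings and labels are hypothetical.

```python
from collections import Counter
import math

# Hypothetical historical records: (monthly_minutes, support_calls) -> class label.
history = [
    ((120.0, 0.0), "stay"),
    ((150.0, 1.0), "stay"),
    ((200.0, 0.0), "stay"),
    ((640.0, 5.0), "churn"),
    ((700.0, 4.0), "churn"),
    ((680.0, 6.0), "churn"),
]

def knn_classify(record, historical, k=3):
    """Return the majority class among the k historical records closest to record."""
    by_distance = sorted(historical, key=lambda item: math.dist(record, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((660.0, 5.0), history))
print(knn_classify((140.0, 1.0), history))
```

The whole historical dataset has to be kept and scanned for every prediction, which is one way to read the "huge models" disadvantage. In practice the fields would usually be rescaled first so that no single field dominates the distance.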
Artificial neural networks
Non-linear predictive models that learn through training and resemble biological neural networks in structure.
Difficult to understand
Relationship between weights and variables is complicated
Significant pre-processing of data often required
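To make "learning through training" and "pre-processing" concrete, the sketch below trains a single sigmoid unit, the building block of a neural network, by batch gradient descent after min-max scaling the inputs. It is not a full multi-layer network, and the records, learning rate and epoch count are assumptions chosen for illustration.

```python
import math

# Hypothetical raw records: (monthly_minutes, support_calls) -> 1 if the customer churned.
raw = [((100.0, 0.0), 0), ((200.0, 1.0), 0), ((800.0, 4.0), 1), ((900.0, 5.0), 1)]

def min_max_scale(rows):
    """Pre-processing: rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(inputs, targets, epochs=2000, learning_rate=1.0):
    """Fit one sigmoid unit by batch gradient descent on the log loss."""
    weights = [0.0] * len(inputs[0])
    bias = 0.0
    n = len(inputs)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(inputs, targets):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

features = min_max_scale([x for x, _ in raw])
labels = [y for _, y in raw]
weights, bias = train_neuron(features, labels)
predictions = [
    round(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)) for x in features
]
print(predictions)
```

Even in this tiny case the learned weights are just numbers with no direct business meaning, which hints at why larger networks are difficult to interpret.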
Rule induction
The extraction of useful if-then rules from data based on statistical significance
If Car = Ford and Age = 30…40 Then Defaults = Yes (Weight = 3.7)
If Age = 25…35 and Prior_purchase = No Then Defaults = No (Weight = 1.2)
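Once rules have been induced, applying them to a new record is straightforward. The sketch below encodes the two example rules as data and evaluates a record against them. Several details are assumptions rather than something the slide states: pairing each weight with a rule by order of appearance, treating the age ranges as inclusive, and letting the heaviest matching rule win when more than one fires.

```python
# The two example rules, encoded as data. Weight pairing, inclusive age
# ranges and "heaviest rule wins" are assumptions made for this sketch.
RULES = [
    {"conditions": {"Car": "Ford", "Age": (30, 40)}, "Defaults": "Yes", "weight": 3.7},
    {"conditions": {"Age": (25, 35), "Prior_purchase": "No"}, "Defaults": "No", "weight": 1.2},
]

def matches(record, conditions):
    """True if the record satisfies every condition of a rule."""
    for field, expected in conditions.items():
        value = record.get(field)
        if value is None:
            return False
        if isinstance(expected, tuple):
            low, high = expected
            if not low <= value <= high:
                return False
        elif value != expected:
            return False
    return True

def predict_default(record, rules=RULES):
    """Return the conclusion of the heaviest matching rule, or None if no rule fires."""
    fired = [r for r in rules if matches(record, r["conditions"])]
    if not fired:
        return None
    return max(fired, key=lambda r: r["weight"])["Defaults"]

print(predict_default({"Car": "Ford", "Age": 33, "Prior_purchase": "No"}))
print(predict_default({"Car": "Honda", "Age": 28, "Prior_purchase": "No"}))
```

The first record satisfies both rules because the age ranges overlap, so some conflict-resolution policy is needed; using the weight for that is only one possible choice.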
Decision Tree
• Tree-shaped structures that represent sets of decisions.
• These decisions generate rules for the classification of a data set.
• Specific decision tree methods include Classification and Regression Trees (CART) and Chi-squared Automatic Interaction Detection (CHAID)
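The link between a tree and the rules it generates can be sketched as follows. The tree here is written by hand with hypothetical fields and thresholds; it is not the output of CART or CHAID, which would learn the splits from data.

```python
# Hypothetical tree: internal nodes test one field, leaves hold a class label.
TREE = {
    "field": "contract_months_left",
    "threshold": 3,
    "below": {
        "field": "support_calls",
        "threshold": 4,
        "below": {"label": "stay"},
        "at_or_above": {"label": "churn"},
    },
    "at_or_above": {"label": "stay"},
}

def classify(node, record):
    """Walk from the root to a leaf, following the branch each test selects."""
    while "label" not in node:
        branch = "below" if record[node["field"]] < node["threshold"] else "at_or_above"
        node = node[branch]
    return node["label"]

def extract_rules(node, path=()):
    """List one if-then rule per leaf, built from the tests on the path to it."""
    if "label" in node:
        condition = " and ".join(path) if path else "always"
        return [f"If {condition} Then class = {node['label']}"]
    test = f"{node['field']} < {node['threshold']}"
    negated = f"{node['field']} >= {node['threshold']}"
    return (extract_rules(node["below"], path + (test,))
            + extract_rules(node["at_or_above"], path + (negated,)))

for rule in extract_rules(TREE):
    print(rule)
print(classify(TREE, {"contract_months_left": 2, "support_calls": 5}))
```

Each leaf yields exactly one rule, so the number of rules equals the number of leaves.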
Other Algorithms
• Genetic Algorithms (more of a search technique than a data mining algorithm)
• Rough sets
• Bayesian networks
• Mixture models
• Many more
Wish list of application
Database integration
Automated Model Scoring
Exporting Models to Other Applications
Business Templates
Incorporate Financial Information
Computed Target Columns
Time-Series Data
Wizards
References
• www.Microsoft.com/sqlserver
• Data Mining Techniques - Michael J. A. Berry, Gordon Linoff