Lecture Business management information system - Lecture 26: Data mining

Data Mining
Lecture 26


Today’s Lecture
• What is data mining?
• Why data mining?
• What applications?
• What techniques?
• What process?
• What software?



Definition
Data mining may be defined as follows:
data mining is a collection of techniques for efficient
automated discovery of previously unknown, valid, novel,
useful and understandable patterns in large databases.
The patterns must be actionable so they may be used in
an enterprise’s decision making.


What is Data Mining?
• Efficient automated discovery of previously unknown
patterns in large volumes of data.
• Patterns must be valid, novel, useful and understandable.
• Businesses are mostly interested in discovering past
patterns to predict future behaviour.
• A data warehouse, as discussed earlier, is an enterprise’s
memory. Data mining can provide intelligence using that
memory.


Examples
• amazon.com uses associations. Recommendations to
customers are based on their past purchases and on what
other customers are purchasing.
• “Just for Feet”, a retailer in the USA, has about 200 stores,
each carrying up to 6000 shoe styles, each style in several
sizes. Data mining is used to find the right shoes to stock in
the right store.

More examples in case studies to be discussed later.
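
The idea of associations in the amazon.com example can be sketched in a few lines of Python. This is only a toy illustration, not the retailer's actual method: the baskets, the item names and the helper functions `confidence` and `recommend` are all hypothetical. It counts how often pairs of items were bought together and ranks recommendations by the fraction of buyers of one item who also bought the other.

```python
from collections import Counter
from itertools import combinations

# Toy purchase histories (hypothetical data, for illustration only).
baskets = [
    {"book_a", "book_b", "book_c"},
    {"book_a", "book_b"},
    {"book_a", "book_c"},
    {"book_b", "book_d"},
]

item_counts = Counter()   # number of baskets containing each item
pair_counts = Counter()   # number of baskets containing each pair of items
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(frozenset(p) for p in combinations(sorted(basket), 2))


def confidence(bought, other):
    """Fraction of baskets containing `bought` that also contain `other`."""
    return pair_counts[frozenset((bought, other))] / item_counts[bought]


def recommend(bought, top_n=2):
    """Rank other items by how often they were bought together with `bought`."""
    scores = {
        item: confidence(bought, item)
        for item in item_counts
        if item != bought and pair_counts[frozenset((bought, item))] > 0
    }
    # Highest confidence first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


print(recommend("book_a"))
```

With real data the counts would come from millions of transactions, which is why the efficiency of the discovery technique matters.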


Data Mining
• We assume we are dealing with large volumes of data,
perhaps gigabytes, perhaps terabytes.
• Although data mining is possible with smaller amounts of
data, the bigger the data, the higher the confidence in any
previously unknown pattern that is discovered.
• There is considerable hype about data mining at the
present time and Gartner Group has listed data mining as
one of the top ten technologies to watch.

Question: How many books could one store in one Terabyte of memory?
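
One rough way to approach this question is a back-of-the-envelope calculation. The book size used here is an assumption, not a figure from the lecture: a plain-text book of about 500 pages at 2,000 characters per page, stored at one byte per character, and a decimal terabyte of 10^12 bytes.

```python
# Assumptions (not from the lecture): plain-text books, one byte per
# character, 500 pages per book, 2,000 characters per page.
BYTES_PER_TERABYTE = 10**12   # decimal terabyte
PAGES_PER_BOOK = 500
CHARS_PER_PAGE = 2_000

bytes_per_book = PAGES_PER_BOOK * CHARS_PER_PAGE        # about 1 MB per book
books_per_terabyte = BYTES_PER_TERABYTE // bytes_per_book

print(f"{books_per_terabyte:,} books per terabyte")
```

Under these assumptions the answer is of the order of a million books. Books with images or rich formatting would take far more space each.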

Why Data Mining Now?
• Growth in generation and storage of corporate data –
information explosion.
• Need for sophisticated decision making – current
database systems are Online Transaction Processing
(OLTP) systems. OLTP data is difficult to use for
such applications. Why?
• Evolution of technology – much cheaper storage, easier
data collection, better database management, and better
data analysis and understanding.


Information explosion
• Database systems have been in use since the 1960s in
Western countries (perhaps since the 1980s in India).
These systems have generated mountains of data.
• Point of sale terminals and bar codes on many
products, railway bookings, educational institutions,
the huge number of mobile phones and electronic commerce
all generate data.
• The government is now collecting a lot of information.



Information explosion
• Internet banking via networked computers and ATMs.
• Credit and debit cards.
• Medical data, doctors, hospitals.
• Transportation, Indian railways, automatic toll collection
on toll roads, growing air travel.
• Passports, NRI visas, other visas, NRI money
transfers.

Question: Can you think of other examples of data collection?


Information explosion
Many adults in India generate:
• Mobile phone transactions. There are more than 300 million
phones in India, reportedly growing at the rate of 10,000 new
ones every hour! Mobile companies must save
information about calls.
• Credit and debit card transactions, driven by a growing
middle class. There were about 25m credit cards and 70m
debit cards in 2007, with annual growth rates of about 30% and
40% respectively. There could be 55m credit cards and 200m
debit cards in 2010, resulting in perhaps 500m
transactions annually.
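
The 2010 card estimates follow from compounding the 2007 figures at the stated annual growth rates for three years. A minimal check, assuming constant growth over the period (the function name `project` is just for illustration):

```python
def project(count_millions, annual_growth, years):
    """Compound a starting count at a constant annual growth rate."""
    return count_millions * (1 + annual_growth) ** years


# 2007 figures and growth rates from the slide, projected 3 years to 2010.
credit_2010 = project(25, 0.30, 3)   # about 55m credit cards
debit_2010 = project(70, 0.40, 3)    # about 192m, rounded to 200m on the slide

print(f"credit cards: {credit_2010:.1f}m, debit cards: {debit_2010:.1f}m")
```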


Information explosion
• India has some huge enterprises, for example Indian
Railways, perhaps the busiest network in the world, with
2.5m employees, 10,000 locomotives, 10,000 passenger
trains daily, 10,000 freight trains daily and 20m
passengers daily.
• Growing airline traffic with more than ten airlines. Perhaps
30m passengers annually.
• Growing number of motor vehicles – registration,
insurance, driver licences.
• Internet surfing records.



OLTP
As noted earlier, most enterprise database systems were
designed in the 1970s or 1980s, mainly to automate
some office procedures, e.g. order entry, student
enrolment, patient registration and airline reservations.
These are well-structured, repetitive operations that are
easily automated.


Decision Making
• Need for business memory and intelligence.
• Need to serve customers better by learning from past
interactions.
• OLTP data is not a good basis for maintaining an
enterprise memory.
• The intelligence hidden in data could be the secret
weapon in a competitive business world, but given the
information explosion not even a small fraction could be
looked at by the human eye.


Question: Why is OLTP not good for maintaining an enterprise memory?

OLTP vs Decision Making
The clerical view of data focuses on the details required for
the day-to-day running of an enterprise.
The management view of data focuses on summary data to
identify trends, challenges and opportunities.
The detailed data view is the operational view, while
the management view is the decision-support view.
Comparison of the two views:


Operational vs Management View
Operational               Decision-Support
Users – Admin staff       Users – Management
Day-to-day work           Decision support
Application oriented      Subject oriented
Current data              Historical data
Detailed                  Overall view – summaries
Simple queries            Complex queries
Predetermined queries     Ad hoc queries
Update/Select             Only Select
Real-time                 Not real-time
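
The contrast between the two columns can be illustrated with a small sketch using Python's built-in sqlite3 module. The `orders` table, its columns and its rows are invented for illustration and are not from the lecture. The first query is a typical operational request (update and look up one current record); the second is a typical decision-support request (a read-only summary over historical data).

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "asha", 2006, 120.0),
        (2, "ravi", 2006, 80.0),
        (3, "asha", 2007, 200.0),
        (4, "ravi", 2007, 50.0),
    ],
)

# Operational view: a simple, predetermined update and lookup of one record.
conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (60.0, 4))
order_row = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = ?", (4,)
).fetchone()

# Decision-support view: a select-only summary across historical data.
yearly_totals = conn.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()

print(order_row)
print(yearly_totals)
```

In practice the summary query would run against a separate data warehouse rather than the live transaction database, so that long-running analysis does not slow down day-to-day work.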

Evolution of Technology
• Corporate data growth has been accompanied by a decline in
the cost of storage and processing.
• PC motherboard performance, measured in MHz/$, is
currently doubling every 27 ± 2 months.
• The next slide, which uses a logarithmic scale, shows that
disk storage is now about 10 GB per US dollar, and the
following slide shows that sales of disk storage are growing
exponentially.
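
To get a feel for what a doubling time of 27 ± 2 months means, the growth over a longer period can be computed as 2 raised to the power (elapsed months / doubling time). The ten-year horizon and the function name `growth_factor` below are illustrative choices, not figures from the lecture.

```python
def growth_factor(months, doubling_months):
    """How many times performance per dollar multiplies over `months`."""
    return 2 ** (months / doubling_months)


# Growth over ten years (120 months) at the slow, central and fast
# ends of the 27 +/- 2 month doubling time.
for doubling in (29, 27, 25):
    factor = growth_factor(120, doubling)
    print(f"doubling every {doubling} months: about {factor:.0f}x in 10 years")
```

Even the small uncertainty in the doubling time leads to noticeably different ten-year estimates, which is typical of exponential growth.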

Question: What is the cost of a 100 GB disk? What is the cost of a PC, and what is its
CPU performance?


Decline in Hard Drive cost


Growth in Worldwide Disk Capacity
[Figure: storage in petabytes (vertical axis, 0 to 18,000) by year (horizontal axis, 1996 to 2003)]


Evolution of Technology


Question: What do the graphs in the last two slides tell us? What scales are used in
them? What was the pink line in the first graph?


Evolution of Technology
• Database technology has improved over the years.
• Data collection is often much better and cheaper now.
• The need for analyzing and synthesizing information is
growing in today’s fiercely competitive business
environment.


New applications
Sophisticated applications of modern enterprises include:
- sales forecasting and analysis
- marketing and promotion planning
- business modeling
OLTP is not designed for such applications. Also, large
enterprises operate a number of database systems, so
information must be integrated for decision-making
applications.
Question: Why can OLTP not be used for sales forecasting and analysis?



Why Data Mining Now?
As noted earlier, the reasons may be summarized as:
• Accumulation of large amounts of data
• Increased affordable computing power enabling data
mining processing
• Statistical and learning algorithms
• Availability of software
• Strong business competition


Large amount of data
As already discussed, many enterprises have large
amounts of data accumulated over 30+ years.
As noted earlier, some enterprises collect information
for analysis. For example, supermarkets in the USA offer
loyalty cards in exchange for shopper information.
Loyalty cards in Australia also collect information
using a reward system.


Growth of cards
A recent survey in the USA found that the percentages of
US adults using the following types of cards were:
• Credit cards - 88%
• ATM cards - 60%
• Membership cards - 58%
• Debit cards - 35%
• Prepaid cards - 35%
• Loyalty cards - 29%
Question: What kind of data do these cards generate?


Affordable computing power
Data mining is usually computationally intensive.
The dramatic reduction in the price of computer systems,
as noted earlier, is making it possible to carry out
data mining without investing huge amounts of
resources in hardware and software.
In spite of affordable computing power, data
mining can still be resource intensive.
