Lecture 3 numerical statistics

Lecture 4
DESCRIPTIVE STATISTICS:
Numerical summaries

BUSINESS STATISTICS
Advanced Educational Program

Reading materials:
Chap 4 (Keller)

Outline

Measure of center and spread

•  Measures of center:
-  Mean, median, mode
-  Selection of measures of location

•  Measures of dispersion (spread):
-  Range, quartile range, quartile deviation, variance, standard deviation

•  Empirical rule (general case: Chebyshev's theorem)
•  Coefficient of variation
•  Coefficient of skewness

Measures of center

•  A measure of center (or location) shows where the center of the data is
•  Three most useful measures of location:
-  Arithmetic mean (average)
-  Median
-  Mode

Arithmetic mean from raw data

•  Arithmetic mean from population:

$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$

•  Arithmetic mean from sample:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Where:
$X_i$, $x_i$ - the value of each item
$N$, $n$ - total number of items

Arithmetic mean from frequency table

•  Apply this formula for the sample:

$\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}$

Where:
$x_i$ - the value of class i
$f_i$ - frequency of class i
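The two mean formulas can be checked against each other with a short sketch. The data below are hypothetical (invented for illustration, not taken from the lecture), and the function names are assumptions; the frequency table is simply the raw data grouped by value, so both routes should give the same mean.

```python
def mean_raw(values):
    """Arithmetic mean of raw data: sum of the items divided by their count."""
    return sum(values) / len(values)


def mean_from_frequency_table(class_values, frequencies):
    """Mean from a frequency table: sum of x_i * f_i divided by the sum of f_i."""
    weighted_total = sum(x * f for x, f in zip(class_values, frequencies))
    return weighted_total / sum(frequencies)


# Hypothetical data, invented for illustration only.
raw = [2, 2, 3, 3, 3, 5]
class_values = [2, 3, 5]   # distinct values x_i
frequencies = [2, 3, 1]    # how often each value occurs, f_i

print(mean_raw(raw))
print(mean_from_frequency_table(class_values, frequencies))
```

Both calls print 3.0, since 18 / 6 = 3 either way.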

Advantages and disadvantages of arithmetic mean
•  Advantages:
-  Easy to understand and calculate
-  The value of every item is included => representative of the whole data set

•  Disadvantages:
-  Sensitive to outliers

Mean is sensitive to outliers
Sample: (43; 38; 37; ...; 27; 34) => $\bar{x} = 33.5$
Contaminated sample: (43; 38; 37; ...; 27; 1934) => $\bar{x} = 71.5$

Median
•  The median is the value of the observation located in the middle of the data set

Steps to find median:
1.  Arrange the observations in order of size (normally ascending order)
2.  Find the number of observations and hence the middle observation
3.  The median is the value of the middle observation

Calculate median from raw data
•  If the data has an odd number of observations:
-  Middle observation: the $\frac{n+1}{2}$th

$\text{Median} = x_{\left(\frac{n+1}{2}\right)}$

•  If the data has an even number of observations:
-  There are two observations located in the middle, and

$\text{Median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$

Example

•  E.g. 1. Raw data: 11, 11, 13, 14, 17 => find median
•  E.g. 2. Raw data: 11, 11, 13, 14, 16, 17 => find median

Advantages and disadvantages of median
•  Advantages:
-  Easy to understand and calculate
-  Not affected by outlying values => can be used when the mean would be misleading

•  Disadvantages:
-  It is the value of one observation => fails to reflect the whole data set
-  Not easy to use in other analysis
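The steps for finding the median, applied to the two raw-data examples above, might be sketched as follows. The function name and the contaminated data set (E.g. 1 with its last value replaced by 1700) are assumptions added here to illustrate the claim that the median is not affected by outlying values.

```python
def median(values):
    """Median of raw data: sort, then take the middle value (odd n)
    or the average of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # the ((n + 1) / 2)-th observation, counting from 1
    return (ordered[middle - 1] + ordered[middle]) / 2


example_1 = [11, 11, 13, 14, 17]        # odd number of observations
example_2 = [11, 11, 13, 14, 16, 17]    # even number of observations
print(median(example_1))  # 13
print(median(example_2))  # 13.5

# Hypothetical contamination: the largest value is mistyped as 1700.
contaminated = [11, 11, 13, 14, 1700]
print(median(contaminated))                   # still 13
print(sum(example_1) / len(example_1))        # mean of the clean data
print(sum(contaminated) / len(contaminated))  # mean pulled far upwards
```

The median stays at 13 while the mean jumps from about 13.2 to about 349.8.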

Mode
•  The mode is the value which occurs most frequently in the data set
•  Steps to find mode:
1.  Draw a frequency table for the data
2.  Identify the mode as the most frequent value

Example to calculate mode

Value    Frequency
8        3
12       7
16       12
17       8
19       5
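A minimal sketch of step 2, assuming the numbers in the example above pair up as (value, frequency): 8 occurs 3 times, 12 occurs 7 times, and so on. The helper name `modes` and the small raw data set are assumptions; the helper returns a list because a data set may have more than one mode.

```python
from collections import Counter


def modes(frequency_table):
    """Return every value whose frequency equals the highest frequency."""
    highest = max(frequency_table.values())
    return [value for value, count in frequency_table.items() if count == highest]


# Frequency table from the example: value -> frequency.
table = {8: 3, 12: 7, 16: 12, 17: 8, 19: 5}
print(modes(table))  # [16]

# Step 1 for raw data: Counter builds the frequency table.
raw = [3, 5, 5, 7, 7, 9]  # hypothetical data, invented for illustration
print(modes(Counter(raw)))  # [5, 7], i.e. two modes
```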

Mean, median and mode in normal and skewed distributions

Bimodal and multimodal data
-  Bimodal (two modes)
-  Multimodal (several modes)

Which measure of centre is best?

•  The mean is generally the most commonly used
•  It is sensitive to extreme values
•  If the data are skewed or extreme values are present, the median is better, e.g. real estate prices
•  The mode is generally best for categorical data, e.g. restaurant service quality (ordinal, below): the mode is "Very good"

Rating          # customers
Excellent       20
Very good       50
Good            30
Satisfactory    12
Poor            10
Very Poor       6

Measures of dispersion (variability)

•  Measures of dispersion tell you how spread out the other values of the distribution are around the central tendency
•  Measures of dispersion:
-  The range, quartile range, and quartile deviation
-  Variance and standard deviation

Why do we need measures of dispersion?

Why measures of dispersion? (1)

•  Two data sets of midterm marks of 5 students:
-  First set: 100, 40, 40, 35, 35 => Mean: 50
-  Second set: 70, 55, 50, 40, 35 => Mean: 50

=> Which mean (first or second) is more reliable?

•  We need to know the spread of the other values around the central tendency; this is especially important when analysing the stock market.
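To make the comparison concrete, the sketch below computes the range and the sample standard deviation (both defined later in this lecture) for the two sets of marks. It uses Python's standard `statistics` module; the variable names are assumptions.

```python
import statistics

first_set = [100, 40, 40, 35, 35]
second_set = [70, 55, 50, 40, 35]

for name, marks in (("first", first_set), ("second", second_set)):
    mean = statistics.mean(marks)
    data_range = max(marks) - min(marks)   # largest value minus smallest value
    sample_sd = statistics.stdev(marks)    # sample S.D, divides by n - 1
    print(f"{name} set: mean={mean}, range={data_range}, s={sample_sd:.2f}")
```

Both sets have mean 50, but the first has a range of 65 and a standard deviation of about 28.06, against 35 and about 13.69 for the second, so the second mean describes its data more closely.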

Why measures of dispersion? (2)

Range

•  Range is the difference between the largest and smallest values => sort the data before computing the range
•  Formula: Range = maximum value - minimum value
•  Advantages of range: easy to calculate for ungrouped data.
•  Disadvantages:
-  Takes into account only two values
-  Affected by one or two extreme values
-  More difficult to calculate for grouped data

Quartiles

•  Quartiles are defined as the values of the observations which are a quarter of the way through the data
-  Q1 - the first quartile: the value of the observation below which 25% of observations fall
-  Q2 - the second quartile: the median (50% of the observations fall below)
-  Q3 - the third quartile: the value of the observation below which 75% of observations fall

Quartile range and quartile deviation

•  Quartile range = Q3 - Q1
•  Quartile deviation = $\frac{Q_3 - Q_1}{2}$
•  Advantages of quartile deviation (semi-interquartile range): less affected by extreme values
•  Disadvantages: takes into account only 50% of the data
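A small sketch of these definitions, using hypothetical data invented for illustration. Several quartile conventions exist; this one assumes the (n+1)/4 and 3(n+1)/4 position rule that this lecture quotes for boxplots, which is what `statistics.quantiles` with `method="exclusive"` implements (Python 3.8 or later).

```python
import statistics

# Hypothetical data, n = 7, so the quartile positions are
# (n + 1) / 4 = 2, 2(n + 1) / 4 = 4 and 3(n + 1) / 4 = 6.
data = [11, 11, 13, 14, 16, 17, 19]

q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
quartile_range = q3 - q1
quartile_deviation = quartile_range / 2   # semi-interquartile range

print(q1, q2, q3)           # 11.0 14.0 17.0
print(quartile_range)       # 6.0
print(quartile_deviation)   # 3.0
```

Q2 equals the median of the data, as the definition of the second quartile requires.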

Variance

•  Variance from population:

$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

•  Variance from sample:

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

•  Advantages:
-  Takes into account all values
•  Disadvantages: the unit of variance (the squared unit of the data) has no direct meaning

Standard deviation (σ)

•  Standard deviation (S.D) is the square root of variance
•  S.D from population:

$\sigma = \sqrt{\sigma^2}$

•  S.D from sample:

$s = \sqrt{s^2}$

•  Advantages:
-  Overcomes the disadvantage of the meaningless unit of variance
-  The most widely used measure of dispersion (the bigger its value => the more spread out the data are)
-  Easy to interpret the result.
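The population and sample formulas differ only in the divisor (N versus n - 1). The sketch below spells both out on the second set of midterm marks used earlier in this lecture; the function names are assumptions, and Python's `statistics.pvariance` and `statistics.variance` should give the same results.

```python
import math


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


marks = [70, 55, 50, 40, 35]   # mean 50, squared deviations sum to 750

print(population_variance(marks))              # 150.0
print(sample_variance(marks))                  # 187.5
print(math.sqrt(population_variance(marks)))   # population S.D
print(math.sqrt(sample_variance(marks)))       # sample S.D
```

The variances are in squared marks; taking the square root returns to the original unit.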

Application of this in finance
•  The variance (or S.D) of an investment can be used as a measure of risk, e.g. on profits/returns.
•  Larger variance => larger risk
•  Usually, a higher rate of return comes with higher risk

Example - 2 funds over 10 years (1)

•  Rates of return (%):

A:    8.3  -6.2  20.9  -2.7  33.6  42.9  24.4   5.2   3.1  30.5
B:   12.1  -2.8   6.4  12.2  27.8  25.3  18.2  10.7  -1.3  11.4

$\bar{x}_A = 16\%$, $\bar{x}_B = 12\%$

$s_A^2 = 280.34\,(\%)^2$, $s_B^2 = 99.37\,(\%)^2$

•  Which fund would you invest in?
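The summary figures for the two funds can be reproduced from the ten yearly returns with Python's standard `statistics` module. This is a sketch for checking the arithmetic; the variable names are assumptions.

```python
import statistics

# Yearly rates of return in percent, as listed in the example.
fund_a = [8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5]
fund_b = [12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4]

summary = {}
for name, returns in (("A", fund_a), ("B", fund_b)):
    mean = statistics.mean(returns)
    variance = statistics.variance(returns)   # sample variance, divides by n - 1
    sd = statistics.stdev(returns)            # sample S.D, in percent
    summary[name] = (mean, variance, sd)
    print(f"Fund {name}: mean={mean:.2f}%, s^2={variance:.2f}, s={sd:.2f}%")
```

Fund A comes out at a mean of about 16% with a sample variance of about 280.34, and Fund B at about 12% with a variance of about 99.37.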

Example - 2 funds over 10 years (2)

•  It depends on how risk-averse you are: Fund A has higher risk, but also a higher average rate of return.

Empirical rule, or the law of 3σ

•  For a normal (bell-shaped, symmetrical) distribution:
-  68.27% of all observations fall within 1 standard deviation of the mean, i.e. in the range:
$(\bar{x} - 1s) \leftrightarrow (\bar{x} + 1s)$
-  95.45% of all observations fall within 2 standard deviations of the mean, i.e. in the range:
$(\bar{x} - 2s) \leftrightarrow (\bar{x} + 2s)$
-  99.73% of all observations fall within 3 standard deviations of the mean, i.e. in the range:
$(\bar{x} - 3s) \leftrightarrow (\bar{x} + 3s)$
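The three percentages follow from the normal distribution: the probability of lying within k standard deviations of the mean is erf(k / sqrt(2)). A minimal sketch using only the standard `math` module; the function name is an assumption.

```python
import math


def share_within(k):
    """Probability that a normally distributed value lies within
    k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} S.D: {100 * share_within(k):.2f}%")
```

This prints roughly 68.27%, 95.45% and 99.73%. The figures hold for a normal distribution only; for other shapes they are at best approximations.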

Meaning of the law of 3σ

•  Convert z-score to probability (next lecture)
•  Identify outliers

Boxplot

Here is the boxplot of the heights of international students studying at UNSW.

[Figure: "Boxplot of Height". Vertical axis: Height, from 150 to 200. Labelled parts, from top to bottom: whisker, upper quartile, median (inside the box), lower quartile, whisker.]

Boxplots

•  Need MEDIAN and QUARTILES to create a boxplot
•  MEDIAN = middle of observations, i.e. 1/2 of the way through the observations
•  QUARTILES = mark quarter points of observations, i.e. 1/4 (Q1) and 3/4 (Q3) of the way through the data [positions (n+1)/4 and 3(n+1)/4]
•  INTERQUARTILE RANGE (IQR) = Q3 - Q1
•  Whiskers: max length is 1.5*IQR; each stretches from the box to the furthest data point within this range
•  Points further out from the box are marked with stars and called outliers

Shapes of Boxplots

•  Skewness/symmetry
•  Modality
•  Range

[Figure: "Boxplot of Symmetric, Positive skew, Negative skew, Bimodal". Vertical axis: Data, from -5.0 to 5.0. Panels: Symmetric, Positive skew, Negative skew, Bimodal.]
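The boxplot rules above (quartiles at positions (n+1)/4 and 3(n+1)/4, whiskers of at most 1.5*IQR, outliers beyond them) can be sketched numerically. The heights below are hypothetical, invented for illustration and not the UNSW data; `statistics.quantiles` with `method="exclusive"` is used because it follows the same (n+1) position rule.

```python
import statistics

# Hypothetical heights in cm, n = 11, invented for illustration only.
heights = [150, 160, 165, 168, 170, 172, 175, 178, 180, 185, 210]

q1, median, q3 = statistics.quantiles(heights, n=4, method="exclusive")
iqr = q3 - q1

# Whiskers may reach at most 1.5 * IQR beyond the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

inside = [h for h in heights if lower_limit <= h <= upper_limit]
outliers = [h for h in heights if h < lower_limit or h > upper_limit]

# Each whisker ends at the furthest data point still inside the limits.
whisker_low, whisker_high = min(inside), max(inside)

print(q1, median, q3, iqr)         # 165.0 172.0 180.0 15.0
print(whisker_low, whisker_high)   # 150 185
print(outliers)                    # [210]
```

Note that the upper whisker stops at 185, the last real data point, and not at the limit of 202.5.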

Coefficient of skewness (C of S)

•  This measures the shape of the distribution
•  There are several measures of skewness.
•  Below is a common one: Pearson's coefficient of skewness.

$\text{C of S} = \frac{3 \times (\text{mean} - \text{median})}{\text{standard deviation}}$

•  If C of S is nearly +1 or -1, the distribution is highly skewed
•  If C of S is positive => distribution is skewed to the right (positive skew)
•  If C of S is negative => distribution is skewed to the left (negative skew)

Activity 1

•  Summary statistics of two data sets are as follows

                      Set 1: Ages of students     Set 2: Wages of staff
                      studying at UNSW
Mean                  22.4839                     294.3
Median                21                          292.5
Standard deviation    6.3756                      125.93

Compute Pearson's coefficient of skewness for these data sets and describe the shapes of their distributions
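One way to check an answer to Activity 1 is to code the formula directly. This is a sketch; the function name is an assumption, and the inputs are the summary statistics given in the activity.

```python
def pearson_skewness(mean, median, standard_deviation):
    """Pearson's coefficient of skewness: 3 * (mean - median) / S.D."""
    return 3 * (mean - median) / standard_deviation


ages = pearson_skewness(22.4839, 21, 6.3756)
wages = pearson_skewness(294.3, 292.5, 125.93)

print(f"Set 1 (ages):  {ages:.3f}")
print(f"Set 2 (wages): {wages:.3f}")
```

The ages coefficient is about 0.70 (clearly positive, so skewed to the right), while the wages coefficient is about 0.04 (close to zero, so nearly symmetrical).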

Distribution shapes

[Figure: two histograms. Frequency by age: skewed to the right. Frequency by wages: nearly normal.]

Investigating the relationship between variables

•  Methods:
-  Table: cross-table
-  Charts:
o Multiple bar chart
o Scatterplot (mentioned in lecture 8)

Cross-table

•  A cross-table is used to investigate the relationship between two categorical variables, or discrete variables with few values.

•  Note:
-  Need to identify the dependent and independent variables.
-  Know how to calculate row and column percentages
-  Rule of thumb: independent variable in the rows and dependent variable in the columns

•  EX: use the gss.sav data file to explore the relationship between internet use and degree
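Row and column percentages can be sketched in plain Python. The counts below are hypothetical (invented for illustration, not taken from gss.sav), and treating degree as the independent variable and internet use as the dependent one is an assumption made for this example; the function names are also assumptions.

```python
# Hypothetical counts, NOT from gss.sav.
# Rows: degree (independent variable); columns: internet use (dependent variable).
counts = {
    "No degree": {"Uses internet": 30, "Does not": 70},
    "Degree": {"Uses internet": 80, "Does not": 20},
}


def row_percentages(table):
    """Each cell as a percentage of its row total."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: 100 * n / row_total for col, n in cells.items()}
    return result


def column_percentages(table):
    """Each cell as a percentage of its column total."""
    column_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            column_totals[col] = column_totals.get(col, 0) + n
    return {
        row: {col: 100 * n / column_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }


print(row_percentages(counts))
print(column_percentages(counts))
```

With the independent variable in the rows, the row percentages are the ones to compare: here 30% of the "No degree" row uses the internet against 80% of the "Degree" row.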

Multiple bar chart

•  We can use a multiple bar chart to explore the relationship between variables.
•  The skill is knowing how to draw the chart
•  EX: use the gss.sav data file to explore the relationship between internet use, age, and degree
