Statistical and Econometric Methods for Transportation Data Analysis
Second Edition

Simon P. Washington
Matthew G. Karlaftis
Fred L. Mannering
Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2011 by Taylor and Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1
International Standard Book Number-13: 978-1-4200-8286-9 (Ebook-PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com
To David, Judy, Karen, Tracy, Devon, and Samantha
Simon
To Amy, George, John, Stavriani, Nikolas
Matt
To Jill, Willa, and Freyda
Fred
Contents
Preface .....................................................................................................................xv
Part I Fundamentals
1. Statistical Inference I: Descriptive Statistics ............................................3
1.1 Measures of Relative Standing ............................................................3
1.2 Measures of Central Tendency ............................................................4
1.3 Measures of Variability ........................................................................5
1.4 Skewness and Kurtosis ........................................................................9
1.5 Measures of Association .................................................................... 11
1.6 Properties of Estimators ..................................................................... 14
1.6.1 Unbiasedness .......................................................................... 14
1.6.2 Efficiency ................................................................................. 15
1.6.3 Consistency ............................................................................. 16
1.6.4 Sufficiency ............................................................................... 16
1.7 Methods of Displaying Data ............................................................. 17
1.7.1 Histograms ............................................................................. 17
1.7.2 Ogives ...................................................................................... 18
1.7.3 Box Plots .................................................................................. 19
1.7.4 Scatter Diagrams .................................................................... 19
1.7.5 Bar and Line Charts............................................................... 20
2. Statistical Inference II: Interval Estimation, Hypothesis Testing
and Population Comparisons..................................................................... 25
2.1 Confidence Intervals ........................................................................... 25
2.1.1 Confidence Interval for µ with Known σ² ............................. 26
2.1.2 Confidence Interval for the Mean with Unknown
Variance ................................................................................... 28
2.1.3 Confidence Interval for a Population Proportion.............. 28
2.1.4 Confidence Interval for the Population Variance .............. 29
2.2 Hypothesis Testing .............................................................................30
2.2.1 Mechanics of Hypothesis Testing........................................ 31
2.2.2 Formulating One- and Two-Tailed Hypothesis Tests ..... 33
2.2.3 The p-Value of a Hypothesis Test ........................................ 36
2.3 Inferences Regarding a Single Population ...................................... 36
2.3.1 Testing the Population Mean with Unknown
Variance ................................................................................... 37
2.3.2 Testing the Population Variance .......................................... 38
2.3.3 Testing for a Population Proportion .................................... 38
2.4 Comparing Two Populations ............................................................. 39
2.4.1 Testing Differences between Two Means: Independent Samples ................................................................ 39
2.4.2 Testing Differences between Two Means: Paired Observations ..................................................................42
2.4.3 Testing Differences between Two Population Proportions ..............................................................................43
2.4.4 Testing the Equality of Two Population Variances ........... 45
2.5 Nonparametric Methods .................................................................... 46
2.5.1 Sign Test .................................................................................. 47
2.5.2 Median Test............................................................................. 52
2.5.3 Mann–Whitney U Test .......................................................... 52
2.5.4 Wilcoxon Signed-Rank Test for Matched Pairs ................. 55
2.5.5 Kruskal–Wallis Test ............................................................... 56
2.5.6 Chi-Square Goodness-of-Fit Test ......................................... 58
Part II Continuous Dependent Variable Models
3. Linear Regression .........................................................................................63
3.1 Assumptions of the Linear Regression Model ...............................63
3.1.1 Continuous Dependent Variable Y ......................................64
3.1.2 Linear-in-Parameters Relationship between Y and X ......64
3.1.3 Observations Independently and Randomly
Sampled ...................................................................................65
3.1.4 Uncertain Relationship between Variables ........................65
3.1.5 Disturbance Term Independent of X and Expected
Value Zero ...............................................................................65
3.1.6 Disturbance Terms Not Autocorrelated ............................. 66
3.1.7 Regressors and Disturbances Uncorrelated....................... 66
3.1.8 Disturbances Approximately Normally Distributed ....... 66
3.1.9 Summary................................................................................. 67
3.2 Regression Fundamentals.................................................................. 67
3.2.1 Least Squares Estimation...................................................... 69
3.2.2 Maximum Likelihood Estimation ....................................... 73
3.2.3 Properties of OLS and MLE Estimators.............................. 74
3.2.4 Inference in Regression Analysis ........................................ 75
3.3 Manipulating Variables in Regression ............................................. 79
3.3.1 Standardized Regression Models ........................................ 79
3.3.2 Transformations .....................................................................80
3.3.3 Indicator Variables ................................................................. 82
3.4 Estimate a Single Beta Parameter .....................................................83
3.5 Estimate Beta Parameter for Ranges of a Variable .........................83
3.6 Estimate a Single Beta Parameter for m – 1 of the m Levels of a Variable ..............................................................................................84
3.6.1 Interactions in Regression Models ......................................84
3.7 Checking Regression Assumptions ................................................. 87
3.7.1 Linearity .................................................................................. 88
3.7.2 Homoscedastic Disturbances ...............................................90
3.7.3 Uncorrelated Disturbances ................................................... 93
3.7.4 Exogenous Independent Variables ...................................... 93
3.7.5 Normally Distributed Disturbances ................................... 95
3.8 Regression Outliers ............................................................................. 98
3.8.1 The Hat Matrix for Identifying Outlying Observations ........................................................................... 99
3.8.2 Standard Measures for Quantifying Outlier Influence ................................................................................ 101
3.8.3 Removing Influential Data Points from the Regression ............................................................................. 101
3.9 Regression Model GOF Measures .................................................. 106
3.10 Multicollinearity in the Regression ................................................ 110
3.11 Regression Model-Building Strategies ........................................... 112
3.11.1 Stepwise Regression ............................................................ 112
3.11.2 Best Subsets Regression ...................................................... 113
3.11.3 Iteratively Specified Tree-Based Regression .................... 113
3.12 Estimating Elasticities ...................................................................... 113
3.13 Censored Dependent Variables—Tobit Model.............................. 114
3.14 Box–Cox Regression.......................................................................... 116
4. Violations of Regression Assumptions.................................................. 123
4.1 Zero Mean of the Disturbances Assumption................................ 123
4.2 Normality of the Disturbances Assumption ................................ 124
4.3 Uncorrelatedness of Regressors and Disturbances
Assumption........................................................................................ 125
4.4 Homoscedasticity of the Disturbances Assumption ................... 127
4.4.1 Detecting Heteroscedasticity ............................................. 129
4.4.2 Correcting for Heteroscedasticity ..................................... 131
4.5 No Serial Correlation in the Disturbances Assumption ............. 135
4.5.1 Detecting Serial Correlation ............................................... 137
4.5.2 Correcting for Serial Correlation ....................................... 139
4.6 Model Specification Errors .............................................................. 142
5. Simultaneous-Equation Models .............................................................. 145
5.1 Overview of the Simultaneous-Equations Problem..................... 145
5.2 Reduced Form and the Identification Problem ............................. 146
5.3 Simultaneous-Equation Estimation ................................................ 148
5.3.1 Single-Equation Methods ................................................... 148
5.3.2 System-Equation Methods .................................................. 149
5.4 Seemingly Unrelated Equations ..................................................... 155
5.5 Applications of Simultaneous Equations to Transportation
Data ..................................................................................................... 156
Appendix 5A. A Note on GLS Estimation ............................................... 159
6. Panel Data Analysis ................................................................................... 161
6.1 Issues in Panel Data Analysis.......................................................... 161
6.2 One-Way Error Component Models............................................... 163
6.2.1 Heteroscedasticity and Serial Correlation ....................... 166
6.3 Two-Way Error Component Models............................................... 167
6.4 Variable-Parameter Models ............................................................. 172
6.5 Additional Topics and Extensions .................................................. 173
7. Background and Exploration in Time Series ........................................ 175
7.1 Exploring a Time Series ................................................................... 176
7.1.1 Trend Component ................................................................ 176
7.1.2 Seasonal Component ........................................................... 176
7.1.3 Irregular (Random) Component ........................................ 179
7.1.4 Filtering of Time Series ....................................................... 179
7.1.5 Curve Fitting......................................................................... 179
7.1.6 Linear Filters and Simple Moving Averages .................... 179
7.1.7 Exponential Smoothing Filters .......................................... 180
7.1.8 Difference Filter.................................................................... 185
7.2 Basic Concepts: Stationarity and Dependence.............................. 188
7.2.1 Stationarity............................................................................ 188
7.2.2 Dependence .......................................................................... 188
7.2.3 Addressing Nonstationarity .............................................. 190
7.2.4 Differencing and Unit-Root Testing .................................. 191
7.2.5 Fractional Integration and Long Memory ........................ 194
7.3 Time Series in Regression ................................................................ 197
7.3.1 Serial Correlation ................................................................. 197
7.3.2 Dynamic Dependence ......................................................... 197
7.3.3 Volatility ................................................................................ 198
7.3.4 Spurious Regression and Cointegration ........................... 200
7.3.5 Causality ............................................................................... 202
8. Forecasting in Time Series: Autoregressive Integrated Moving
Average (ARIMA) Models and Extensions ..........................................207
8.1 Autoregressive Integrated Moving Average Models ................... 207
8.2 Box–Jenkins Approach ..................................................................... 210
8.2.1 Order Selection ..................................................................... 210
8.2.2 Parameter Estimation .......................................................... 212
8.2.3 Diagnostic Checking ........................................................... 213
8.2.4 Forecasting ............................................................................ 214
8.3 Autoregressive Integrated Moving Average Model Extensions .......................................................................................... 218
8.3.1 Random Parameter Autoregressive Models .................... 219
8.3.2 Stochastic Volatility Models ...............................................222
8.3.3 Autoregressive Conditional Duration Models ................ 224
8.3.4 Integer-Valued ARMA Models .......................................... 224
8.4 Multivariate Models .........................................................................225
8.5 Nonlinear Models ............................................................................. 227
8.5.1 Testing for Nonlinearity ..................................................... 227
8.5.2 Bilinear Models .................................................................... 228
8.5.3 Threshold Autoregressive Models .................................... 229
8.5.4 Functional Parameter Autoregressive Models ................ 230
8.5.5 Neural Networks ................................................................. 231
9. Latent Variable Models ............................................................................. 235
9.1 Principal Components Analysis ..................................................... 235
9.2 Factor Analysis .................................................................................. 241
9.3 Structural Equation Modeling ........................................................ 244
9.3.1 Basic Concepts in Structural Equation Modeling ........... 246
9.3.2 Fundamentals of Structural Equation Modeling ............ 249
9.3.3 Nonideal Conditions in the Structural Equation
Model ..................................................................................... 251
9.3.4 Model Goodness-of-Fit Measures...................................... 252
9.3.5 Guidelines for Structural Equation Modeling ................. 255
10. Duration Models ......................................................................................... 259
10.1 Hazard-Based Duration Models ..................................................... 259
10.2 Characteristics of Duration Data .................................................... 263
10.3 Nonparametric Models .................................................................... 264
10.4 Semiparametric Models ................................................................... 265
10.5 Fully Parametric Models .................................................................. 268
10.6 Comparisons of Nonparametric, Semiparametric, and Fully
Parametric Models ............................................................................ 272
10.7 Heterogeneity .................................................................................... 274
10.8 State Dependence .............................................................................. 276
10.9 Time-Varying Covariates ................................................................. 277
10.10 Discrete-Time Hazard Models ........................................................ 277
10.11 Competing Risk Models................................................................... 279
Part III Count and Discrete Dependent Variable Models
11. Count Data Models .................................................................................... 283
11.1 Poisson Regression Model ............................................................... 283
11.2 Interpretation of Variables in the Poisson Regression Model ....284
11.3 Poisson Regression Model Goodness-of-Fit Measures ................ 286
11.4 Truncated Poisson Regression Model ............................................ 290
11.5 Negative Binomial Regression Model ............................................ 292
11.6 Zero-Inflated Poisson and Negative Binomial Regression
Models ................................................................................................ 295
11.7 Random-Effects Count Models .......................................................300
12. Logistic Regression .................................................................................... 303
12.1 Principles of Logistic Regression .................................................... 303
12.2 Logistic Regression Model...............................................................304
13. Discrete Outcome Models ........................................................................309
13.1 Models of Discrete Data ...................................................................309
13.2 Binary and Multinomial Probit Models......................................... 310
13.3 Multinomial Logit Model ................................................................ 312
13.4 Discrete Data and Utility Theory ................................................... 316
13.5 Properties and Estimation of MNL Models .................................. 318
13.5.1 Statistical Evaluation ........................................................... 321
13.5.2 Interpretation of Findings .................................................. 323
13.5.3 Specification Errors .............................................................. 325
13.5.4 Data Sampling ...................................................................... 330
13.5.5 Forecasting and Aggregation Bias .................................... 331
13.5.6 Transferability ...................................................................... 333
13.6 Nested Logit Model (Generalized Extreme Value Models) ...... 334
13.7 Special Properties of Logit Models.................................................342
14. Ordered Probability Models ....................................................................345
14.1 Models for Ordered Discrete Data .................................................345
14.2 Ordered Probability Models with Random Effects ..................... 352
14.3 Limitations of Ordered Probability Models .................................. 358
15. Discrete/Continuous Models ................................................................... 361
15.1 Overview of the Discrete/Continuous Modeling Problem ........ 361
15.2 Econometric Corrections: Instrumental Variables and
Expected Value Method ................................................................... 363
15.3 Econometric Corrections: Selectivity-Bias Correction Term....... 366
15.4 Discrete/Continuous Model Structures ........................................ 368
15.5 Transportation Application of Discrete/Continuous Model
Structures ........................................................................................... 372
Part IV Other Statistical Methods
16. Random-Parameter Models ...................................................................... 375
16.1 Random-Parameter Multinomial Logit Model (Mixed Logit
Model) ................................................................................................. 375
16.2 Random-Parameter Count Models ................................................. 381
16.3 Random-Parameter Duration Models ............................................384
17. Bayesian Models ......................................................................................... 387
17.1 Bayes’ Theorem ................................................................................. 387
17.2 MCMC Sampling–Based Estimation .............................................. 389
17.3 Flexibility of Bayesian Statistical Models via MCMC
Sampling–Based Estimation ............................................................ 395
17.4 Convergence and Identifiability Issues with MCMC
Bayesian Models ................................................................................ 396
17.5 Goodness-of-Fit, Sensitivity Analysis, and Model Selection
Criterion Using MCMC Bayesian Models ..................................... 399
Appendix A Statistical Fundamentals ....................................................... 403
A.1 Matrix Algebra Review .................................................................... 403
A.1.1 Matrix Multiplication ..........................................................404
A.1.2 Linear Dependence and Rank of a Matrix ....................... 406
A.1.3 Matrix Inversion (Division) ................................................ 406
A.1.4 Eigenvalues and Eigenvectors ............................................408
A.1.5 Useful Matrices and Properties of Matrices .................... 409
A.1.6 Matrix Algebra and Random Variables ............................ 410
A.2 Probability, Conditional Probability, and Statistical
Independence..................................................................................... 412
A.3 Estimating Parameters in Statistical Models—Least Squares
and Maximum Likelihood............................................................... 413
A.4 Useful Probability Distributions..................................................... 415
A.4.1 The Z Distribution ............................................................... 416
A.4.2 The t Distribution ................................................................ 417
A.4.3 The χ² Distribution .............................................................. 418
A.4.4 The F Distribution................................................................ 419
Appendix B Glossary of Terms ..................................................................... 421
Appendix C Statistical Tables ....................................................................... 459
Appendix D Variable Transformations ...................................................... 483
D.1 Purpose of Variable Transformations ............................................ 483
D.2 Commonly Used Variable Transformations..................................484
D.2.1 Parabolic Transformations ..................................................484
D.2.2 Hyperbolic Transformations .............................................. 485
D.2.3 Exponential Functions ........................................................ 485
D.2.4 Inverse Exponential Functions .......................................... 487
D.2.5 Power Functions ................................................................... 488
References ........................................................................................................... 489
Index ..................................................................................................................... 511
Preface
Transportation is integral to developed societies. It is responsible for personal mobility, which includes access to services, goods, and leisure. It is also
a key element in the delivery of consumer goods. Regional, state, national,
and the world economies rely upon the efficient and safe functioning of
transportation facilities.
Besides the sweeping influence transportation has on economic and social
aspects of modern society, transportation issues pose challenges to professionals across a wide range of disciplines, including transportation engineers, urban and regional planners, economists, logisticians, psychologists,
systems and safety engineers, social scientists, law enforcement and security professionals, and consumer theorists. Where to place and expand transportation infrastructure; how to safely and efficiently operate and maintain
infrastructure; and how to spend valuable resources to improve mobility
and access to goods, services, and health care are among the decisions made
routinely by transportation-related professionals.
Many transportation-related problems and challenges involve stochastic processes that are influenced by observed and unobserved factors in
unknown ways. The stochastic nature of transportation problems is largely
a result of the role that people play in transportation. Transportation system
users are routinely faced with decisions in contexts such as what transportation mode to use, which vehicle to purchase, whether to participate in a
vanpool or telecommute, where to relocate a business, whether to support
a proposed light-rail project, and whether to utilize traveler information
before or during a trip. These decisions involve various degrees of uncertainty. Transportation system managers and governmental agencies face
similar stochastic problems in determining how to measure and compare
system measures of performance, where to invest in safety improvements,
how to efficiently operate transportation systems, and how to estimate transportation demand.
As a result of the complexity, diversity, and stochastic nature of transportation problems, the analytical toolbox required of the transportation analyst must be broad. This book describes and illustrates some of the tools
commonly used in transportation data analysis. Every book must achieve
a balance between depth and breadth of theory and applications, given the
intended audience. This book targets two general audiences. First, it serves
as a textbook for advanced undergraduate, masters, and Ph.D. students in
transportation-related disciplines, including engineering, economics, urban
and regional planning, and sociology. There is sufficient material to cover
two three-unit semester courses in analytical methods. Alternatively, a one-semester course could consist of a subset of topics covered in this book.
The publisher’s Web site, www.crcpress.com, contains the datasets used to
develop this book so that applied-modeling problems will reinforce the modeling techniques discussed throughout the text. To facilitate teaching from
this text, the Web site also contains Microsoft PowerPoint® presentations for
each of the chapters in the book. These presentations, new to the second edition, will significantly improve the adoptability of this text for college, university, and professional instructors.
The book also serves as a technical reference for researchers and practitioners wishing to examine and understand a broad range of analytical
tools required to solve transportation problems. It provides a wide breadth
of transportation examples and case studies covering applications in various aspects of transportation planning, engineering, safety, and economics.
Sufficient analytical rigor is provided in each chapter so that fundamental
concepts and principles are clear and numerous references are provided for
those seeking additional technical details and applications.
Part I of the book provides statistical fundamentals (Chapters 1 and 2).
This section is useful for refreshing fundamentals and for sufficiently preparing students for the following sections.
Part II of the book presents continuous dependent variable models. The
chapter on linear regression (Chapter 3) devotes additional pages to introduce common modeling practice—examining residuals, creating indicator
variables, and building statistical models—and thus serves as a logical starting chapter for readers new to statistical modeling. The subsection on Tobit
and censored regressions is new to the second edition. Chapter 4 discusses
the impacts of failing to meet linear regression assumptions and presents
corresponding solutions. Chapter 5 deals with simultaneous equation models and presents modeling methods appropriate when studying two or more
interrelated dependent variables. Chapter 6 presents methods for analyzing
panel data—data obtained from repeated observations on sampling units
over time, such as household surveys conducted several times to a sample of
households. When data are collected continuously over time, such as hourly,
daily, weekly, or yearly, time series methods and models are often needed
and are discussed in Chapters 7 and 8. New to the second edition is explicit
treatment of frequency domain time series analysis, including Fourier and
wavelets analysis methods. Latent variable models, discussed in Chapter 9,
are used when the dependent variable is not directly observable and is
approximated with one or more surrogate variables. The final chapter in this
section, Chapter 10, presents duration models, which are used to model time-until-event data as survival, hazard, and decay processes.
Part III in the book presents count and discrete dependent variable models.
Count models (Chapter 11) arise when the data of interest are nonnegative
integers. Examples of such data include vehicles in a queue and the number
of vehicle crashes per unit time. Zero inflation—a phenomenon observed
frequently with count data—is discussed in detail, and a new example and
corresponding data set have been added in this second edition. Logistic
regression, commonly used to model probabilities of binary outcomes, is presented in Chapter 12 and is new to the second edition. Discrete outcome
models are extremely useful in many study applications, and are described
in detail in Chapter 13. A unique feature of the book is that discrete outcome
models are first considered statistically, and then later related to economic
theories of consumer choice. Ordered probability models (a new chapter for
the second edition) are presented in Chapter 14. Discrete/continuous models
are presented in Chapter 15 and demonstrate that interrelated discrete and
continuous data need to be modeled as a system rather than individually,
such as the choice of which vehicle to drive and how far it will be driven.
Finally, Part IV of the book contains new chapters on random-parameter
models (Chapter 16) and Bayesian statistical modeling (Chapter 17). Random-parameter models are starting to gain wide acceptance across many fields of
study, and this chapter provides a basic introduction to this exciting newer
class of models. The chapter on Bayesian statistical models arises from
the increasing prevalence of Bayesian inference and Markov Chain Monte
Carlo methods (an analytically convenient method for estimating complex
Bayes’ models). This chapter presents the basic theory of Bayesian models and of Markov Chain Monte Carlo sampling methods, and provides two separate examples of Bayes’ models.
The appendices are complementary to the remainder of the book.
Appendix A presents fundamental concepts in statistics, which support the analytical methods discussed throughout the book. Appendix B is an alphabetical glossary of commonly used statistical terms and provides a quick and easy reference.
Appendix C provides tables of probability distributions used in the book,
while Appendix D describes typical uses of data transformations common
to many statistical methods.
While the book covers a wide variety of analytical tools for improving the
quality of research, it does not attempt to teach all elements of the research
process. Specifically, the development and selection of research hypotheses,
alternative experimental design methodologies, the virtues and drawbacks
of experimental versus observational studies, and issues involved with the
collection of data are not discussed. These issues are critical elements in the
conduct of research, and can drastically impact the overall results and quality of the research endeavor. It is considered a prerequisite that readers of
this book are educated and informed on these critical research elements to
appropriately apply the analytical tools presented herein.
Simon P. Washington
Matthew G. Karlaftis
Fred L. Mannering
Part I
Fundamentals
1
Statistical Inference I: Descriptive Statistics
This chapter examines methods and techniques for summarizing and
interpreting data. The discussion begins with an examination of numerical descriptive measures. These measures, commonly known as point estimators, support inferences about a population by estimating the values of
unknown population parameters using a single value (or point). The chapter
also describes commonly used graphical representations of data. Relative
to graphical methods, numerical methods provide precise and objectively
determined values that are easily manipulated, interpreted, and compared.
They permit a more careful analysis of data than the more general impressions
conveyed by graphical summaries. This is important when the data represent a sample from which population inferences must be made.
While this chapter concentrates on a subset of basic and fundamental issues
of statistical analyses, there are countless thorough introductory statistical
textbooks that can provide the interested reader with greater detail. For example, Aczel (1993) and Keller and Warrack (1997) provide detailed descriptions
and examples of descriptive statistics and graphical techniques. Tukey (1977)
is a classic reference on exploratory data analysis and graphical techniques.
For readers interested in the properties of estimators (Section 1.6), the books
by Gujarati (1992) and Baltagi (1998) are excellent, mathematically rigorous,
sources.
1.1 Measures of Relative Standing
A set of numerical observations can be ordered from smallest to largest
magnitude. This ordering allows the boundaries of the data to be defined
and supports comparisons of the relative position of specific observations.
Consider the usefulness of percentile rank in terms of measuring driving
speeds on a highway section. In this case, a driver’s speed is compared to the
speeds of all drivers who drove on the road segment during the measurement period, and the relative position of that speed within the group is expressed in
terms of a percentile. If, for example, the 85th percentile of speed is 63 mph,
then 85% of the sample of observed drivers was driving at speeds below 63 mph and 15% was driving at speeds above it. The Pth percentile is defined as that value below which lies P% of the values in the sample. For sufficiently
large samples, the position of the Pth percentile is given by (n + 1)P/100,
where n is the sample size.
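As an illustration, the following Python sketch implements this position rule (the speed values and the helper name percentile_by_position are hypothetical, invented for this example; linear interpolation between neighboring order statistics is assumed when the position is fractional):

```python
def percentile_by_position(data, p):
    """Pth percentile via the (n + 1)P/100 position rule, interpolating
    linearly between neighboring order statistics."""
    ordered = sorted(data)
    n = len(ordered)
    pos = (n + 1) * p / 100.0           # 1-based position in the ordered data
    pos = min(max(pos, 1.0), float(n))  # clamp to the observed range
    lower = int(pos)                    # order statistic at or below the position
    upper = min(lower + 1, n)
    frac = pos - lower                  # fractional part drives the interpolation
    return ordered[lower - 1] + frac * (ordered[upper - 1] - ordered[lower - 1])

# Hypothetical spot speeds (mph) observed on a highway section
speeds = [52, 55, 57, 58, 60, 61, 61, 62, 63, 64, 66, 68]
print(percentile_by_position(speeds, 85))  # 85th percentile speed
```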
Quartiles are the percentage points that separate the data into quarters:
first quarter, below which lies one quarter of the data, making it the 25th percentile; second quarter, or 50th percentile, below which lies half of the data;
third quarter, or 75th percentile point. The 25th percentile is often referred
to as the lower or first quartile, the 50th percentile as the median or middle
quartile, and the 75th percentile as the upper or third quartile. Finally, the
interquartile range (IQR), a measure of the data spread, is defined as the
numerical difference between the first and third quartiles.
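Continuing the sketch above, the quartiles and the IQR follow directly from the same hypothetical helper:

```python
q1 = percentile_by_position(speeds, 25)  # lower (first) quartile
q2 = percentile_by_position(speeds, 50)  # median (second quartile)
q3 = percentile_by_position(speeds, 75)  # upper (third) quartile
iqr = q3 - q1                            # spread of the middle 50% of the data
print(q1, q2, q3, iqr)
```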
1.2 Measures of Central Tendency
Quartiles and percentiles are measures of the relative positions of points
within a given data set. The median constitutes a useful point because it lies
in the center of the data, with half of the data points lying above and half
below the median. The median constitutes a measure of the “centrality” of
the observations, or central tendency.
Despite the existence of the median, by far the most popular and useful
measure of central tendency is the arithmetic mean, or, more succinctly, the
sample mean or expectation. The sample mean is another statistical term
that measures the central tendency, or average, of a sample of observations.
The sample mean varies across samples and thus is a random variable. The
mean of a sample of measurements $x_1, x_2, \ldots, x_n$ is defined as

$$\mathrm{MEAN}(X) = E[X] = \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{1.1}$$
where n is the size of the sample.
When an entire population is examined, the sample mean $\bar{X}$ is replaced
by µ, the population mean. Unlike the sample mean, the population mean is
constant. The formula for the population mean is
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} \tag{1.2}$$
where N is the size of the population.
The mode (or modes because it is possible to have more than one) of a
set of observations is the value that occurs most frequently, or the most
commonly occurring outcome, and strictly applies to discrete variables
(nominal and ordinal scale variables) as well as count data. Probabilistically,
it is the most likely outcome in the sample; it is observed more than any other
value. The mode can also be a measure of central tendency.
There are advantages and disadvantages of each of the three central tendency measures. The mean uses and summarizes all of the information in
the data in a single numerical measure, and has some desirable mathematical properties that make it useful in many statistical inference and modeling applications. The median, in contrast, is the centermost point
of ranked data. When computing the median, the exact locations of data
points on the number line are not considered; only their relative standing
with respect to the central observation is required. Herein lies the major
advantage of the median; it is resistant to extreme observations or outliers
in the data. The mean is, overall, the most frequently applied measure of
central tendency; however, in cases where the data contain numerous outlying observations the median may serve as a more reliable measure. Robust
statistical modeling approaches, much like the median, are designed to be
resistant to the influence of extreme observations.
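A brief numerical illustration of this resistance, using hypothetical travel-time data (values invented for this example) and Python's standard statistics module:

```python
import statistics

# Hypothetical travel times (minutes); the last observation is an extreme outlier
times = [12, 14, 15, 15, 16, 17, 95]

print(statistics.mean(times))    # about 26.3 (pulled upward by the outlier)
print(statistics.median(times))  # 15 (unaffected by the outlier)
print(statistics.mode(times))    # 15 (the most frequently observed value)
```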
If sample data are measured on the interval or ratio scale, then all three
measures of centrality (mean, median, and mode) are defined, provided that
the level of measurement precision does not preclude the determination of
a mode. When data are symmetric and unimodal, the mode, median, and
mean are approximately equal (the relative positions of the three measures
in cases of asymmetric distributions is discussed in Section 1.4). Finally, if the
data are qualitative (measured on the nominal or ordinal scales), using the
mean or median is senseless, and the mode must be used. For nominal data,
the mode is the category that contains the largest number of observations.
1.3 Measures of Variability
Variability is a statistical term used to describe and quantify the spread or
dispersion of data around the center (usually the mean). In most practical situations, knowledge of the average or expected value of a sample is not sufficient
to obtain an adequate understanding of the data. Sample variability provides a
measure of how dispersed the data are with respect to the mean (or other measures of central tendency). Figure 1.1 illustrates two distributions of data, one
that is highly dispersed and another that is relatively less dispersed around
the mean. There are several useful measures of variability, or dispersion. One
measure previously discussed is the IQR. Another measure is the range—
the difference between the largest and the smallest observations in the data.
While both the range and the IQR measure data dispersion, the IQR is more
resistant to outlying observations. The two most frequently used measures of
dispersion are the variance and its square root, the standard deviation.
FIGURE 1.1
Examples of high- and low-variability data. [Figure shows two distributions, labeled "Low variability" and "High variability."]
The variance and the standard deviation are typically more useful than the
range because, like the mean, they exploit all of the information contained in
the observations. The variance of a set of observations, or sample variance, is
the average squared deviation of the individual observations from the mean
and varies across samples. The sample variance is commonly used as an estimate of the population variance and is given by
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1} \tag{1.3}$$
When a collection of observations constitutes an entire population, the
variance is denoted by $\sigma^2$. Unlike the sample variance, the population variance is constant and is given by
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \tag{1.4}$$
where $\bar{X}$ in Equation 1.3 is replaced by $\mu$.
Because calculation of the variance involves squaring differences of the
raw data measurement scales, the measurement unit is the square of the original measurement scale—for example, the variance of measured distances in
meters is meters squared. While variance is a useful measure of the relative
variability of two sets of measurements, it is often desirable to express variability in the same measurement units as the raw data. Such a measure is
the square root of the variance, commonly known as the standard deviation.
The formulas for the sample and population standard deviations are given,
respectively, as
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}} \tag{1.5}$$
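As a minimal sketch of Equations 1.3 and 1.5, the computation can be checked against Python's standard statistics module, whose variance and stdev functions use the n − 1 divisor of the sample formulas (the speed data are the hypothetical values from the earlier example):

```python
import statistics

speeds = [52, 55, 57, 58, 60, 61, 61, 62, 63, 64, 66, 68]

mean = statistics.mean(speeds)
s2 = sum((x - mean) ** 2 for x in speeds) / (len(speeds) - 1)  # Equation 1.3
s = s2 ** 0.5                                                  # Equation 1.5

# The direct computation agrees with the library functions
assert abs(s2 - statistics.variance(speeds)) < 1e-9
assert abs(s - statistics.stdev(speeds)) < 1e-9
print(s2, s)
```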