19 Count Data and Related Models
19.1 Why Count Data Models?
A count variable is a variable that takes on nonnegative integer values. Many vari-
ables that we would like to explain in terms of covariates come as counts. A few
examples include the number of times someone is arrested during a given year,
number of emergency room drug episodes during a given week, number of cigarettes
smoked per day, and number of patents applied for by a firm during a year. These
examples have two important characteristics in common: there is no natural a priori
upper bound, and the outcome will be zero for at least some members of the popu-
lation. Other count variables do have an upper bound. For example, for the number
of children in a family who are high school graduates, the upper bound is the number
of children in the family.
If y is the count variable and x is a vector of explanatory variables, we are often
interested in the population regression, E(y|x). Throughout this book we have
discussed various models for conditional expectations, and we have discussed different
methods of estimation. The most straightforward approach is a linear model,
E(y|x) = xβ, estimated by OLS. For count data, linear models have shortcomings very
similar to those for binary responses or corner solution responses: because y ≥ 0, we
know that E(y|x) should be nonnegative for all x. If β̂ is the OLS estimator, there
usually will be values of x such that xβ̂ < 0, so that the predicted value of y is
negative.
For strictly positive variables, we often use the natural log transformation, log(y),
and use a linear model. This approach is not possible in interesting count data
applications, where y takes on the value zero for a nontrivial fraction of the
population. Transformations could be applied that are defined for all y ≥ 0 (for
example, log(1 + y)), but log(1 + y) itself is nonnegative, and it is not obvious how to
recover E(y|x) from a linear model for E[log(1 + y)|x]. With count data, it is better to
model E(y|x) directly and to choose functional forms that ensure positivity for any
value of x and any parameter values. When y has no upper bound, the most popular
of these is the exponential function, E(y|x) = exp(xβ).
In Chapter 12 we discussed nonlinear least squares (NLS) as a general method for
estimating nonlinear models of conditional means. NLS can certainly be applied to
count data models, but it is not ideal: NLS is relatively inefficient unless Var(y|x) is
constant (see Chapter 12), and all of the standard distributions for count data imply
heteroskedasticity.
In Section 19.2 we discuss the most popular model for count data, the Poisson re-
gression model. As we will see, the Poisson regression model has some nice features.
First, if y given x has a Poisson distribution (which used to be the maintained
assumption in count data contexts), then the conditional maximum likelihood
estimators are fully efficient. Second, the Poisson assumption turns out to be
unnecessary for consistent estimation of the conditional mean parameters. As we will
see in Section 19.2, the Poisson quasi-maximum likelihood estimator is fully robust to
distributional misspecification. It also maintains certain efficiency properties even
when the distribution is not Poisson.
In Section 19.3 we discuss other count data models, and in Section 19.4 we cover
quasi-MLEs for other nonnegative response variables. In Section 19.5 we cover
multiplicative panel data models, which are motivated by unobserved effects count
data models but can also be used for other nonnegative responses.
19.2 Poisson Regression Models with Cross Section Data
In Chapter 13 we used the basic Poisson regression model to illustrate maximum
likelihood estimation. Here, we study Poisson regression in much more detail, em-
phasizing the properties of the estimator when the Poisson distributional assumption
is incorrect.
19.2.1 Assumptions Used for Poisson Regression
The basic Poisson regression model assumes that y given x ≡ (x_1, …, x_K) has a
Poisson distribution, as in El Sayyad (1973) and Maddala (1983, Section 2.15). The
density of y given x under the Poisson assumption is completely determined by the
conditional mean m(x) ≡ E(y|x):

f(y|x) = exp[−m(x)][m(x)]^y / y!,   y = 0, 1, …   (19.1)
where y! is y factorial. Given a parametric model for m(x) [such as m(x) = exp(xβ)]
and a random sample {(x_i, y_i): i = 1, 2, …, N} on (x, y), it is fairly straightforward
to obtain the conditional MLEs of the parameters. The statistical properties then
follow from our treatment of CMLE in Chapter 13.
It has long been recognized that the Poisson distributional assumption imposes
restrictions on the conditional moments of y that are often violated in applications.
The most important of these is equality of the conditional variance and mean:

Var(y|x) = E(y|x)   (19.2)

The variance-mean equality has been rejected in numerous applications, and later we
show that assumption (19.2) is violated for fairly simple departures from the Poisson
model. Importantly, whether or not assumption (19.2) holds has implications for how
we carry out statistical inference. In fact, as we will see, it is assumption (19.2), not
the Poisson assumption per se, that is important for large-sample inference; this point
will become clear in Section 19.2.2. In what follows we refer to assumption (19.2) as
the Poisson variance assumption.
A weaker assumption allows the variance-mean ratio to be any positive constant:

Var(y|x) = σ² E(y|x)   (19.3)

where σ² > 0 is the variance-mean ratio. This assumption is used in the generalized
linear models (GLM) literature, and so we will refer to assumption (19.3) as the
Poisson GLM variance assumption. The GLM literature is concerned with quasi-
maximum likelihood estimation of a class of nonlinear models that contains Poisson
regression as a special case. We do not need to introduce the full GLM apparatus and
terminology to analyze Poisson regression. See McCullagh and Nelder (1989).
The case σ² > 1 is empirically relevant because it implies that the variance is
greater than the mean; this situation is called overdispersion (relative to the Poisson
case). One distribution for y given x where assumption (19.3) holds with
overdispersion is what Cameron and Trivedi (1986) call NegBin I, a particular
parameterization of the negative binomial distribution. When σ² < 1 we say there is
underdispersion. Underdispersion is less common than overdispersion, but
underdispersion has been found in some applications.
There are plenty of count distributions for which assumption (19.3) does not
hold—for example, the NegBin II model in Cameron and Trivedi (1986). Therefore,
we are often interested in estimating the conditional mean parameters without speci-
fying the conditional variance. As we will see, Poisson regression turns out to be well
suited for this purpose.
Given a parametric model m(x; β) for m(x), where β is a P × 1 vector of parameters,
the log likelihood for observation i is

l_i(β) = y_i log[m(x_i; β)] − m(x_i; β)   (19.4)

where we drop the term log(y_i!) because it does not depend on the parameters β (for
computational reasons dropping this term is a good idea in practice, too, as y_i! gets
very large for even moderate y_i). We let B ⊂ R^P denote the parameter space, which
is needed for the theoretical development but is practically unimportant in most
cases.
The most common mean function in applications is the exponential:

m(x; β) = exp(xβ)   (19.5)

where x is 1 × K and contains unity as its first element, and β is K × 1. Under
assumption (19.5) the log likelihood is l_i(β) = y_i x_i β − exp(x_i β). The parameters in
model (19.5) are easy to interpret. If x_j is continuous, then

∂E(y|x)/∂x_j = exp(xβ)β_j

and so

β_j = [∂E(y|x)/∂x_j] · [1/E(y|x)] = ∂log[E(y|x)]/∂x_j

Therefore, 100β_j is the semielasticity of E(y|x) with respect to x_j: for small changes
Δx_j, the percentage change in E(y|x) is roughly (100β_j)Δx_j. If we replace x_j with
log(x_j), β_j is the elasticity of E(y|x) with respect to x_j. Using assumption (19.5) as
the model for E(y|x) is analogous to using log(y) as the dependent variable in linear
regression analysis.
Quadratic terms can be added with no additional effort, except in interpreting the
parameters. In what follows, we will write the exponential function as in assumption
(19.5), leaving transformations of x (such as logs, quadratics, interaction terms, and
so on) implicit. See Wooldridge (1997c) for a discussion of other functional forms.
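As a concrete illustration, the following Python sketch simulates data and computes the Poisson QMLE with an exponential mean by Newton's method applied to the log likelihood (19.4); the data-generating process and the variable names (educ, beta_true, and so on) are purely illustrative assumptions, not part of the text. The semielasticity interpretation of the slope is printed at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5000
educ = rng.normal(12, 2, N)                    # hypothetical regressor
X = np.column_stack([np.ones(N), educ])        # x contains unity as first element
beta_true = np.array([0.5, -0.02])
y = rng.poisson(np.exp(X @ beta_true))         # count response

def poisson_qmle(y, X, tol=1e-10, max_iter=100):
    """Newton-Raphson for the Poisson quasi-log likelihood
    l_i(b) = y_i * x_i b - exp(x_i b)."""
    b = np.zeros(X.shape[1])
    for _ in range(max_iter):
        m = np.exp(X @ b)                      # m(x_i; b)
        score = X.T @ (y - m)                  # sum of scores s_i(b)
        info = (X * m[:, None]).T @ X          # minus the summed Hessian
        step = np.linalg.solve(info, score)
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    return b

beta_hat = poisson_qmle(y, X)
# 100*beta_hat[1] is the estimated percentage change in E(y|x)
# from one more unit of educ (the semielasticity).
print(beta_hat, 100 * beta_hat[1])
```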
19.2.2 Consistency of the Poisson QMLE
Once we have specified a conditional mean function, we are interested in cases where,
other than the conditional mean, the Poisson distribution can be arbitrarily
misspecified (subject to regularity conditions). When y_i given x_i does not have a
Poisson distribution, we call the estimator β̂ that solves

max_{β∈B} Σ_{i=1}^{N} l_i(β)   (19.6)

the Poisson quasi-maximum likelihood estimator (QMLE). A careful discussion of
the consistency of the Poisson QMLE requires introduction of the true value of the
parameter, as in Chapters 12 and 13. That is, we assume that for some value β_o in the
parameter space B,

E(y|x) = m(x; β_o)   (19.7)
To prove consistency of the Poisson QMLE under assumption (19.7), the key is to
show that β_o is the unique solution to

max_{β∈B} E[l_i(β)]   (19.8)

Then, under the regularity conditions listed in Theorem 12.2, it follows from this
theorem that the solution to equation (19.6) is weakly consistent for β_o.
Wooldridge (1997c) provides a simple proof that β_o is a solution to equation (19.8)
when assumption (19.7) holds (see also Problem 19.1). It also follows from the general
results on quasi-MLE in the linear exponential family (LEF) by Gourieroux, Monfort,
and Trognon (1984a) (hereafter, GMT, 1984a). Uniqueness of β_o must be assumed
separately, as it depends on the distribution of x_i. That is, in addition to assumption
(19.7), identification of β_o requires some restrictions on the distribution of
explanatory variables, and these depend on the nature of the regression function m. In
the linear regression case, we require full rank of E(x_i′x_i). For the Poisson QMLE
with an exponential regression function exp(xβ), it can be shown that multiple
solutions to equation (19.8) exist whenever there is perfect multicollinearity in x_i,
just as in the linear regression case. If we rule out perfect multicollinearity, we can
usually conclude that β_o is identified under assumption (19.7).
It is important to remember that consistency of the Poisson QMLE does not require
any additional assumptions concerning the distribution of y_i given x_i. In particular,
Var(y_i|x_i) can be virtually anything (subject to regularity conditions needed to
apply the results of Chapter 12).
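The robustness result is easy to illustrate by simulation. The sketch below (the data-generating process and all names are illustrative assumptions) draws y from an overdispersed gamma-Poisson mixture, so the Poisson distribution is wrong, while keeping E(y|x) = exp(xβ) correct; the Poisson QMLE still recovers β.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
N = 100_000
x = rng.normal(size=N)
X = sm.add_constant(x)
beta_true = np.array([0.2, 0.5])
mean = np.exp(X @ beta_true)

# y|c ~ Poisson(c*mean) with c ~ Gamma, E(c) = 1: y is overdispersed,
# so the Poisson distribution is misspecified, but E(y|x) is not.
c = rng.gamma(shape=2.0, scale=0.5, size=N)
y = rng.poisson(c * mean)

# Poisson QMLE: consistent for beta under (19.7) despite the
# distributional misspecification.
res = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(res.params)   # should be close to (0.2, 0.5)
```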
19.2.3 Asymptotic Normality of the Poisson QMLE
If the Poisson QMLE is consistent for β_o without any assumptions beyond (19.7),
why did we introduce assumptions (19.2) and (19.3)? It turns out that whether these
assumptions hold determines which asymptotic variance matrix estimators and
inference procedures are valid, as we now show.
The asymptotic normality of the Poisson QMLE follows from Theorem 12.3. The
result is

√N (β̂ − β_o) →d Normal(0, A_o⁻¹ B_o A_o⁻¹)   (19.9)

where

A_o ≡ E[−H_i(β_o)]   (19.10)

and

B_o ≡ E[s_i(β_o) s_i(β_o)′] = Var[s_i(β_o)]   (19.11)

where we define A_o in terms of minus the Hessian because the Poisson QMLE solves
a maximization rather than a minimization problem. Taking the gradient of equation
(19.4) and transposing gives the score for observation i as

s_i(β) = ∇_β m(x_i; β)′ [y_i − m(x_i; β)]/m(x_i; β)   (19.12)
It is easily seen that, under assumption (19.7), s_i(β_o) has a zero mean conditional on
x_i. The Hessian is more complicated but, under assumption (19.7), it can be shown
that

−E[H_i(β_o)|x_i] = ∇_β m(x_i; β_o)′ ∇_β m(x_i; β_o)/m(x_i; β_o)   (19.13)
Then A_o is the expected value of this expression (over the distribution of x_i). A fully
robust asymptotic variance matrix estimator for β̂ follows from equation (12.49):

( Σ_{i=1}^{N} Â_i )⁻¹ ( Σ_{i=1}^{N} ŝ_i ŝ_i′ ) ( Σ_{i=1}^{N} Â_i )⁻¹   (19.14)

where ŝ_i is obtained from equation (19.12) with β̂ in place of β, and Â_i is the
right-hand side of equation (19.13) with β̂ in place of β_o. This is the fully robust
variance matrix estimator in the sense that it requires only assumption (19.7) and the
regularity conditions from Chapter 12.
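For the exponential mean, ŝ_i = x_i′û_i and Â_i = m̂_i x_i′x_i, so expression (19.14) reduces to a standard sandwich that takes only a few lines to compute. A minimal sketch, assuming arrays y, X, and beta_hat from a Poisson QMLE fit such as the one sketched earlier:

```python
import numpy as np

def robust_avar(y, X, beta_hat):
    """Fully robust Avar estimate (19.14) for the Poisson QMLE with
    exponential mean m(x; b) = exp(xb)."""
    m = np.exp(X @ beta_hat)                   # fitted means m_hat_i
    u = y - m                                  # residuals u_hat_i
    A = (X * m[:, None]).T @ X                 # sum_i A_hat_i = sum_i m_i x_i'x_i
    S = (X * u[:, None]).T @ (X * u[:, None])  # sum_i s_hat_i s_hat_i'
    Ainv = np.linalg.inv(A)
    return Ainv @ S @ Ainv                     # sandwich form of (19.14)

# fully robust standard errors:
# np.sqrt(np.diag(robust_avar(y, X, beta_hat)))
```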
The asymptotic variance of β̂ simplifies under the GLM assumption (19.3).
Maintaining assumption (19.3) (where σ_o² now denotes the true value of σ²) and
defining u_i ≡ y_i − m(x_i; β_o), the law of iterated expectations implies that

B_o = E[u_i² ∇_β m_i(β_o)′ ∇_β m_i(β_o)/{m_i(β_o)}²]
    = E[E(u_i²|x_i) ∇_β m_i(β_o)′ ∇_β m_i(β_o)/{m_i(β_o)}²] = σ_o² A_o

since E(u_i²|x_i) = σ_o² m_i(β_o) under assumptions (19.3) and (19.7). Therefore,
A_o⁻¹ B_o A_o⁻¹ = σ_o² A_o⁻¹, so we only need to estimate σ_o² in addition to
obtaining Â. A consistent estimator of σ_o² is obtained from σ_o² = E[u_i²/m_i(β_o)],
which follows from assumption (19.3) and iterated expectations. The usual analogy
principle argument gives the estimator

σ̂² = N⁻¹ Σ_{i=1}^{N} û_i²/m̂_i = N⁻¹ Σ_{i=1}^{N} (û_i/√m̂_i)²   (19.15)

The last representation shows that σ̂² is simply the average sum of squared weighted
residuals, where the weights are the inverse of the estimated nominal standard
deviations. (In the GLM literature, the weighted residuals ũ_i ≡ û_i/√m̂_i are
sometimes called the Pearson residuals. In earlier chapters we also called them
standardized residuals.) In the GLM literature, a degrees-of-freedom adjustment is
usually made by replacing N⁻¹ with (N − P)⁻¹ in equation (19.15).
Given σ̂² and Â, it is straightforward to obtain an estimate of Avar(β̂) under
assumption (19.3). In fact, we can write

Avâr(β̂) = σ̂² Â⁻¹/N = σ̂² ( Σ_{i=1}^{N} ∇_β m̂_i′ ∇_β m̂_i/m̂_i )⁻¹   (19.16)

Note that the matrix is always positive definite when the inverse exists, so it produces
well-defined standard errors (given, as usual, by the square roots of the diagonal
elements). We call these the GLM standard errors.
If the Poisson variance assumption (19.2) holds, things are even easier because σ² is
known to be unity; the estimated asymptotic variance of β̂ is given in equation
(19.16) but with σ̂² ≡ 1. The same estimator can be derived from the MLE theory in
Chapter 13 as the inverse of the estimated information matrix (conditional on the
x_i); see Section 13.5.2.
Under assumption (19.3) in the case of overdispersion (σ² > 1), standard errors of the
β̂_j obtained from equation (19.16) with σ̂² = 1 will systematically underestimate the
asymptotic standard deviations, sometimes by a large factor. For example, if σ² = 2,
the correct GLM standard errors are, in the limit, 41 percent larger than the
incorrect, nominal Poisson standard errors. It is common to see very significant
coefficients reported for Poisson regressions (a recent example is Model, 1993), but
we must interpret the standard errors with caution when they are obtained under
assumption (19.2). The GLM standard errors are easily obtained by multiplying the
Poisson standard errors by σ̂ ≡ √σ̂². The most robust standard errors are obtained
from expression (19.14), as these are valid under any conditional variance
assumption. In practice, it is a good idea to report the fully robust standard errors
along with the GLM standard errors and σ̂.
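A sketch of the GLM calculation for the exponential mean case (again assuming y, X, and beta_hat are available): σ̂² is computed from the Pearson residuals as in equation (19.15), and the nominal Poisson variance matrix (19.16) is scaled by it.

```python
import numpy as np

def glm_se(y, X, beta_hat, df_adjust=True):
    """GLM standard errors: sigma2_hat * A_hat^{-1}, as in (19.16)."""
    N, P = X.shape
    m = np.exp(X @ beta_hat)
    u = y - m
    denom = (N - P) if df_adjust else N        # d.o.f. adjustment
    sigma2 = np.sum(u**2 / m) / denom          # eq. (19.15)
    A = (X * m[:, None]).T @ X                 # sum_i grad'grad/m_i = sum_i m_i x_i'x_i
    avar = sigma2 * np.linalg.inv(A)
    return np.sqrt(np.diag(avar)), sigma2

# Equivalently, the GLM se's are sqrt(sigma2) times the nominal Poisson
# se's; sigma2 > 1 signals overdispersion, sigma2 < 1 underdispersion.
```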
If y given x has a Poisson distribution, it follows from the general efficiency of the
conditional MLE (see Section 14.5.2) that the Poisson QMLE is fully efficient in the
class of estimators that ignores information on the marginal distribution of x. A nice
property of the Poisson QMLE is that it retains some efficiency for certain departures
from the Poisson assumption. The efficiency results of GMT (1984a) can be applied
here: if the GLM assumption (19.3) holds for some σ² > 0, the Poisson QMLE is
efficient in the class of all QMLEs in the linear exponential family of distributions. In
particular, the Poisson QMLE is more efficient than the nonlinear least squares
estimator, as well as many other QMLEs in the LEF, some of which we cover in
Sections 19.3 and 19.4.

Wooldridge (1997c) gives an example of Poisson regression applied to an economic
model of crime, where the response variable is the number of arrests of a young man
living in California during 1986. Wooldridge finds overdispersion: σ̂ is either 1.228 or
1.172, depending on the functional form for the conditional mean. The following
example shows that underdispersion is possible.
Example 19.1 (Effects of Education on Fertility): We use the data in FERTIL2.RAW
to estimate the effects of education on women's fertility in Botswana. The response
variable, children, is the number of living children. We use a standard exponential
regression function, and the explanatory variables are years of schooling (educ), a
quadratic in age, and binary indicators for ever married, living in an urban area,
having electricity, and owning a television. The results are given in Table 19.1. A
linear regression model is also included, with the usual OLS standard errors. For
Poisson regression, the standard errors are the GLM standard errors. A total of 4,358
observations are used.

As expected, the signs of the coefficients agree in the linear and exponential models,
but their interpretations differ. For Poisson regression, the coefficient on educ implies
that another year of education reduces the expected number of children by about 2.2
percent, and the effect is very statistically significant. The linear model estimate
implies that another year of education reduces the expected number of children by
about .064. (So, if 100 women get another year of education, we estimate they will
have about six fewer children.)
Table 19.1
OLS and Poisson Estimates of a Fertility Equation
Dependent Variable: children

Independent Variable     Linear (OLS)      Exponential (Poisson QMLE)
educ                     −.0644 (.0063)    −.0217 (.0025)
age                       .272  (.017)      .337  (.009)
age²                     −.0019 (.0003)    −.0041 (.0001)
evermarr                  .682  (.052)      .315  (.021)
urban                    −.228  (.046)     −.086  (.019)
electric                 −.262  (.076)     −.121  (.034)
tv                       −.250  (.090)     −.145  (.041)
constant                 −3.394 (.245)     −5.375 (.141)
Log-likelihood value        —              −6,497.060
R-squared                 .590              .598
σ̂                        1.424             .867

(Standard errors in parentheses.)
The estimate of σ in the Poisson regression implies underdispersion: the variance is
less than the mean. (Incidentally, the σ̂'s for the linear and Poisson models are not
comparable.) One implication is that the GLM standard errors are actually less than
the corresponding Poisson MLE standard errors.

For the linear model, the R-squared is the usual one. For the exponential model, the
R-squared is computed as the squared correlation coefficient between children_i and
the fitted values exp(x_i β̂). The exponential regression function fits slightly better.
19.2.4 Hypothesis Testing
Classical hypothesis testing is fairly straightforward in a QMLE setting. Testing
hypotheses about individual parameters is easily carried out using asymptotic t sta-
tistics after computing the appropriate standard error, as we discussed in Section
19.2.3. Multiple hypotheses tests can be carried out using the Wald, quasi–likelihood
ratio, or score test. We covered these generally in Sections 12.6 and 13.6, and they
apply immediately to the Poisson QMLE.
The Wald statistic for testing nonlinear hypotheses is computed as in equation
(12.63), where V̂ is chosen appropriately depending on the degree of robustness
desired, with expression (19.14) being the most robust. The Wald statistic is
convenient for testing multiple exclusion restrictions in a robust fashion.
When the GLM assumption (19.3) holds, the quasi-likelihood ratio statistic can be
used. Let β̄ be the restricted estimator, where Q restrictions of the form c(β) = 0 have
been imposed. Let β̂ be the unrestricted QMLE. Let L(β) be the quasi-log likelihood
for the sample of size N, given in expression (19.6). Let σ̂² be given in equation
(19.15) (with or without the degrees-of-freedom adjustment), where the û_i are the
residuals from the unconstrained maximization. The QLR statistic,

QLR ≡ 2[L(β̂) − L(β̄)]/σ̂²   (19.17)

converges in distribution to χ²_Q under H_0, under the conditions laid out in Section
12.6.3. The division of the usual likelihood ratio statistic by σ̂² provides for some
degree of robustness. If we set σ̂² = 1, we obtain the usual LR statistic, which is valid
only under assumption (19.2). There is no usable quasi-LR statistic when the GLM
assumption (19.3) does not hold.
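Computing the QLR statistic requires only the two maximized quasi-log likelihoods and σ̂² from the unrestricted model. A minimal sketch for the exponential mean case, where the log(y!) term is dropped because it cancels in the difference; the restricted and unrestricted estimates are assumed to come from separate fits:

```python
import numpy as np
from scipy import stats

def poisson_quasi_loglik(y, X, b):
    """Quasi-log likelihood from (19.6), dropping the log(y!) term."""
    xb = X @ b
    return np.sum(y * xb - np.exp(xb))

def qlr_stat(y, X_unres, b_unres, X_res, b_res, Q):
    """QLR statistic (19.17); valid under the GLM assumption (19.3)."""
    m = np.exp(X_unres @ b_unres)
    sigma2 = np.mean((y - m) ** 2 / m)            # eq. (19.15), unrestricted fit
    L1 = poisson_quasi_loglik(y, X_unres, b_unres)
    L0 = poisson_quasi_loglik(y, X_res, b_res)
    qlr = 2.0 * (L1 - L0) / sigma2
    return qlr, stats.chi2.sf(qlr, Q)             # statistic and p-value
```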
The score test can also be used to test multiple hypotheses. In this case we estimate
only the restricted model. Partition β as (α′, γ′)′, where α is P_1 × 1 and γ is P_2 × 1,
and assume that the null hypothesis is

H_0: γ_o = γ̄   (19.18)

where γ̄ is a P_2 × 1 vector of specified constants (often, γ̄ = 0). Let β̄ be the estimator
of β obtained under the restriction γ = γ̄ [so β̄ ≡ (ᾱ′, γ̄′)′], and define quantities under
the restricted estimation as m̄_i ≡ m(x_i; β̄), ū_i ≡ y_i − m̄_i, and ∇_β m̄_i ≡
(∇_α m̄_i, ∇_γ m̄_i) ≡ ∇_β m(x_i; β̄). Now weight the residuals and gradient by the
inverse of the nominal Poisson standard deviation, estimated under the null, 1/√m̄_i:

ũ_i ≡ ū_i/√m̄_i,   ∇_β m̃_i ≡ ∇_β m̄_i/√m̄_i   (19.19)

so that the ũ_i here are the Pearson residuals obtained under the null. A form of the
score statistic that is valid under the GLM assumption (19.3) [and therefore under
assumption (19.2)] is NR²_u from the regression

ũ_i on ∇_β m̃_i,   i = 1, 2, …, N   (19.20)

where R²_u denotes the uncentered R-squared. Under H_0 and assumption (19.3),
NR²_u is distributed asymptotically as χ²_{P_2}. This is identical to the score statistic
in equation (12.68) but where we use B̃ = σ̃²Ã, where the notation is
self-explanatory. For more, see Wooldridge (1991a, 1997c).
Following our development for nonlinear regression in Section 12.6.2, it is easy to
obtain a test that is completely robust to variance misspecification. Let r̃_i denote the
1 × P_2 residuals from the regression

∇_γ m̃_i on ∇_α m̃_i   (19.21)

In other words, regress each element of the weighted gradient with respect to the
restricted parameters on the weighted gradient with respect to the unrestricted
parameters. The residuals are put into the 1 × P_2 vector r̃_i. The robust score
statistic is obtained as N − SSR from the regression

1 on ũ_i r̃_i,   i = 1, 2, …, N   (19.22)

where ũ_i r̃_i = (ũ_i r̃_i1, ũ_i r̃_i2, …, ũ_i r̃_iP_2) is a 1 × P_2 vector.
As an example, consider testing H_0: γ = 0 in the exponential model E(y|x) =
exp(xβ) = exp(x_1 α + x_2 γ). Then ∇_β m(x; β) = exp(xβ)x. Let ᾱ be the Poisson
QMLE obtained under γ = 0, and define m̄_i ≡ exp(x_i1 ᾱ), with ū_i the residuals.
Now ∇_α m̄_i = exp(x_i1 ᾱ)x_i1, ∇_γ m̄_i = exp(x_i1 ᾱ)x_i2, and ∇_β m̃_i =
m̄_i x_i/√m̄_i = √m̄_i x_i. Therefore, the test that is valid under the GLM variance
assumption is NR²_u from the OLS regression ũ_i on √m̄_i x_i, where the ũ_i are the
weighted residuals. For the robust test, first obtain the 1 × P_2 residuals r̃_i from the
regression √m̄_i x_i2 on √m̄_i x_i1; then obtain the statistic from regression (19.22).
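This example translates directly into code. The sketch below assumes a restricted Poisson fit alpha_bar on regressors X1, with X2 the regressors being tested; it forms the weighted residuals and gradients in (19.19), partials out the weighted gradient as in (19.21), and returns N − SSR from regression (19.22).

```python
import numpy as np

def robust_score_test(y, X1, X2, alpha_bar):
    """Variance-misspecification-robust score test of H0: gamma = 0
    in E(y|x) = exp(x1*alpha + x2*gamma), following (19.21)-(19.22)."""
    m = np.exp(X1 @ alpha_bar)                 # restricted fitted means
    w = np.sqrt(m)
    u_t = (y - m) / w                          # Pearson residuals under H0
    G1 = X1 * w[:, None]                       # weighted gradient, alpha part
    G2 = X2 * w[:, None]                       # weighted gradient, gamma part
    # residuals r_tilde from regressing G2 on G1, eq. (19.21)
    coef, *_ = np.linalg.lstsq(G1, G2, rcond=None)
    R = G2 - G1 @ coef
    # regression of 1 on u_tilde*r_tilde, eq. (19.22): N - SSR ~ chi2(P2)
    Z = R * u_t[:, None]
    ones = np.ones(len(y))
    b, *_ = np.linalg.lstsq(Z, ones, rcond=None)
    ssr = np.sum((ones - Z @ b) ** 2)
    return len(y) - ssr                        # compare to chi2(X2.shape[1])
```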
19.2.5 Specification Testing
Various specification tests have been proposed in the context of Poisson regression.
The two most important kinds are conditional mean specification tests and
conditional variance specification tests. For conditional mean tests, we usually begin
with a fairly simple model whose parameters are easy to interpret, such as m(x; β) =
exp(xβ), and then test this against other alternatives. Once the set of conditioning
variables x has been specified, all such tests are functional form tests.
A useful class of functional form tests can be obtained using the score principle,
where the null model m(x; β) is nested in a more general model. Fully robust tests
and less robust tests are obtained exactly as in the previous section. Wooldridge
(1997c, Section 3.5) contains details and some examples, including an extension of
RESET to exponential regression models.

Conditional variance tests are more difficult to compute, especially if we want to
maintain only that the first two moments are correctly specified under H_0. For
example, it is very natural to test the GLM assumption (19.3) as a way of determining
whether the Poisson QMLE is efficient in the class of estimators using only
assumption (19.7). Cameron and Trivedi (1986) propose tests of the stronger
assumption (19.2) and, in fact, take the null to be that the Poisson distribution is
correct in its entirety. These tests are useful if we are interested in whether y given x
truly has a Poisson distribution. However, assumption (19.2) is not necessary for
consistency or relative efficiency of the Poisson QMLE.
Wooldridge (1991b) proposes fully robust tests of conditional variances in the context
of the linear exponential family, which contains Poisson regression as a special case.
To test assumption (19.3), write u_i = y_i − m(x_i; β_o) and note that, under
assumptions (19.3) and (19.7), u_i² − σ_o² m(x_i; β_o) is uncorrelated with any
function of x_i. Let h(x_i; β) be a 1 × Q vector of functions of x_i and β, and consider
the alternative model

E(u_i²|x_i) = σ_o² m(x_i; β_o) + h(x_i; β_o)δ_o   (19.23)

For example, the elements of h(x_i; β) can be powers of m(x_i; β). Popular choices are
unity and {m(x_i; β)}². A test of H_0: δ_o = 0 is then a test of the GLM assumption.
While there are several moment conditions that can be used, a fruitful one is to use
the weighted residuals, as we did with the conditional mean tests. We base the test on

N⁻¹ Σ_{i=1}^{N} (ĥ_i/m̂_i)′{(û_i² − σ̂²m̂_i)/m̂_i} = N⁻¹ Σ_{i=1}^{N} h̃_i′(ũ_i² − σ̂²)   (19.24)

where h̃_i = ĥ_i/m̂_i and ũ_i = û_i/√m̂_i. (Note that ĥ_i is weighted by 1/m̂_i, not
1/√m̂_i.)
To turn this equation into a test statistic, we must confront the fact that its
standardized limiting distribution depends on the limiting distributions of
√N(β̂ − β_o) and √N(σ̂² − σ_o²). To handle this problem, we use a trick suggested by
Wooldridge (1991b) that removes the dependence of the limiting distribution of the
test statistic on that of √N(σ̂² − σ_o²): replace h̃_i in equation (19.24) with its
demeaned counterpart, r̃_i ≡ h̃_i − h̄, where h̄ is just the 1 × Q vector of sample
averages of each element of h̃_i. There is an additional purging that then leads to a
simple regression-based statistic. Let ∇_β m̂_i be the unweighted gradient of the
conditional mean function, evaluated at the Poisson QMLE β̂, and define ∇_β m̃_i ≡
∇_β m̂_i/√m̂_i, as before. The following steps come from Wooldridge (1991b,
Procedure 4.1):
1. Obtain σ̂² as in equation (19.15) and Â as in equation (19.16), and define the
P × Q matrix Ĵ = σ̂²(N⁻¹ Σ_{i=1}^{N} ∇_β m̂_i′ r̃_i/m̂_i).

2. For each i, define the 1 × Q vector

ẑ_i ≡ (ũ_i² − σ̂²)r̃_i − ŝ_i′Â⁻¹Ĵ   (19.25)

where ŝ_i ≡ ∇_β m̃_i′ũ_i is the Poisson score for observation i.

3. Run the regression

1 on ẑ_i,   i = 1, 2, …, N   (19.26)

Under assumptions (19.3) and (19.7), N − SSR from this regression is distributed
asymptotically as χ²_Q.
The leading case occurs when m̂_i = exp(x_i β̂) and ∇_β m̂_i = exp(x_i β̂)x_i = m̂_i x_i.
The subtraction of ŝ_i′Â⁻¹Ĵ in equation (19.25) is a simple way of handling the fact
that the limiting distribution of √N(β̂ − β_o) affects the limiting distribution of the
unadjusted statistic in equation (19.24). This particular adjustment ensures that the
tests are just as efficient as any maximum-likelihood-based statistic if σ_o² = 1 and the
Poisson assumption is correct. But this procedure is fully robust in the sense that only
assumptions (19.3) and (19.7) are maintained under H_0. For further discussion the
reader is referred to Wooldridge (1991b).
In practice, it is probably sufficient to choose the number of elements in Q to be
small. Setting ĥ_i = (1, m̂_i²), so that h̃_i = (1/m̂_i, m̂_i), is likely to produce a fairly
powerful two-degrees-of-freedom test against a fairly broad class of alternatives.
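A sketch of this two-degrees-of-freedom test for the exponential mean case, following steps 1 to 3 with ĥ_i = (1, m̂_i²); the inputs y, X, and beta_hat from a Poisson fit are assumed:

```python
import numpy as np

def glm_variance_test(y, X, beta_hat):
    """Test of the GLM assumption (19.3) following Wooldridge (1991b,
    Procedure 4.1) with h_i = (1, m_i^2), so h_tilde_i = (1/m_i, m_i); Q = 2."""
    N = len(y)
    m = np.exp(X @ beta_hat)
    u_t = (y - m) / np.sqrt(m)                  # Pearson residuals
    sigma2 = np.mean(u_t ** 2)                  # eq. (19.15)
    G = X * m[:, None]                          # unweighted gradient exp(xb)x
    Gt = G / np.sqrt(m)[:, None]                # weighted gradient
    A = (Gt.T @ Gt) / N                         # A_hat as in (19.16)
    h_t = np.column_stack([1.0 / m, m])         # h_tilde_i
    r_t = h_t - h_t.mean(axis=0)                # demeaned: r_tilde_i
    J = sigma2 * (G / m[:, None]).T @ r_t / N   # P x Q matrix J_hat (step 1)
    s = Gt * u_t[:, None]                       # scores s_hat_i (rows)
    z = (u_t**2 - sigma2)[:, None] * r_t - s @ np.linalg.solve(A, J)  # (19.25)
    ones = np.ones(N)
    b, *_ = np.linalg.lstsq(z, ones, rcond=None)
    ssr = np.sum((ones - z @ b) ** 2)
    return N - ssr                              # asymptotically chi2(2) under H0
```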
The procedure is easily modified to test the more restrictive assumption (19.2). First,
replace σ̂² everywhere with unity. Second, there is no need to demean the auxiliary
regressors h̃_i (so that now h̃_i can contain a constant); thus, wherever r̃_i appears,
simply use h̃_i. Everything else is the same. For the reasons discussed earlier, when
the focus is on E(y|x), we are more interested in testing assumption (19.3) than
assumption (19.2).
19.3 Other Count Data Regression Models
19.3.1 Negative Binomial Regression Models
The Poisson regression model nominally maintains assumption (19.2) but retains
some asymptotic efficiency under assumption (19.3). A popular alternative to the
Poisson QMLE is full maximum likelihood analysis of the NegBin I model of
Cameron and Trivedi (1986). NegBin I is a particular parameterization of the
negative binomial distribution. An important restriction in the NegBin I model is that
it implies assumption (19.3) with σ² > 1, so that there cannot be underdispersion.
(We drop the "o" subscript in this section for notational simplicity.) Typically,
NegBin I is parameterized through the mean parameters β and an additional
parameter, η² > 0, where σ² = 1 + η². On the one hand, when β and η² are estimated
jointly, the maximum likelihood estimators are generally inconsistent if the NegBin I
assumption fails. On the other hand, if the NegBin I distribution holds, then the
NegBin I MLE is more efficient than the Poisson QMLE (this conclusion follows from
Section 14.5.2). Still, under assumption (19.3), the Poisson QMLE is more efficient
than an estimator that requires only the conditional mean to be correctly specified for
consistency. On balance, because of its robustness, the Poisson QMLE has the edge
over NegBin I for estimating the parameters of the conditional mean. If conditional
probabilities need to be estimated, then a more flexible model is probably warranted.
Other count data distributions imply a conditional variance other than assumption
(19.3). A leading example is the NegBin II model of Cameron and Trivedi (1986).
The NegBin II model can be derived from a model of unobserved heterogeneity in a
Poisson model. Specifically, let c_i > 0 be unobserved heterogeneity, and assume that

y_i | x_i, c_i ~ Poisson[c_i m(x_i; β)]

If we further assume that c_i is independent of x_i and has a gamma distribution with
unit mean and Var(c_i) = η², then the distribution of y_i given x_i can be shown to be
negative binomial, with conditional mean and variance

E(y_i|x_i) = m(x_i; β)   (19.27)

Var(y_i|x_i) = E[Var(y_i|x_i, c_i)|x_i] + Var[E(y_i|x_i, c_i)|x_i]
            = m(x_i; β) + η²[m(x_i; β)]²   (19.28)
so that the conditional variance of y_i given x_i is a quadratic in the conditional
mean. Because we can write equation (19.28) as E(y_i|x_i)[1 + η²E(y_i|x_i)], NegBin
II also implies overdispersion, but where the amount of overdispersion increases with
E(y_i|x_i).
The log-likelihood function for observation i is

l_i(β, η²) = η⁻² log[η⁻²/(η⁻² + m(x_i; β))] + y_i log[m(x_i; β)/(η⁻² + m(x_i; β))]
             + log[Γ(y_i + η⁻²)/Γ(η⁻²)]   (19.29)

where Γ(·) is the gamma function defined for r > 0 by Γ(r) = ∫_0^∞ z^{r−1}exp(−z)dz.
You are referred to Cameron and Trivedi (1986) for details. The parameters β and η²
can be jointly estimated using standard maximum likelihood methods.
It turns out that, for fixed η², the log likelihood in equation (19.29) is in the linear
exponential family; see GMT (1984a). Therefore, if we fix η² at any positive value,
say η̄², and estimate β by maximizing Σ_{i=1}^{N} l_i(β, η̄²) with respect to β, then
the resulting QMLE is consistent under the conditional mean assumption (19.27)
only: for fixed η², the negative binomial QMLE has the same robustness properties as
the Poisson QMLE. (Notice that when η² is fixed, the term involving the gamma
function in equation (19.29) does not affect the QMLE.)
The structure of the asymptotic variance estimators and test statistics is very similar
to the Poisson regression case. Let

v̂_i = m̂_i + η̄²m̂_i²   (19.30)

be the estimated nominal variance for the given value η̄². We simply weight the
residuals û_i and gradient ∇_β m̂_i by 1/√v̂_i:

ũ_i = û_i/√v̂_i,   ∇_β m̃_i = ∇_β m̂_i/√v̂_i   (19.31)
For example, under conditions (19.27) and (19.28), a valid estimator of Avar(β̂) is

( Σ_{i=1}^{N} ∇_β m̂_i′ ∇_β m̂_i/v̂_i )⁻¹

If we drop condition (19.28), the estimator in expression (19.14) should be used but
with the standardized residuals and gradients given by equation (19.31). Score
statistics are modified in the same way.
When η² is set to unity, we obtain the geometric QMLE. A better approach is to
replace η² by a first-stage estimate, say η̂², and then estimate β by two-step QMLE.
As we discussed in Chapters 12 and 13, sometimes the asymptotic distribution of the
first-stage estimator needs to be taken into account. A nice feature of the two-step
QMLE in this context is that the key condition, assumption (12.37), can be shown to
hold under assumption (19.27). Therefore, we can ignore the first-stage estimation of
η².
Under assumption (19.28), a consistent estimator of η² is easy to obtain, given an
initial estimator of β (such as the Poisson QMLE or the geometric QMLE). Given the
initial estimator β̌, form m̌_i and ǔ_i as the usual fitted values and residuals. One
consistent estimator of η² is the coefficient on m̌_i² in the regression (through the
origin) of ǔ_i² − m̌_i on m̌_i²; this is the estimator suggested by Gourieroux, Monfort,
and Trognon (1984b) and Cameron and Trivedi (1986). An alternative estimator of
η², which is closely related to the GLM estimator of σ² suggested in equation (19.15),
is a weighted least squares estimate, which can be obtained from the OLS regression
(ǔ_i/√m̌_i)² − 1 on m̌_i. The resulting two-step estimator of β is consistent under
assumption (19.7) only, so it is just as robust as the Poisson QMLE. It makes sense to
use fully robust standard errors and test statistics. If assumption (19.3) holds, the
Poisson QMLE is asymptotically more efficient; if assumption (19.28) holds, the
two-step negative binomial estimator is more efficient. Notice that neither variance
assumption contains the other as a special case for all parameter values; see
Wooldridge (1997c) for additional discussion.
The variance specification tests discussed in Section 19.2.5 can be extended to the
negative binomial QMLE; see Wooldridge (1991b).
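A sketch of the two-step negative binomial approach under assumption (19.28), with an exponential mean assumed: the first step estimates η² by the GMT/Cameron-Trivedi auxiliary regression, and the second step solves the LEF first-order conditions with nominal variance (19.30) by Fisher scoring. The initial estimate beta0 (for example, the Poisson QMLE) is assumed given.

```python
import numpy as np

def negbin2_eta2(y, X, beta0):
    """First step: estimate eta^2 by regressing (u^2 - m) on m^2
    through the origin, using an initial estimate beta0."""
    m = np.exp(X @ beta0)
    u = y - m
    return np.sum(m**2 * (u**2 - m)) / np.sum(m**4)

def negbin2_qmle(y, X, beta0, n_iter=50):
    """Second step: QMLE with nominal variance v = m + eta2*m^2; the
    first-order condition sum_i grad_i'(y_i - m_i)/v_i = 0 is solved
    by Fisher scoring."""
    eta2 = negbin2_eta2(y, X, beta0)
    b = beta0.copy()
    for _ in range(n_iter):
        m = np.exp(X @ b)
        v = m + eta2 * m**2                    # eq. (19.30)
        grad = X * m[:, None]                  # gradient of exp(xb)
        score = grad.T @ ((y - m) / v)
        A = (grad / v[:, None]).T @ grad       # sum grad'grad/v
        b = b + np.linalg.solve(A, score)
    return b, eta2
```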
19.3.2 Binomial Regression Models
Sometimes we wish to analyze count data conditional on a known upper bound. For
example, Thomas, Strauss, and Henriques (1990) study child mortality within
families conditional on the number of children ever born. Another example takes the
dependent variable, y_i, to be the number of adult children in family i who are high
school graduates; the known upper bound, n_i, is the number of children in family i.
By conditioning on n_i we are, presumably, treating it as exogenous.
Let x_i be a set of exogenous variables. A natural starting point is to assume that y_i
given (n_i, x_i) has a binomial distribution, denoted Binomial[n_i, p(x_i; β)], where
p(x_i; β) is a function bounded between zero and one. Usually, y_i is viewed as the
sum of n_i independent Bernoulli (zero-one) random variables, and p(x_i; β) is the
(conditional) probability of success on each trial.
The binomial assumption is too restrictive for all applications. The presence of an
unobserved effect would invalidate the binomial assumption (after the effect is
integrated out). For example, when y_i is the number of children in a family
graduating from high school, unobserved family effects may play an important role.
As in the case of unbounded support, we assume that the conditional mean is
correctly specified:

E(y_i|x_i, n_i) = n_i p(x_i; β) ≡ m_i(β)   (19.32)

This formulation ensures that E(y_i|x_i, n_i) is between zero and n_i. Typically,
p(x_i; β) = G(x_i β), where G(·) is a cumulative distribution function, such as the
standard normal or logistic function.
Given a parametric model p(x; β), the binomial quasi-log likelihood for observation i
is

l_i(β) = y_i log[p(x_i; β)] + (n_i − y_i) log[1 − p(x_i; β)]   (19.33)

and the binomial QMLE is obtained by maximizing the sum of l_i(β) over all N
observations. From the results of GMT (1984a), the conditional mean parameters are
consistently estimated under assumption (19.32) only. This conclusion follows from
the general M-estimation results after showing that the true value of β maximizes the
expected value of equation (19.33) under assumption (19.32) only.
The binomial GLM variance assumption is

Var(y_i|x_i, n_i) = σ² n_i p(x_i; β)[1 − p(x_i; β)] = σ² v_i(β)   (19.34)

which generalizes the nominal binomial assumption with σ² = 1. [McCullagh and
Nelder (1989, Section 4.5) discuss a model that leads to assumption (19.34) with
σ² > 1. But underdispersion is also possible.] Even the GLM assumption can fail if
the binary outcomes comprising y_i are not independent conditional on (x_i, n_i).
Therefore, it makes sense to use the fully robust asymptotic variance estimator for the
binomial QMLE.
Owing to the structure of LEF densities, and given our earlier analysis of the Poisson
and negative binomial cases, it is straightforward to describe the econometric
analysis for the binomial QMLE: simply take m̂_i ≡ n_i p(x_i; β̂), û_i ≡ y_i − m̂_i,
∇_β m̂_i ≡ n_i ∇_β p̂_i, and v̂_i ≡ n_i p̂_i(1 − p̂_i) in equations (19.31). An estimator
of σ² under assumption (19.34) is also easily obtained: replace m̂_i in equation
(19.15) with v̂_i. The structure of asymptotic variances and score tests is identical.
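A sketch of the binomial QMLE using statsmodels, which accepts the response as a (successes, failures) pair; the simulated data and all names are illustrative. One fit requests fully robust (sandwich) standard errors and the other the GLM variance with a Pearson-based σ̂², on the assumption that the cov_type and scale options shown are available in the installed statsmodels version.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
N = 2000
x = rng.normal(size=N)
X = sm.add_constant(x)
n = rng.integers(1, 6, size=N)                 # known upper bounds n_i
p = 1.0 / (1.0 + np.exp(-(0.3 + 0.4 * x)))     # logistic p(x; beta)
y = rng.binomial(n, p)                         # y_i given (n_i, x_i)

# Binomial QMLE of beta: endog supplied as (successes, failures)
endog = np.column_stack([y, n - y])
res_robust = sm.GLM(endog, X, family=sm.families.Binomial()).fit(cov_type='HC0')
res_glm = sm.GLM(endog, X, family=sm.families.Binomial()).fit(scale='X2')
print(res_robust.params, res_robust.bse)       # fully robust SEs
print(res_glm.scale)                           # sigma2_hat under (19.34)
```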
19.4 Other QMLEs in the Linear Exponential Family
Sometimes we want to use a quasi-MLE analysis for other kinds of response vari-
ables. We will consider two here. The exponential regression model is well suited to
strictly positive, roughly continuous responses. Fractional logit regression can be used
when the response variable takes on values in the unit interval.
19.4.1 Exponential Regression Models
Just as in the Poisson regression model, in an exponential regression model we
specify a conditional mean function, m(x; β). However, we now use the exponential
quasi-log likelihood function, l_i(β) = −y_i/m(x_i; β) − log[m(x_i; β)]. [The
"exponential" in "exponential regression model" refers to the quasi-likelihood used,
and not to the mean function m(x; β).] The most popular choice of m(x; β) happens
to be exp(xβ). The results of GMT (1984a) imply that, provided the conditional mean
is correctly specified, the exponential QMLE consistently estimates the conditional
mean parameters. Thus the exponential QMLE enjoys the same robustness properties
as the Poisson QMLE.

The GLM variance assumption for exponential regression is

Var(y|x) = σ²[E(y|x)]²   (19.35)

When σ² = 1, assumption (19.35) gives the variance-mean relationship for the
exponential distribution. Under assumption (19.35), σ is the coefficient of variation:
it is the ratio of the conditional standard deviation of y to its conditional mean.

Whether or not assumption (19.35) holds, an asymptotic variance matrix can be
estimated. The fully robust form is expression (19.14), but, in defining the score and
expected Hessian, the residuals and gradients are weighted by 1/m̂_i rather than
1/√m̂_i.
Under assumption (19.35), a valid estimator is

σ̂² ( Σ_{i=1}^{N} ∇_β m̂_i′ ∇_β m̂_i/v̂_i )⁻¹

where σ̂² = N⁻¹ Σ_{i=1}^{N} û_i²/m̂_i² and v̂_i = m̂_i². Score tests and
quasi-likelihood ratio tests can be computed just as in the Poisson case. Most
statistical packages implement exponential regression with an exponential mean
function; it is sometimes called the gamma regression model because the exponential
distribution is a special case of the gamma distribution.
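A sketch of the exponential (gamma) QMLE via statsmodels' Gamma family with a log link, again on simulated, illustrative data; the link class name (links.Log) is assumed to match a recent statsmodels version.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
N = 3000
x = rng.normal(size=N)
X = sm.add_constant(x)
mean = np.exp(1.0 + 0.5 * x)
y = rng.gamma(shape=2.0, scale=mean / 2.0)     # positive response, E(y|x) = mean

fam = sm.families.Gamma(link=sm.families.links.Log())
res = sm.GLM(y, X, family=fam).fit(scale='X2') # Pearson-based scale
print(res.params)                              # conditional mean parameters
print(res.scale)                               # estimate of sigma^2 in (19.35)
```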
19.4.2 Fractional Logit Regression
Quasi-likelihood methods are also available when y is a variable restricted to the unit
interval, [0, 1]. {By rescaling, we can cover the case where y is restricted to the
interval [a, b] for known constants a < b. The transformation is (y − a)/(b − a).}
Examples include the fraction of income contributed to charity, fraction of weekly
hours spent working, proportion of a firm's total capitalization accounted for by debt
capital, and high school graduation rates. In some cases, each y_i might be obtained
by dividing a count variable by an upper bound, n_i.
Given explanatory variables x, a linear model for E(y|x) has the same strengths and
weaknesses as the linear probability model for binary y. When y is strictly between
zero and one, a popular alternative is to assume that the log-odds transformation,
log[y/(1 − y)], has a conditional expectation of the form xβ. The motivation for
using log[y/(1 − y)] as a dependent variable in a linear model is that log[y/(1 − y)]
ranges over all real values as y ranges between zero and one. This approach leads to
estimation of β by OLS. Unfortunately, using the log-odds transformation has two
drawbacks. First, it cannot be used directly if y takes on the boundary values, zero
and one. While we can always use adjustments for the boundary values, such
adjustments are necessarily arbitrary. Second, even if y is strictly inside the unit
interval, β is difficult to interpret: without further assumptions, it is not possible to
recover an estimate of E(y|x), and with further assumptions, it is still nontrivial to
estimate E(y|x). See Papke and Wooldridge (1996) and Problem 19.8 for further
discussion.
An approach that avoids both these problems is to model E(y|x) as a logistic
function:

E(y|x) = exp(xβ)/[1 + exp(xβ)]   (19.36)

This model ensures that predicted values for y are in (0, 1) and that the effect of any
x_j on E(y|x) diminishes as xβ → ∞. Just as in the binary logit model, ∂E(y|x)/∂x_j
= β_j g(xβ), where g(z) = exp(z)/[1 + exp(z)]². In applications, the partial effects
should be evaluated at the β̂_j and interesting values of x. Plugging in the sample
averages, x̄, makes the partial effects from equation (19.36) roughly comparable to
the coefficients from a linear regression for E(y|x): γ̂_j ≈ β̂_j g(x̄β̂), where the γ̂_j
are the OLS estimates from the linear regression of y on x.
Given equation (19.36), one approach to estimating β is nonlinear least squares, as
we discussed in Chapter 12. However, the assumption that implies relative efficiency
of NLS, namely, Var(y|x) = σ², is unlikely to hold for fractional y. A method that is
just as robust [in the sense that it consistently estimates β under assumption (19.36)
only] is quasi-MLE, where the quasi-likelihood function is the binary choice log
likelihood. Therefore, the quasi-log likelihood for observation i is exactly as in
equation (15.17) [with G(·) the logistic function], although y_i can be any value in
[0, 1]. The mechanics of obtaining β̂ are identical to the binary response case.

Inference is complicated by the fact that the binary response density cannot be the
actual density of y given x. Generally, a fully robust variance matrix estimator and
test statistics should be obtained. These are gotten by applying the formulas for the
binomial case with n_i ≡ 1 and p(x; β) ≡ exp(xβ)/[1 + exp(xβ)]. The GLM
assumption for fractional logit regression is given in assumption (19.34) with n_i = 1.
See Papke and Wooldridge (1996) for more details, as well as suggestions for
specification tests and for an application to participation rates in 401(k) pension
plans.
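A sketch of the fractional logit estimator: the Bernoulli quasi-log likelihood applied to a fractional response, with fully robust standard errors. In statsmodels this amounts to the Binomial family with a continuous y in [0, 1] (some versions emit a harmless domain warning); the data below are simulated and illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
N = 4000
x = rng.normal(size=N)
X = sm.add_constant(x)
mean = 1.0 / (1.0 + np.exp(-(-0.5 + 0.8 * x)))           # logistic E(y|x), (19.36)
y = np.clip(mean + 0.1 * rng.normal(size=N), 0.0, 1.0)   # fractional response

# Fractional logit QMLE with fully robust standard errors
res = sm.GLM(y, X, family=sm.families.Binomial()).fit(cov_type='HC0')
print(res.params, res.bse)

# average partial effect of x: beta_hat * sample mean of g(x*beta_hat),
# where g(z) = exp(z)/[1 + exp(z)]^2 equals mu*(1 - mu) at the fit
g = res.fittedvalues * (1 - res.fittedvalues)
print(res.params[1] * g.mean())
```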
19.5 Endogeneity and Sample Selection with an Exponential Regression Function
With all of the previous models, standard econometric problems can arise. In this
section, we will study two of the problems when the regression function for y has an
exponential form: endogeneity of an explanatory variable and incidental truncation.
We follow the methods in Wooldridge (1997c), which are closely related to those
suggested by Terza (1998). Gurmu and Trivedi (1994) and the references therein dis-
cuss the problems of data censoring, truncation, and two-tier or hurdle models.
19.5.1 Endogeneity
We approach the problem of endogenous explanatory variables from an omitted
variables perspective. Let y_1 be the nonnegative, in principle unbounded variable to
be explained, and let z and y_2 be observable explanatory variables (of dimension
1 × L and 1 × G_1, respectively). Let c_1 be an unobserved latent variable (or
unobserved heterogeneity). We assume that the (structural) model of interest is an
omitted variables model of exponential form, written in the population as

E(y_1|z, y_2, c_1) = exp(z_1 δ_1 + y_2 γ_1 + c_1)   (19.37)

where z_1 is a 1 × L_1 subset of z containing unity; thus, the model (19.37)
incorporates some exclusion restrictions. On the one hand, the elements in z are
assumed to be exogenous in the sense that they are independent of c_1. On the other
hand, y_2 and c_1 are allowed to be correlated, so that y_2 is potentially endogenous.
To use a quasi-likelihood approach, we assume that y_2 has a linear reduced form
satisfying certain assumptions. Write

y_2 = zΠ_2 + v_2   (19.38)

where Π_2 is an L × G_1 matrix of reduced form parameters and v_2 is a 1 × G_1
vector of reduced form errors. We assume that the rank condition for identification
holds, which requires the order condition L − L_1 ≥ G_1. In addition, we assume
that (c_1, v_2) is independent of z, and that

c_1 = v_2 ρ_1 + e_1   (19.39)

where e_1 is independent of v_2 (and necessarily of z). (We could relax the
independence assumptions to some degree, but we cannot just assume that v_2 is
uncorrelated with z and that e_1 is uncorrelated with v_2.) It is natural to assume
that v_2 has zero mean, but it is convenient to assume that E[exp(e_1)] = 1 rather
than E(e_1) = 0. This assumption is without loss of generality whenever a constant
appears in z_1, which should almost always be the case.

If (c_1, v_2) has a multivariate normal distribution, then the representation in
equation (19.39) under the stated assumptions always holds. We could also extend
equation (19.39) by putting other functions of v_2 on the right-hand side, such as
squares and cross products, but we do not show these explicitly. Note that y_2 is
exogenous if and only if ρ_1 = 0.
Under the maintained assumptions, we have

E(y_1|z, y_2, v_2) = exp(z_1 δ_1 + y_2 γ_1 + v_2 ρ_1)   (19.40)

and this equation suggests a strategy for consistently estimating δ_1, γ_1, and ρ_1. If
v_2 were observed, we could simply use this regression function in one of the earlier
QMLE methods (for example, Poisson, two-step negative binomial, or exponential).
Because these methods consistently estimate correctly specified conditional means,
we can immediately conclude that the QMLEs would be consistent. [If y_1
conditional on (z, y_2, c_1) has a Poisson distribution with mean in equation (19.37),
then the distribution of y_1 given (z, y_2, v_2) has overdispersion of the type (19.28),
so the two-step negative binomial estimator might be preferred in this context.]
To operationalize this procedure, the unknown quantities v_2 must be replaced with
estimates. Let Π̂_2 be the L × G_1 matrix of OLS estimates from the first-stage
estimation of equation (19.38); these are consistent estimates of Π_2. Define v̂_2 =
y_2 − zΠ̂_2 (where the observation subscript is suppressed). Then estimate the
exponential regression model using regressors (z_1, y_2, v̂_2) by one of the QMLEs.
The estimates (δ̂_1, γ̂_1, ρ̂_1) from this procedure are consistent using standard
arguments from two-step estimation in Chapter 12.
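A sketch of this two-step control function procedure for a single endogenous regressor (G_1 = 1), on simulated, illustrative data: OLS reduced form first, then a Poisson QMLE including the first-stage residual. As discussed shortly, the second-step standard errors are valid as reported only under ρ_1 = 0; otherwise they should be adjusted for the first stage.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
N = 5000
z1 = rng.normal(size=N)                        # exogenous, included
z2 = rng.normal(size=N)                        # exogenous, excluded (instrument)
v2 = rng.normal(size=N)
c1 = 0.5 * v2 + rng.normal(scale=0.5, size=N)  # c1 correlated with v2
y2 = 0.5 * z1 + 0.7 * z2 + v2                  # endogenous regressor
y1 = rng.poisson(np.exp(0.1 + 0.3 * z1 - 0.2 * y2 + c1))

# Step 1: linear reduced form for y2 on all exogenous variables
Z = sm.add_constant(np.column_stack([z1, z2]))
v2_hat = y2 - Z @ sm.OLS(y2, Z).fit().params

# Step 2: Poisson QMLE of y1 on (z1, y2, v2_hat); the coefficient on
# v2_hat estimates rho_1, and its t statistic tests exogeneity of y2
X = sm.add_constant(np.column_stack([z1, y2, v2_hat]))
res = sm.GLM(y1, X, family=sm.families.Poisson()).fit(cov_type='HC0')
print(res.params, res.tvalues[-1])
```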
This method is similar in spirit to the methods we saw for binary response (Chapter
15) and censored regression models (Chapter 16). There is one difference: here, we do
not need to make distributional assumptions about y_1 or y_2. However, we do
assume that the reduced-form errors v_2 are independent of z. In addition, we assume
that c_1 and v_2 are linearly related with e_1 in equation (19.39) independent of v_2.
Later we will show how to relax these assumptions using a method of moments
approach.
Because v̂_2 depends on Π̂_2, the variance matrix estimators for δ̂_1, γ̂_1, and ρ̂_1
should generally be adjusted to account for this dependence, as described in Sections
12.5.2 and 14.1. Using the results from Section 12.5.2, it can be shown that
estimation of Π_2 does not affect the asymptotic variance of the QMLEs when
ρ_1 = 0, just as we saw when testing for endogeneity in probit and Tobit models.
Therefore, testing for endogeneity of y_2 is relatively straightforward: simply test
H_0: ρ_1 = 0 using a Wald or LM statistic. When G_1 = 1, the most convenient
statistic is probably the t statistic on v̂_2, with the fully robust form being the most
preferred (but the GLM form is also useful). The LM test for omitted variables is
convenient when G_1 > 1 because it can be computed after estimating the null model
(ρ_1 = 0) and then doing a variable addition test for v̂_2. The test has G_1 degrees of
freedom in the chi-square distribution.
There is a final comment worth making about this test. The null hypothesis is the
same as E(y_1|z, y_2) = exp(z_1 δ_1 + y_2 γ_1). The test for endogeneity of y_2
simply looks for whether a particular linear combination of y_2 and z appears in this
conditional expectation. For the purposes of getting a limiting chi-square
distribution, it does not matter where the linear combination v̂_2 comes from. In
other words, under the null hypothesis none of the assumptions we made about
(c_1, v_2) need to hold: v_2 need not be independent of z, and e_1 in equation (19.39)
need not be independent of v_2. Therefore, as a test, this procedure is very robust,
and it can be applied when y_2 contains binary, count, or other discrete variables.
Unfortunately, if y_2 is endogenous, the correction does not work without something
like the assumptions made previously.
Example 19.2 (Is Education Endogenous in the Fertility Equation?): We test for
endogeneity of educ in Example 19.1. The IV for educ is a binary indicator for
whether the woman was born in the first half of the year (frsthalf), which we assume
is exogenous in the fertility equation. In the reduced-form equation for educ, the
coefficient on frsthalf is −.636 (se = .104), and so there is a significant negative
partial relationship between years of schooling and being born in the first half of the
year. When we add the first-stage residuals, v̂_2, to the Poisson regression, its
coefficient is .025, and its GLM standard error is .028. Therefore, there is little
evidence against the null hypothesis that educ is exogenous. The coefficient on educ
actually becomes larger in magnitude (−.046), but it is much less precisely estimated.
Mullahy (1997) has shown how to estimate exponential models when some
explanatory variables are endogenous without making assumptions about the
reduced form of y_2. This approach is especially attractive for dummy endogenous
and other discrete explanatory variables, where the linearity in equation (19.39)
coupled with independence of z and v_2 is unrealistic. To sketch Mullahy's approach,
write x_1 = (z_1, y_2) and β_1 = (δ_1′, γ_1′)′. Then, under the model (19.37), we can
write

y_1 exp(−x_1 β_1) = exp(c_1)a_1,   E(a_1|z, y_2, c_1) = 1   (19.41)
If we assume that c_1 is independent of z (a standard assumption concerning
unobserved heterogeneity and exogenous variables) and use the normalization
E[exp(c_1)] = 1, we have the conditional moment restriction

E[y_1 exp(−x_1 β_1) | z] = 1   (19.42)

Because y_1, x_1, and z are all observable, condition (19.42) can be used as the basis
for generalized method of moments estimation. The function g(y_1, y_2, z_1; β_1) ≡
y_1 exp(−x_1 β_1) − 1, which depends on observable data and the parameters, is
uncorrelated with any function of z (at the true value of β_1). GMM estimation can
be used as in Section 14.2 once a vector of instrumental variables has been chosen.
An important feature of Mullahy's approach is that no assumptions, other than the
standard rank condition for identification in nonlinear models, are made about the
distribution of y_2 given z: we need not assume the existence of a linear reduced form
for y_2 with errors independent of z. Mullahy's procedure is computationally more
difficult, and testing for endogeneity in his framework is harder than in the QMLE
approach. Therefore, we might first use the two-step quasi-likelihood method
proposed earlier for testing, and if endogeneity seems to be important, Mullahy's
GMM estimator can be implemented. See Mullahy (1997) for details and an
empirical example.
19.5.2 Sample Selection
It is also possible to test and correct for sample selection in exponential regression
models. The case where selection is determined by the dependent variable being
above or below a known threshold requires full maximum likelihood methods using a
truncated count distribution; you are referred to the book by Cameron and Trivedi
(1998). Here, we assume that sample selection is related to an unobservable in the
population model

E(y_1|x, c_1) = exp(x_1 β_1 + c_1)   (19.43)

where x_1 is a 1 × K_1 vector of exogenous variables containing a constant, and c_1
is an unobserved random variable. The full set of exogenous variables is x, and c_1 is
independent of x. Therefore, if a random sample on (x_1, y_1) were available, β_1
could be consistently estimated by a Poisson regression of y_1 on x_1 (or by some
other QMLE) under the normalization E[exp(c_1)] = 1.
A sample selection problem arises when a random sample on (x_1, y_1) from the
relevant population is not available. Let y_2 denote a binary selection indicator,
which is unity if (x_1, y_1) is observed and zero otherwise. We assume that y_2 is
determined by y_2 = 1[x_2 δ_2 + v_2 > 0], where 1[·] is the indicator function, x_2 is
a subset of x (typically, x_2 = x), and v_2 is unobserved. This is a standard sample
selection mechanism, where y_2 and x_2 must be observable for all units in the
population.
In this setting, sample selection bias arises when v_2 is correlated with c_1. In
particular, if we write equation (19.43) with a multiplicative error, y_1 =
exp(x_1 β_1 + c_1)a_1, with E(a_1|x, c_1) = 1 by definition, we also assume that
E(a_1|x, c_1, v_2) = E(a_1) = 1. In other words, selection may be correlated with c_1
but not a_1. This model is similar to the linear model with sample selection in Section
17.4.1, where the error in the regression equation can be decomposed into two parts,
one that is correlated with v_2 (namely c_1) and one that is not (namely a_1).
To derive a simple correction, assume that (c_1, v_2) is independent of x and
bivariate normal with zero mean; v_2 also has a unit variance, so that y_2 given x
follows a probit model. These assumptions imply that E[exp(c_1)|x, v_2] =
E[exp(c_1)|v_2] = exp(ρ_0 + ρ_1 v_2) for parameters ρ_0 and ρ_1. Provided x_1
contains a constant, we can use the normalization exp(ρ_0) = 1, and we do so in what
follows. Then E(y_1|x, v_2) = exp(x_1 β_1 + ρ_1 v_2), and so by iterated
expectations,

E(y_1|x, y_2 = 1) = exp(x_1 β_1) g(x_2 δ_2, ρ_1)   (19.44)
where g(x_2 δ_2, ρ_1) ≡ E[exp(ρ_1 v_2)|v_2 > −x_2 δ_2]. By integrating the function
exp(ρ_1 v_2) against the truncated standard normal density conditional on v_2 >
−x_2 δ_2, it can be shown that g(x_2 δ_2, ρ_1) = exp(ρ_1²/2)Φ(ρ_1 + x_2 δ_2)/
Φ(x_2 δ_2), where Φ(·) is the standard normal cdf.
Given equation (19.44), we can apply a two-step method similar to Heckman's (1976)
method for linear models that we covered in Chapter 17. First, run a probit of y_2 on
x_2 using the entire sample. Let δ̂_2 be the probit estimator of δ_2. Next, on the
selected subsample for which (y_1, x_1) is observed, use a QMLE analysis with
conditional mean function exp(x_1 β_1) g(x_2 δ̂_2, ρ_1) to estimate β_1 and ρ_1. If
ρ_1 ≠ 0, then, as usual, the asymptotic variance of β̂_1 and ρ̂_1 should be adjusted
for estimation of δ_2.
Testing ρ_1 = 0 is simple if we use the robust score test. This requires the derivative
of the mean function with respect to ρ_1, evaluated at ρ_1 = 0. But ∂g(x_2 δ_2, 0)/
∂ρ_1 = λ(x_2 δ_2), where λ(·) is the usual inverse Mills ratio that appears in linear
sample selection contexts. Thus the derivative of the mean function with respect to
ρ_1, evaluated at all estimates under the null, is simply exp(x_1 β̂_1)λ(x_2 δ̂_2). This
result gives the following procedure to test for sample selection: (1) let β̂_1 be a
QMLE (for example, the Poisson) using the selected sample, and define ŷ_i1 ≡
exp(x_i1 β̂_1), û_i1 ≡ y_i1 − ŷ_i1, and ũ_i1 ≡ û_i1/√ŷ_i1 for all i in the selected
sample; (2) obtain δ̂_2 from the probit of y_2 onto x_2, using the entire sample, and
denote the estimated inverse Mills ratio for each observation i by λ̂_i2; and (3)
regress ũ_i1 onto √ŷ_i1 x_i1, √ŷ_i1 λ̂_i2 using the selected sample, and use N_1 R²_u
as asymptotically χ²_1, where N_1 is the number of observations in the selected
sample. This approach assumes that the GLM assumption holds under H_0. For the
fully robust test, first regress √ŷ_i1 λ̂_i2 onto √ŷ_i1 x_i1 using the selected sample
and save the residuals, r̃_i1; then regress 1 on ũ_i1 r̃_i1, i = 1, 2, …, N_1, and use
N_1 − SSR as asymptotically χ²_1.
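A sketch of the fully robust version of this selection test, assuming y2 and X2 are observed for the full sample, sel is the boolean selection indicator, and y1, X1 are already restricted to the selected units; it follows steps (1) and (2) and then the robust form of step (3).

```python
import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

def selection_score_test(y1, X1, y2, X2, sel):
    """Fully robust score test of rho_1 = 0 (no selection bias)."""
    # (1) Poisson QMLE on the selected sample
    b1 = sm.GLM(y1, X1, family=sm.families.Poisson()).fit().params
    yhat = np.exp(X1 @ b1)
    u_t = (y1 - yhat) / np.sqrt(yhat)           # weighted residuals
    # (2) probit on the full sample; inverse Mills ratio, selected units
    d2 = sm.Probit(y2, X2).fit(disp=0).params
    xb2 = (X2 @ d2)[sel]
    lam = norm.pdf(xb2) / norm.cdf(xb2)
    # (3) robust: partial sqrt(yhat)*lam out of sqrt(yhat)*x1, then N1 - SSR
    w = np.sqrt(yhat)
    G1, g2 = X1 * w[:, None], lam * w
    r = g2 - G1 @ np.linalg.lstsq(G1, g2, rcond=None)[0]
    z = (u_t * r)[:, None]
    ones = np.ones(len(y1))
    b = np.linalg.lstsq(z, ones, rcond=None)[0]
    ssr = np.sum((ones - z @ b) ** 2)
    return len(y1) - ssr                        # asymptotically chi2(1) under H0
```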
19.6 Panel Data Methods
In this final section, we discuss estimation of panel data models, primarily focusing
on count data. Our main interest is in models that contain unobserved effects, but we
initially cover pooled estimation when the model does not explicitly contain an
unobserved effect.

The pioneering work in unobserved effects count data models was done by Hausman,
Hall, and Griliches (1984) (HHG), who were interested in explaining patent
applications by firms in terms of spending on research and development. HHG
developed random and fixed effects models under full distributional assumptions.
Wooldridge (1999a) has shown that one of the approaches suggested by HHG, which
is typically called the fixed effects Poisson model, has some nice robustness
properties. We will study those here.

Other count panel data applications include (with response variable in parentheses)
Rose (1990) (number of airline accidents), Papke (1991) (number of firm births in an
industry), Downes and Greenstein (1996) (number of private schools in a public
school district), and Page (1995) (number of housing units shown to individuals). The
time series dimension in each of these studies allows us to control for unobserved
heterogeneity in the cross section units, and to estimate certain dynamic
relationships. As with the rest of the book, we explicitly consider the case with N
large relative to T, as the asymptotics hold with T fixed and N → ∞.
19.6.1 Pooled QMLE
As with the linear case, we begin by discussing pooled estimation after specifying a
model for a conditional mean. Let {(x_t, y_t): t = 1, 2, …, T} denote the time series
observations for a random draw from the cross section population. We assume that,
for some β_o ∈ B,

E(y_t|x_t) = m(x_t; β_o),   t = 1, 2, …, T   (19.45)

This assumption simply means that we have a correctly specified parametric model
for E(y_t|x_t). For notational convenience only, we assume that the function m itself
does not change over time. Relaxing this assumption just requires a notational
change, or we can include time dummies in x_t. For y_t ≥ 0 and unbounded from
above, the most common conditional mean is exp(x_t β). There is no restriction on
the time dependence of the observations under assumption (19.45), and x_t can
contain any observed variables. For example, a static model has x_t = z_t, where z_t
is dated contemporaneously with y_t. A finite distributed lag has x_t containing lags
of z_t. Strict exogeneity of (x_1, …, x_T), that is, E(y_t|x_1, …, x_T) = E(y_t|x_t), is
not assumed. In particular, x_t can contain lagged dependent variables, although how
these might appear in nonlinear models is not obvious (see Wooldridge, 1997c, for
some possibilities). A limitation of model (19.45) is that it does not explicitly
incorporate an unobserved effect.
For each i = 1, 2, …, N, {(x_it, y_it): t = 1, 2, …, T} denotes the time series
observations for cross section unit i. We assume random sampling from the cross
section. One approach to estimating β_o is pooled nonlinear least squares, which was
introduced in Problem 12.6. When y is a count variable, a Poisson QMLE can be
used. This approach is completely analogous to pooled probit and pooled Tobit
estimation with panel data. Note, however, that we are not assuming that the Poisson
distribution is true.
For each i the quasi-log likelihood for pooled Poisson estimation is (up to additive
constants)

l_i(β) = Σ_{t=1}^{T} {y_it log[m(x_it; β)] − m(x_it; β)} ≡ Σ_{t=1}^{T} l_it(β)   (19.46)

The pooled Poisson QMLE then maximizes the sum of l_i(β) across i = 1, …, N.
Consistency and asymptotic normality of this estimator follow from the Chapter 12
results, once we use the fact that β_o maximizes E[l_i(β)]; this follows from GMT
(1984a). Thus pooled Poisson estimation is robust in the sense that it consistently
estimates β_o under assumption (19.45) only.
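In practice the pooled Poisson QMLE is just a Poisson QMLE on the stacked (i, t) observations, with inference clustered by cross-section unit; the clustered sandwich computed below is the panel analogue of expression (19.14), as derived next. A minimal sketch, assuming a long-format panel with arrays y, X and an integer unit identifier ids, and assuming statsmodels' cluster cov_type interface:

```python
import statsmodels.api as sm

def pooled_poisson(y, X, ids):
    """Pooled Poisson QMLE on stacked panel data, with standard errors
    clustered by cross-section unit (robust to serial dependence and to
    arbitrary Var(y_t|x_t))."""
    return sm.GLM(y, X, family=sm.families.Poisson()).fit(
        cov_type='cluster', cov_kwds={'groups': ids})

# res = pooled_poisson(y, X, ids); inspect res.params and res.bse
```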
Without further assumptions we must be careful in estimating the asymptotic
variance of β̂. Let s_i(β) be the P × 1 score of l_i(β), which can be written as s_i(β) =
Σ_{t=1}^{T} s_it(β), where s_it(β) is the score of l_it(β); each s_it(β) has the form
(19.12) but with (x_it, y_it) in place of (x_i, y_i).

The asymptotic variance of √N(β̂ − β_o) has the usual form A_o⁻¹B_oA_o⁻¹, where
A_o ≡ Σ_{t=1}^{T} E[∇_β m_it(β_o)′ ∇_β m_it(β_o)/m_it(β_o)] and B_o ≡
E[s_i(β_o)s_i(β_o)′]. Consistent estimators are

Â = N⁻¹ Σ_{i=1}^{N} Σ_{t=1}^{T} ∇_β m̂_it′ ∇_β m̂_it/m̂_it   (19.47)