within the range of 0 to 1, and it is nondecreasing as regressors are added to a model.
The first measure I consider is the likelihood-ratio index, also considered for logistic regression models. The formula in count models is essentially the same:

$$
R_L^2 = 1 - \frac{\ln L_1}{\ln L_0},
$$
where $L_1$ is the likelihood for the estimated model, evaluated at its MLEs, and $L_0$ is
the likelihood for the intercept-only model. This measure is nondecreasing as predictors are added, since the likelihood never decreases as parameters are added to the model. (If the likelihood could decrease with addition of predictors, $\ln L_1$, which is a negative value, could become larger in magnitude, implying a smaller $R_L^2$.) In logistic regression, this measure is also bounded by 0 and 1. However, in count models, this measure cannot attain its upper bound of 1 (Cameron and Trivedi, 1998), and so may underestimate the discriminatory power of any particular model.
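As an illustration, the index is easily computed from the log-likelihoods of the fitted and intercept-only models. The following is a minimal sketch in Python using statsmodels; the simulated data and variable names are purely hypothetical, and any fitted Poisson model and its intercept-only counterpart would serve.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical illustration data: a design matrix X and a vector of counts y.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3, -0.2])))

# Hypothesized Poisson model and intercept-only model.
fit1 = sm.GLM(y, X, family=sm.families.Poisson()).fit()
fit0 = sm.GLM(y, np.ones((len(y), 1)), family=sm.families.Poisson()).fit()

# Likelihood-ratio index: R^2_L = 1 - ln L1 / ln L0.
r2_L = 1 - fit1.llf / fit0.llf
```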
A second analog employed by some statisticians (Land et al., 1996) is the correlation between Y and its predicted value according to the model. Recall that this gives us the $R^2$ for linear regression. Hence, this measure is $r^2 = [\operatorname{corr}(y, \hat{\mu})]^2$.
Although this measure is bounded by 0 and 1, it is not necessarily nondecreasing
with the addition of parameters. The advantage to these first two measures, on the
other hand, is that they are readily calculated from output produced by count-model
software.
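A minimal sketch of this second measure, assuming the observed counts y and the model's fitted means (for example, fit1.fittedvalues from the sketch above) are in hand:

```python
import numpy as np

def corr_r2(y, mu_hat):
    """r^2 = [corr(y, mu_hat)]^2: the squared Pearson correlation
    between observed counts and model-predicted counts."""
    return np.corrcoef(y, mu_hat)[0, 1] ** 2

# e.g., corr_r2(y, fit1.fittedvalues) with the objects from the previous sketch
```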
The third measure, proposed by Cameron and Windmeijer (1997), is the deviance $R^2$. It is defined as follows. First, we define the Kullback–Leibler (KL) divergence, a measure of the discrepancy between two likelihoods. Let $\mathbf{y}$ be the vector of observed counts and $\hat{\boldsymbol{\mu}}$ be the vector of predicted counts based on a given model. Further, let $\ell(\hat{\boldsymbol{\mu}}_0, \mathbf{y})$ be the log-likelihood for the intercept-only model, $\ell(\hat{\boldsymbol{\mu}}, \mathbf{y})$ the log-likelihood for the hypothesized model, and $\ell(\mathbf{y}, \mathbf{y})$ the maximum log-likelihood achievable. This last would be the log-likelihood for a saturated model, one with as many parameters as observations. Then the KL divergence between saturated and intercept-only models, $K(\mathbf{y}, \hat{\boldsymbol{\mu}}_0)$, equals $2[\ell(\mathbf{y}, \mathbf{y}) - \ell(\hat{\boldsymbol{\mu}}_0, \mathbf{y})]$. This represents an estimate of the information on y, in sample data, that is “potentially recoverable by inclusion of regressors” (Cameron and Windmeijer, 1997, p. 333) and corresponds to the TSS in linear regression. The information on y that remains after regressors are included in the model is the KL divergence between saturated and fitted models, $K(\mathbf{y}, \hat{\boldsymbol{\mu}})$, which is equal to $2[\ell(\mathbf{y}, \mathbf{y}) - \ell(\hat{\boldsymbol{\mu}}, \mathbf{y})]$. This is analogous to the SSE in linear regression. Finally, the deviance $R^2$ is

$$
R_D^2 = 1 - \frac{K(\mathbf{y}, \hat{\boldsymbol{\mu}})}{K(\mathbf{y}, \hat{\boldsymbol{\mu}}_0)}.
$$
The reader should recognize that the right-hand side of $R_D^2$ is analogous to $1 - \mathrm{SSE}/\mathrm{TSS}$, the $R^2$ in linear regression. In this application, however, $R_D^2$ does not have an “explained variance” interpretation. Rather, it is “the fraction of the maximum potential likelihood gain (starting with a constant-only model) achieved by the fitted model” (Cameron and Windmeijer, 1997, p. 338). $R_D^2$ possesses both of the other properties of a desirable $R^2$ analog: It is bounded by 0 and 1 and it is nondecreasing as regressors are added to the model.
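The deviance $R^2$ can be computed directly from these quantities. The following is a minimal sketch for the Poisson case; the function names are hypothetical, and $\hat{\boldsymbol{\mu}}_0$ is taken to be the sample mean of y, its ML estimate in the intercept-only model.

```python
import numpy as np

def poisson_kl(y, mu):
    """K(y, mu) = 2[l(y, y) - l(mu, y)] for the Poisson likelihood.
    The ln(y!) terms cancel, and y*ln(y/mu) is taken as 0 when y = 0."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))

def deviance_r2(y, mu_hat):
    """R^2_D = 1 - K(y, mu_hat) / K(y, mu_hat_0)."""
    mu_hat0 = np.full_like(np.asarray(y, dtype=float), np.mean(y))
    return 1.0 - poisson_kl(y, mu_hat) / poisson_kl(y, mu_hat0)
```

For the Poisson model, $K(\mathbf{y}, \hat{\boldsymbol{\mu}})$ and $K(\mathbf{y}, \hat{\boldsymbol{\mu}}_0)$ are simply the residual and null deviances reported by most GLM software, so $R_D^2$ can also be read directly off standard output as one minus their ratio.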