actually work in practice.) Hence I constructed a data set consisting of 1000 observations with the following variables: a variable ε that is normally distributed with mean 0 and variance 4, a variable u that is normally distributed with mean 0 and variance 1, a variable X that is normally distributed with mean 3 and variance 1.75 but is assumed to be fixed over repeated sampling, a variable W equal to 1 + .75X + u, and a variable Y equal to −2 + 1.5X + ε. Moreover, ε and u were created so that their joint distribution is bivariate normal, and their correlation, ρ, is .707. Otherwise, ε and u are both uncorrelated with X.
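A minimal sketch of how such a data set could be generated, here in Python with NumPy (the seed and variable names are my own choices, not the author's):

```python
import numpy as np

rng = np.random.default_rng(12345)   # arbitrary seed
n = 1000

# X ~ N(3, 1.75), treated as fixed over repeated sampling
X = rng.normal(3.0, np.sqrt(1.75), n)

# (eps, u) bivariate normal: V(eps) = 4, V(u) = 1, corr = .707,
# so Cov(eps, u) = .707 * 2 * 1 = 1.414; both independent of X
cov = [[4.0, 1.414],
       [1.414, 1.0]]
eps, u = rng.multivariate_normal([0.0, 0.0], cov, n).T

W = 1 + 0.75 * X + u     # used later to generate incidental truncation
Y = -2 + 1.5 * X + eps   # outcome of primary interest
```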
What we essentially have, then, is a random sample of n = 1000 observations on two variables of primary interest: X and Y. Moreover, Y follows a linear regression on X with an intercept of −2 and a slope of 1.5. The conditional error, ε, is normally distributed with a mean of zero and a variance of 4. The unconditional mean of Y is E(Y) = E(−2 + 1.5X + ε) = −2 + 1.5E(X) + E(ε) = −2 + 1.5(3) + 0 = 2.5. The unconditional variance of Y can be recovered by noting that if Y = −2 + 1.5X + ε, then, because ε is uncorrelated with X, V(Y) = V(−2 + 1.5X + ε) = V(−2 + 1.5X) + V(ε) = (1.5)²V(X) + V(ε) = 2.25(1.75) + 4 = 7.9375. Moreover, the contribution due to the structural part of the model, the linear predictor, is 2.25(1.75) = 3.9375, which represents 49.6% of the total variance of Y. Hence P², the population R² for the regression, is .496. The regression model for Y is one of the equations we will be trying to estimate in what follows.
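These population quantities are easy to check in a few lines, continuing the hypothetical sketch above:

```python
mean_Y = -2 + 1.5 * 3        # E(Y) = 2.5
var_lp = 1.5**2 * 1.75       # V(1.5X) = 2.25(1.75) = 3.9375
var_Y = var_lp + 4.0         # plus V(eps) = 4, giving 7.9375
pop_R2 = var_lp / var_Y      # P^2 = 3.9375 / 7.9375 = .496
```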
To create truncated and censored data, I proceeded as follows. To truncate or censor 40% of the observations on Y, I created a variable Y* = Y − c, where c is the value (1.859) representing the 40th percentile of the sample distribution of Y. This simply sets the truncation point and censoring threshold at zero rather than at c. Then I created a truncated version of Y, Y_t, by setting Y_t to missing when Y* ≤ 0, and setting Y_t = Y* otherwise. To create a censored version of Y, Y_c, I set Y_c to 0 whenever Y* ≤ 0, and I set Y_c = Y* otherwise. Notice that this changes the underlying regression model slightly. By subtracting 1.859 from both sides of the equation for Y, we see that Y* = −3.859 + 1.5X + ε. This is the equation we are trying to estimate in the truncated and censored regression examples. In the sample, we have 601 observations on Y_t, with the other 399 observations missing on Y_t. We have 1000 observations on Y_c, but for 399 of them the value of Y_c is zero. For the other 601 observations, Y_c takes on positive values.
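Assuming the NumPy variables from the earlier sketch, the truncated and censored versions might be constructed as follows (with np.nan standing in for a missing value):

```python
c = np.quantile(Y, 0.40)     # the author's sample value is 1.859
Y_star = Y - c               # shift so the threshold is zero

Y_t = np.where(Y_star > 0, Y_star, np.nan)  # truncated: missing when Y* <= 0
Y_c = np.where(Y_star > 0, Y_star, 0.0)     # censored: zero when Y* <= 0
```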
The incidentally truncated data were created in a somewhat different manner. First, I chose a new value of c (3.634) that constituted the 60th percentile of the sample distribution of W. I then constructed a variable Z* equal to W − 3.634. In this case, Z* is less than or equal to zero for 60% of the cases. I then created a dummy variable, Z, equal to 0 whenever Z* ≤ 0, and equal to 1 otherwise. Finally, I set Y to missing if Z = 0; otherwise, Y is left as is. In the current example, then, only 400 cases have valid scores on Y. The variable Y here is said to be incidentally truncated.
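As a hypothetical continuation of the earlier NumPy sketch, this selection step might be implemented as:

```python
c2 = np.quantile(W, 0.60)        # the author's sample value is 3.634
Z_star = W - c2                  # selection propensity
Z = (Z_star > 0).astype(int)     # selection indicator

# Y is observed only for selected cases (Z = 1); np.nan marks missing
Y_inc = np.where(Z == 1, Y, np.nan)
```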
I refer to Z* as the selection propensity. When Z* is above zero, the case is selected into the current sample (of responses) and we observe Y. Otherwise, we do not observe Y for that case, although we do observe X. This is the model employed to understand and correct for self-selection bias (more on this below). Recall that Y has a mean of 2.5 and a variance of 7.9375. What does incidental truncation do to the population mean and variance of the truncated response? As before, I draw on a result presented in