of time. Inferential statistics is based solely on probability sampling. It is the only type of sampling that lends itself to theoretical specification of the sampling distributions of statistics (discussed below), which form the basis of statistical inference. The simplest type of probability sample is the simple random sample, in which each member of the population has the same chance of being selected into the sample. If n cases are to be selected from a population of size N, each population member has a probability of selection of n/N.
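As a concrete illustration, a simple random sample can be drawn with standard tools. The sketch below is a minimal example in Python (the population labels A through E are hypothetical); random.sample selects n of the N members without replacement, so each member has the same inclusion probability, n/N.

```python
import random

# Hypothetical population of N = 5 members (labels are illustrative)
population = ["A", "B", "C", "D", "E"]
n = 3

# random.sample draws n members without replacement; every member
# has the same probability of being included, n/N = 3/5
sample = random.sample(population, n)
print(sample)  # e.g., ['B', 'E', 'A']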
Sample statistics such as the sample mean and variance of a variable (ȳ and s², respectively) or a sample regression coefficient (b) indicating the effect of a predictor on an outcome are estimates of corresponding population parameters. Let θ denote any population parameter, and θ̂ the sample estimator of that parameter. In making inferences about θ based on the observed value of θ̂, we need to understand the nature of the relationship between the two. The sampling distribution of a statistic is critical to this enterprise: It is a probability distribution for a sample statistic. That is, it is an enumeration of all possible values of θ̂, together with their associated probabilities of occurrence, that would be obtained through an infinite repetition of collecting samples of size n from that population and recomputing θ̂. Although we collect only one sample and compute one value of θ̂ in practice, it is important to understand that the full distribution of θ̂ could be generated for any statistic via repeated sampling. The importance of this distribution is that it indicates the probability that θ̂ is within a specified "distance" from θ. It therefore places bounds on the degree to which we are in error in using θ̂ as an estimate of θ or in using θ̂ to test a hypothesis about θ.
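The repeated-sampling idea can be mimicked by simulation. The sketch below is a minimal Python illustration (the population values are assumed for the example, not taken from the text): it draws many samples of size n, records each sample mean, and tabulates the relative frequencies, which approximate the sampling distribution of ȳ.

```python
import random
from collections import Counter

# Assumed illustrative population values (not from the text)
population = [1, 2, 3, 4, 5]
n = 3
reps = 100_000

# Draw many samples of size n and record each sample mean; the
# relative frequencies approximate the sampling distribution of ybar
means = Counter()
for _ in range(reps):
    sample = random.sample(population, n)
    means[round(sum(sample) / n, 2)] += 1

for value, count in sorted(means.items()):
    print(f"ybar = {value}: relative frequency {count / reps:.3f}")
```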
Table 1A.1 presents a very simple illustration of the sampling distributions for the sample mean, ȳ, and the sample variance, s². As is evident in the table, the "population" consists of only five observations: A, B, C, D, and E. (The population is artificially small to keep the number of different samples manageable.) For each observation, a value is recorded for the variable Y. The mean of Y, or µ, for this population is 3 (as is easily verified), and the variance of Y, or σ², is 2. [This is also easily verified, keeping in mind that for the population, the variance of Y is σ² = Σ(Y − µ)²/N, where N is the population size.] If we draw samples from this population of n = 3, without replacement, there are 10 different possible samples that can be drawn. These are shown in the table along with the Y-values of the sample members and the sample mean and variance for each sample. The RF columns indicate the relative frequency of occurrence of each value of the sample mean and variance, respectively. These columns represent the sampling distributions of each statistic, since they indicate the probabilities associated with each different value of the sample statistics.
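Table 1A.1 itself is not reproduced in this excerpt; for concreteness, the short verification below assumes the Y-values 1, 2, 3, 4, 5 for members A through E, one assignment that is consistent with the stated µ = 3 and σ² = 2:

```latex
% Assumed Y-values (Table 1A.1 not reproduced here): Y = 1, 2, 3, 4, 5
\mu = \frac{1 + 2 + 3 + 4 + 5}{5} = 3,
\qquad
\sigma^2 = \frac{(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2}{5} = \frac{10}{5} = 2.
```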
For the sample mean, it is clear that when drawing a sample of size 3 from this population, certain values of ȳ, such as 3.33, 3, and 2.67, are twice as likely as other values. Similarly, the most likely value for s² is 2.33. We can also compute the average of the 10 sample means, denoted E(ȳ). We find that it is 3, the same as the population mean of Y. This is no accident, since it is always true that E(ȳ) = µ. This means that the sample mean is an unbiased estimator of the population mean: its average value equals the population parameter. The average sample variance, or E(s²), is 2.5, which in this case is not equal to the population variance of 2.
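These figures can be checked by exhaustive enumeration. The sketch below again assumes the Y-values 1 through 5 (which reproduce the numbers quoted in the text); it lists all 10 samples of size 3, computes each sample mean and variance, and averages them.

```python
from itertools import combinations
from statistics import mean, variance

# Assumed Y-values for members A-E, consistent with mu = 3 and sigma^2 = 2
Y = [1, 2, 3, 4, 5]

samples = list(combinations(Y, 3))           # all 10 samples of size n = 3
means = [mean(s) for s in samples]
variances = [variance(s) for s in samples]   # s^2 uses divisor n - 1

print(len(samples))        # 10
print(mean(means))         # 3.0 -> E(ybar) = mu
print(mean(variances))     # 2.5 -> E(s^2) differs from sigma^2 = 2 here
```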
However, under ordinary sampling conditions with infinite (or approximately infinite) populations, it is the case that E(s²) = σ². With finite populations we must apply a finite