4.9 Chi-square tests for nominal scale data
When sorting members of a population into categories, we first specify a nominal
variable, i.e. a non-numeric data type, whose values are the mutually exclusive and
exhaustive categories. The nominal variable has n ≥ 2 categories, where n is an integer. For example, we can create a nominal variable that specifies a physical attribute with mutually exclusive categories, such as the inherited hair color of a person. The four categories of hair color could be defined as “blonde,” “red,” “brown,” and “black.”
To estimate the proportion of each category within the population, a random sample
is chosen from the population, and each individual is classified under one of n possible
categories. The proportion of the population that belongs to each category is estimated
from frequency counts in each category. In Section 4.7, we looked at the relationship
between population proportion and the binomial probability distribution, where a
dichotomous nominal variable was used for classification purposes. We used the
frequencies observed in the “success” category to calculate the sample proportion. If the frequency counts are large enough, the sample proportion is a good estimate of the population proportion and the central-limit theorem applies. Under these conditions, the sampling distribution of the sample proportion is approximately normal, and the z test can be used to test hypotheses for a single population proportion, or for the equality of proportions within two populations.
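To make the recap of Section 4.7 concrete, the following is a minimal sketch (not taken from the text) of a two-sided z test for a single population proportion, written in Python with NumPy and SciPy. The function name one_sample_proportion_z_test and the example counts (140 successes out of 400 trials, tested against H0: p = 0.30) are illustrative assumptions only.

import numpy as np
from scipy import stats

def one_sample_proportion_z_test(successes, n, p0):
    """Two-sided z test of H0: p = p0 using the normal approximation
    to the binomial (appropriate when n*p0 and n*(1 - p0) are large)."""
    p_hat = successes / n                    # sample proportion
    se = np.sqrt(p0 * (1 - p0) / n)          # standard error under H0
    z = (p_hat - p0) / se                    # z statistic
    p_value = 2 * stats.norm.sf(abs(z))      # two-sided p value
    return z, p_value

# Illustrative numbers only: 140 "successes" in a sample of 400, H0: p = 0.30
z, p = one_sample_proportion_z_test(140, 400, 0.30)
print(f"z = {z:.3f}, p-value = {p:.3f}")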
In this section, we introduce a group of methods called χ² tests (chi-square tests). We apply these procedures to test hypotheses when data consist of frequency counts under headings specified by one or more nominal variables. In any χ² test, the χ² test statistic must be calculated. The χ² statistic is related to the standard normal random variable z as follows:

\chi^2_k = \sum_{i=1}^{k} z_i^2 = \sum_{i=1}^{k} \frac{(x_i - \mu_i)^2}{\sigma_i^2},    (4.32)
where k is the degrees of freedom associated with the value of χ² and the x_i's are k normally distributed independent random variables, i.e. x_i ~ N(μ_i, σ_i). The χ² statistic with k degrees of freedom is calculated by adding k independent squared random variables z_i² that all have a N(0, 1) distribution. For example, if k = 1, then

\chi^2_1 = z^2,
and χ² has only one degree of freedom. The distribution that the χ² statistic follows is called the χ² (chi-square) distribution. This probability distribution is defined for only positive values of χ², since χ² cannot be negative. You already know that the sum of two or more normal random variables is also normally distributed. Similarly, every z² random variable follows a χ² distribution of one degree of freedom, and addition of k independent χ²_1 variables produces another χ² statistic that follows the χ² distribution of k degrees of freedom. The χ² distribution is thus a family of distributions, and the degrees of freedom specify the unique χ² distribution curve. The χ²_k distribution has a mean equal to k, i.e. the degrees of freedom associated with the χ² statistic. You should try showing this yourself. (Hint: Look at the method used to derive the variance of a normal distribution, which was presented in Section 3.5.2.)
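The claim that a sum of k independent squared N(0, 1) variables follows the χ² distribution with k degrees of freedom, and therefore has mean k, is easy to check numerically. Below is a minimal Monte Carlo sketch (not from the text) in Python with NumPy and SciPy; the choice k = 4, the sample size, and the random seed are arbitrary assumptions made for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # arbitrary seed for reproducibility
k = 4                            # degrees of freedom (illustrative choice)
n_samples = 100_000

# Each row holds k independent N(0, 1) draws; summing the squares of a row
# gives one realization of the chi-square statistic in Equation (4.32).
z = rng.standard_normal((n_samples, k))
chi2_samples = np.sum(z**2, axis=1)

print(f"empirical mean: {chi2_samples.mean():.3f} (theory: {k})")
print(f"empirical variance: {chi2_samples.var():.3f} (theory: {2 * k})")

# Kolmogorov-Smirnov comparison against SciPy's chi-square distribution with k df
ks_stat, p_value = stats.kstest(chi2_samples, stats.chi2(df=k).cdf)
print(f"KS statistic vs chi2({k}): {ks_stat:.4f}, p-value: {p_value:.3f}")

With this many samples, the empirical mean lands close to k (and the variance close to 2k, a standard property of the χ²_k distribution), consistent with the statement above.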
Broadly, there are three types of χ² tests: