
14.2 Contingency Tables: Testing for Independence 533
probabilities, that is, if
$p_{ij} = P(X = x_i, Y = y_j) = P(X = x_i) P(Y = y_j) = p_{i\cdot} \times p_{\cdot j}$
for each $i, j$. If there exists a cell $(i, j)$ for which
$p_{ij} \neq p_{i\cdot} \times p_{\cdot j}$, then X and Y are
dependent. The marginal distributions for components X and Y are obtained
by taking the sums of probabilities in the table, row-wise and column-wise,
respectively:
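$$
\begin{array}{c|cccc}
X & x_1 & x_2 & \cdots & x_r \\ \hline
p & p_{1\cdot} & p_{2\cdot} & \cdots & p_{r\cdot}
\end{array}
\quad \text{and} \quad
\begin{array}{c|cccc}
Y & y_1 & y_2 & \cdots & y_c \\ \hline
p & p_{\cdot 1} & p_{\cdot 2} & \cdots & p_{\cdot c}
\end{array}.
$$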
Example 14.1. If (X, Y) is defined by
$$
\begin{array}{r|cccc}
X \backslash Y & 10 & 20 & 30 & 40 \\ \hline
-1 & 0.1 & 0.2 & 0 & 0.05 \\
0 & 0.2 & 0 & 0.05 & 0.1 \\
1 & 0.1 & 0.1 & 0.05 & 0.05
\end{array}
$$
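then the marginals are
$$
\begin{array}{c|ccc}
X & -1 & 0 & 1 \\ \hline
p & 0.35 & 0.35 & 0.3
\end{array}
\quad \text{and} \quad
\begin{array}{c|cccc}
Y & 10 & 20 & 30 & 40 \\ \hline
p & 0.4 & 0.3 & 0.1 & 0.2
\end{array},
$$
and X and Y are dependent since we found a cell, for example, (2, 1), such that
$0.2 = P(X = 0, Y = 10) \neq P(X = 0) \cdot P(Y = 10) = 0.35 \cdot 0.4 = 0.14$.
As we indicated, it is sufficient for one cell to violate the condition
$p_{ij} = p_{i\cdot} \times p_{\cdot j}$ in order for X and
Y to be dependent.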
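The check in Example 14.1 can be reproduced numerically. The following is a minimal Python sketch, assuming the joint table is stored as a list of rows. The names `joint`, `p_row`, `p_col`, and `violations` are illustrative and do not come from the text, and the tolerance passed to `math.isclose` is an arbitrary choice to absorb floating-point rounding.

```python
import math

# Joint probabilities from Example 14.1.
# Rows are X = -1, 0, 1; columns are Y = 10, 20, 30, 40.
joint = [
    [0.10, 0.20, 0.00, 0.05],
    [0.20, 0.00, 0.05, 0.10],
    [0.10, 0.10, 0.05, 0.05],
]

# Marginals: row sums give the distribution of X, column sums that of Y.
p_row = [sum(row) for row in joint]
p_col = [sum(col) for col in zip(*joint)]

# Collect the cells (1-based, as in the text) where p_ij differs from
# the product of the corresponding marginals.
violations = [
    (i + 1, j + 1)
    for i, row in enumerate(joint)
    for j, p_ij in enumerate(row)
    if not math.isclose(p_ij, p_row[i] * p_col[j], abs_tol=1e-12)
]

# X and Y are independent only if no cell violates the product condition.
independent = not violations

print("marginal of X:", [round(p, 2) for p in p_row])
print("marginal of Y:", [round(p, 2) for p in p_col])
print("independent:", independent)
print("cell (2, 1) violates the condition:", (2, 1) in violations)
```

A single entry in `violations` is enough to conclude dependence, which is why the sketch only tests whether the list is empty.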
Instead of random variables and cell probabilities, we will consider an
empirical counterpart, a table of observed frequencies. The table is defined by
the levels of two factors R and C. The levels are not necessarily numerical
but could be, and most often are, categorical, ordinal, or interval. For example,
when assessing the possible dependence between gender (factor R) and
personal income (factor C), the levels for R are categorical, {male, female}, and
the levels for C are intervals, say,
{[0,30K), [30K,60K), [60K,100K), ≥ 100K}. In the table
below, factor R has r levels coded as 1, ..., r and factor C has c levels coded as
1, ..., c. A cell (i, j) is the intersection of the ith row and the jth column and
contains $n_{ij}$ observations. The sum of the ith row is denoted by $n_{i\cdot}$,
while the sum of the jth column is denoted by $n_{\cdot j}$.
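$$
\begin{array}{c|cccc|c}
 & 1 & 2 & \cdots & c & \text{Total} \\ \hline
1 & n_{11} & n_{12} & \cdots & n_{1c} & n_{1\cdot} \\
2 & n_{21} & n_{22} & \cdots & n_{2c} & n_{2\cdot} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
r & n_{r1} & n_{r2} & \cdots & n_{rc} & n_{r\cdot} \\ \hline
\text{Total} & n_{\cdot 1} & n_{\cdot 2} & \cdots & n_{\cdot c} & n_{\cdot\cdot}
\end{array}
$$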
Denote the total number of observations
$n_{\cdot\cdot} = \sum_{i=1}^{r} n_{i\cdot} = \sum_{j=1}^{c} n_{\cdot j}$ simply by $n$.
The empirical probability of the cell (i, j) is $n_{ij}/n$, and the empirical marginal
probabilities of levels i and j are $n_{i\cdot}/n$ and $n_{\cdot j}/n$, respectively.
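The notation above can be illustrated with a short Python sketch. The counts below are made up purely for illustration (a hypothetical 2 x 4 table such as gender by income bracket) and are not data from the text. The variable names are likewise assumptions.

```python
# Hypothetical 2 x 4 table of observed counts n_ij (made-up numbers):
# rows are the levels of factor R, columns are the levels of factor C.
counts = [
    [20, 30, 25, 5],
    [25, 35, 15, 5],
]

row_totals = [sum(row) for row in counts]        # n_i. for each row i
col_totals = [sum(col) for col in zip(*counts)]  # n_.j for each column j
n = sum(row_totals)                              # n_.. , denoted simply by n

# The grand total is the same whether row totals or column totals are summed.
assert n == sum(col_totals)

# Empirical cell probabilities n_ij / n and marginal probabilities.
cell_probs = [[n_ij / n for n_ij in row] for row in counts]
row_probs = [n_i / n for n_i in row_totals]      # n_i. / n
col_probs = [n_j / n for n_j in col_totals]      # n_.j / n

print("row totals:", row_totals, "column totals:", col_totals, "n:", n)
print("empirical row marginals:", row_probs)
print("empirical column marginals:", col_probs)
```

Each set of empirical marginal probabilities sums to 1, mirroring the marginal distributions of X and Y in the probabilistic setting.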
When factors R and C are independent, the frequency in the cell (i, j) is
expected to be $n \cdot p_{i\cdot} \cdot p_{\cdot j}$. This can be estimated by empirical frequencies