management systems now store the actual distribution of
attribute values as a histogram. In a histogram, the values
of an attribute are divided into ranges, and within each
range, a count of the number of rows whose attribute falls
within that range is made.
In the example above we were given the selectivity of
qty ¼ 1000 to be .6. If we know that there are 2,000 differ-
ent quantities in the shipment table out of 100,000 rows,
then the average number of rows for a given quantity
would be 100,000/2,000 ¼ 50. Therefore, the selectivity of
qty ¼ 1000 would be 50/100,000 ¼ .0005. If we have stored
a histogram of quantities in ranges consisting of integer
values: 1, 2, 3, 4, ......, 1,000, 1,001,......2,000, and found
that we had 60,000 rows containing quantity values equal
to 1,000, we would estimate the selectivity of qty ¼ 1000
to be .6. This is a huge difference in accuracy that would
have dramatic effects on query execution plan cost estima-
tion and optimal plan selection.
3.5.3 Estimating the Selectivity Factor for a Join
Estimating the selectivity for a join is difficult if it is based
on nonkeys; in the worst case it can be a Cartesian product
at one extreme or no matches at all at the other extreme. We
focus here on the estimate based on the usual scenario for
joins between a primary key and a nonkey (a foreign key).
Let’s take, for example, the join between a table R1, which
has a primary key, and a table R2, which has a foreign key:
cardðR1 join R2Þ¼S cardð R1ÞcardðR2Þ, 3.6
where S is the selectivity of the common attribute used in
the join, when that attribute is used as a primary key. Let’s
illustrate this computation of the selectivity and then the
size of the joined table, either the final result of the query
or an intermediate table in the query.
3.5.4 Example Query 3.2
Find all suppliers in London with the shipping date of
June 1, 2006.
SELECT supplierName
FROM supplier S, shipment SH
Chapter 3 QUERY OPTIMIZATION AND PLAN SELECTION 19