Elmasri R., Navathe S.B. Fundamentals of Database Systems


722 Chapter 19 Algorithms for Query Processing and Optimization

The Oracle optimizer calculates this cost based on the estimated usage of resources,

such as I/O, CPU time, and memory needed. The goal of cost-based optimization in

Oracle is to minimize the elapsed time to process the entire query.

An interesting addition to the Oracle query optimizer is the capability for an application developer to specify hints to the optimizer. The idea is that an application developer might know more information about the data than the optimizer. For example, consider the EMPLOYEE table shown in Figure 3.6. The Sex column of that table has only two distinct values. If there are 10,000 employees, then the optimizer would estimate that half are male and half are female, assuming a uniform data distribution. If a secondary index exists, it would more than likely not be used. However, if the application developer knows that there are only 100 male employees, a hint could be specified in an SQL query whose WHERE-clause condition is Sex = 'M' so that the associated index would be used in processing the query. Various hints can be specified, such as:

■ The optimization approach for an SQL statement
■ The access path for a table accessed by the statement
■ The join order for a join statement
■ A particular join operation in a join statement

The cost-based optimization of Oracle 8 and later versions is a good example of the

sophisticated approach taken to optimize SQL queries in commercial RDBMSs.
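To make the Sex-column example above concrete, the following minimal Python sketch compares the optimizer's uniform-distribution estimate with the developer's actual knowledge. The 5 percent selectivity threshold and both function names are assumptions made purely for illustration; they do not describe Oracle's actual decision rule. In Oracle, a hint is written as a specially formatted comment in the statement, for example /*+ INDEX(e emp_sex_idx) */, where emp_sex_idx is a hypothetical index name.

```python
def uniform_estimate(num_rows: int, num_distinct: int) -> float:
    """Rows expected to match an equality condition under a uniform distribution."""
    return num_rows / num_distinct


def index_worthwhile(matching_rows: float, num_rows: int, threshold: float = 0.05) -> bool:
    """Hypothetical rule of thumb: use a secondary index only for selective conditions."""
    return matching_rows / num_rows <= threshold


NUM_EMPLOYEES = 10_000

# What the optimizer assumes: two distinct values, so half the rows match Sex = 'M'.
estimated_male = uniform_estimate(NUM_EMPLOYEES, num_distinct=2)

# What the application developer knows about the real data.
actual_male = 100

print(estimated_male)                                   # 5000.0
print(index_worthwhile(estimated_male, NUM_EMPLOYEES))  # False
print(index_worthwhile(actual_male, NUM_EMPLOYEES))     # True
```

Under the uniform assumption the index looks unattractive, whereas the true count of 100 rows makes it clearly selective; the hint is the developer's way of passing that knowledge to the optimizer.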

19.10 Semantic Query Optimization

A different approach to query optimization, called semantic query optimization,

has been suggested. This technique, which may be used in combination with the

techniques discussed previously, uses constraints specified on the database

schema—such as unique attributes and other more complex constraints—in order

to modify one query into another query that is more efficient to execute. We will not

discuss this approach in detail but we will illustrate it with a simple example.

Consider the SQL query:

SELECT E.Lname, M.Lname
FROM EMPLOYEE AS E, EMPLOYEE AS M
WHERE E.Super_ssn = M.Ssn AND E.Salary > M.Salary

This query retrieves the names of employees who earn more than their supervisors.

Suppose that we had a constraint on the database schema that stated that no

employee can earn more than his or her direct supervisor. If the semantic query

optimizer checks for the existence of this constraint, it does not need to execute the

query at all because it knows that the result of the query will be empty. This may

save considerable time if the constraint checking can be done efficiently. However,

searching through many constraints to find those that are applicable to a given query and that may semantically optimize it can also be quite time-consuming. With the inclusion of active rules and additional metadata in database systems (see Chapter 26), semantic query optimization techniques are being gradually incorporated into the DBMSs.

Footnote: The optimizer hints described in the preceding section have also been called query annotations.
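The empty-result shortcut in the supervisor-salary example of Section 19.10 can be sketched as follows. This is a simplified illustration, not how any particular DBMS implements semantic optimization: the constraint catalog, the tuple encoding of predicates, and the function run_with_semantic_check are all hypothetical, and the query predicate is supplied by hand rather than parsed from the SQL text. SQLite (through Python's sqlite3 module) stands in for the DBMS.

```python
import sqlite3

# Comparison operators paired with their logical negations.
NEGATION = {">": "<=", "<=": ">", "<": ">=", ">=": "<"}

# Hypothetical constraint catalog: for every pair (E, M) joined on
# E.Super_ssn = M.Ssn, the schema declares that E.Salary <= M.Salary.
CONSTRAINTS = {("E.Salary", "<=", "M.Salary")}

QUERY = """
    SELECT E.Lname, M.Lname
    FROM EMPLOYEE AS E, EMPLOYEE AS M
    WHERE E.Super_ssn = M.Ssn AND E.Salary > M.Salary
"""
QUERY_PREDICATE = ("E.Salary", ">", "M.Salary")  # entered by hand for this sketch


def run_with_semantic_check(conn, sql, predicate, constraints):
    """Return (rows, executed); skip execution if a constraint contradicts the predicate."""
    left, op, right = predicate
    if (left, NEGATION[op], right) in constraints:
        return [], False  # the constraint guarantees an empty result
    return conn.execute(sql).fetchall(), True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (Ssn TEXT, Lname TEXT, Salary INTEGER, Super_ssn TEXT)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)",
    [("1", "Borg", 55000, None), ("2", "Wong", 40000, "1"), ("3", "Smith", 30000, "2")],
)

print(run_with_semantic_check(conn, QUERY, QUERY_PREDICATE, CONSTRAINTS))  # ([], False)
print(run_with_semantic_check(conn, QUERY, QUERY_PREDICATE, set()))        # ([], True)
```

Both calls return an empty result, but only the second actually executes the self-join; the first answers from the constraint alone.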

19.11 Summary

In this chapter we gave an overview of the techniques used by DBMSs in processing and optimizing high-level queries. We first discussed how SQL queries are translated into relational algebra and then how various relational algebra operations may be executed by a DBMS. We saw that some operations, particularly SELECT and JOIN, may have many execution options. We also discussed how operations can be combined during query processing to create pipelined or stream-based execution instead of materialized execution.

Following that, we described heuristic approaches to query optimization, which use

heuristic rules and algebraic techniques to improve the efficiency of query execu-

tion. We showed how a query tree that represents a relational algebra expression can

be heuristically optimized by reorganizing the tree nodes and transforming it

into another equivalent query tree that is more efficient to execute. We also gave

equivalence-preserving transformation rules that may be applied to a query tree.

Then we introduced query execution plans for SQL queries, which add method exe-

cution plans to the query tree operations.

We discussed the cost-based approach to query optimization. We showed how cost

functions are developed for some database access algorithms and how these cost

functions are used to estimate the costs of different execution strategies. We pre-

sented an overview of the Oracle query optimizer, and we mentioned the technique

of semantic query optimization.

Review Questions

19.1. Discuss the reasons for converting SQL queries into relational algebra

queries before optimization is done.

19.2. Discuss the different algorithms for implementing each of the following relational operators and the circumstances under which each algorithm can be used: SELECT, JOIN, PROJECT, UNION, INTERSECT, SET DIFFERENCE, CARTESIAN PRODUCT.

19.3. What is a query execution plan?

19.4. What is meant by the term heuristic optimization? Discuss the main heuris-

tics that are applied during query optimization.

19.5. How does a query tree represent a relational algebra expression? What is

meant by an execution of a query tree? Discuss the rules for transformation

of query trees and identify when each rule should be applied during opti-

mization.

19.6. How many different join orders are there for a query that joins 10 relations?

19.7. What is meant by cost-based query optimization?

19.8. What is the difference between pipelining and materialization?

19.9. Discuss the cost components for a cost function that is used to estimate

query execution cost. Which cost components are used most often as the

basis for cost functions?

19.10. Discuss the different types of parameters that are used in cost functions.

Where is this information kept?

19.11. List the cost functions for the SELECT and JOIN methods discussed in

Section 19.8.

19.12. What is meant by semantic query optimization? How does it differ from

other query optimization techniques?

Exercises

19.13. Consider SQL queries Q1, Q8, Q1B, and Q4 in Chapter 4 and Q27 in

Chapter 5.

a. Draw at least two query trees that can represent each of these queries.

Under what circumstances would you use each of your query trees?

b. Draw the initial query tree for each of these queries, and then show how

the query tree is optimized by the algorithm outlined in Section 19.7.

c. For each query, compare your own query trees of part (a) and the initial

and final query trees of part (b).

19.14. A file of 4096 blocks is to be sorted with an available buffer space of 64

blocks. How many passes will be needed in the merge phase of the external

sort-merge algorithm?

19.15. Develop cost functions for the PROJECT, UNION, INTERSECTION, SET DIFFERENCE, and CARTESIAN PRODUCT algorithms discussed in Section 19.4.

19.16. Develop cost functions for an algorithm that consists of two SELECTs, a

JOIN, and a final PROJECT, in terms of the cost functions for the individual

operations.

19.17. Can a nondense index be used in the implementation of an aggregate opera-

tor? Why or why not?

19.18. Calculate the cost functions for different options of executing the JOIN operation OP7 discussed in Section 19.3.2.

19.19. Develop formulas for the hybrid hash-join algorithm for calculating the size

of the buffer for the first bucket. Develop more accurate cost estimation for-

mulas for the algorithm.

19.20. Estimate the cost of operations OP6 and OP7, using the formulas developed in Exercise 19.9.

19.21. Extend the sort-merge join algorithm to implement the LEFT OUTER JOIN

operation.

19.22. Compare the cost of two different query plans for the following query:

$\sigma_{\text{Salary} > 40000}(\text{EMPLOYEE} \bowtie_{\text{Dno} = \text{Dnumber}} \text{DEPARTMENT})$

Use the database statistics in Figure 19.8.

Selected Bibliography

A detailed algorithm for relational algebra optimization is given by Smith and

Chang (1975). The Ph.D. thesis of Kooi (1980) provides a foundation for query pro-

cessing techniques. A survey paper by Jarke and Koch (1984) gives a taxonomy of

query optimization and includes a bibliography of work in this area. A survey by

Graefe (1993) discusses query execution in database systems and includes an exten-

sive bibliography.

Whang (1985) discusses query optimization in OBE (Office-By-Example), which is

a system based on the language QBE. Cost-based optimization was introduced in

the SYSTEM R experimental DBMS and is discussed in Astrahan et al. (1976).

Selinger et al. (1979) is a classic paper that discussed cost-based optimization of

multiway joins in SYSTEM R. Join algorithms are discussed in Gotlieb (1975),

Blasgen and Eswaran (1976), and Whang et al. (1982). Hashing algorithms for

implementing joins are described and analyzed in DeWitt et al. (1984),

Bratbergsengen (1984), Shapiro (1986), Kitsuregawa et al. (1989), and Blakeley and

Martin (1990), among others. Approaches to finding a good join order are pre-

sented in Ioannidis and Kang (1990) and in Swami and Gupta (1989). A discussion

of the implications of left-deep and bushy join trees is presented in Ioannidis and

Kang (1991). Kim (1982) discusses transformations of nested SQL queries into

canonical representations. Optimization of aggregate functions is discussed in Klug

(1982) and Muralikrishna (1992). Salzberg et al. (1990) describe a fast external sort-

ing algorithm. Estimating the size of temporary relations is crucial for query opti-

mization. Sampling-based estimation schemes are presented in Haas et al. (1995)

and in Haas and Swami (1995). Lipton et al. (1990) also discuss selectivity estima-

tion. Having the database system store and use more detailed statistics in the form

of histograms is the topic of Muralikrishna and DeWitt (1988) and Poosala et al.

(1996).

Kim et al. (1985) discuss advanced topics in query optimization. Semantic query

optimization is discussed in King (1981) and Malley and Zdonick (1986). Work on

semantic query optimization is reported in Chakravarthy et al. (1990), Shenoy and

Ozsoyoglu (1989), and Siegel et al. (1992).


Chapter 20 Physical Database Design and Tuning

In the last chapter we discussed various techniques by

which queries can be processed efficiently by the

DBMS. These techniques are mostly internal to the DBMS and invisible to the pro-

grammer. In this chapter we discuss additional issues that affect the performance of

an application running on a DBMS. In particular, we discuss some of the options

available to database administrators and programmers for storing databases, and

some of the heuristics, rules, and techniques that they can use to tune the database

for performance improvement. First, in Section 20.1, we discuss the issues that arise

in physical database design dealing with storage and access of data. Then, in Section

20.2, we discuss how to improve database performance through tuning, indexing of

data, database design, and the queries themselves.

20.1 Physical Database Design

in Relational Databases

In this section, we begin by discussing the physical design factors that affect the per-

formance of applications and transactions, and then we comment on the specific

guidelines for RDBMSs.

20.1.1 Factors That Influence Physical Database Design

Physical design is an activity where the goal is not only to create the appropriate

structuring of data in storage, but also to do so in a way that guarantees good per-

formance. For a given conceptual schema, there are many physical design alterna-

tives in a given DBMS. It is not possible to make meaningful physical design


decisions and performance analyses until the database designer knows the mix of

queries, transactions, and applications that are expected to run on the database.

This is called the job mix for the particular set of database system applications. The

database administrators/designers must analyze these applications, their expected

frequencies of invocation, any timing constraints on their execution speed, the

expected frequency of update operations, and any unique constraints on attributes.

We discuss each of these factors next.

A. Analyzing the Database Queries and Transactions. Before undertaking

the physical database design, we must have a good idea of the intended use of the

database by defining in a high-level form the queries and transactions that are

expected to run on the database. For each retrieval query, the following informa-

tion about the query would be needed:

1. The files that will be accessed by the query.

2. The attributes on which any selection conditions for the query are specified.

3. Whether the selection condition is an equality, inequality, or a range condi-

tion.

4. The attributes on which any join conditions or conditions to link multiple

tables or objects for the query are specified.

5. The attributes whose values will be retrieved by the query.

The attributes listed in items 2 and 4 above are candidates for the definition of

access structures, such as indexes, hash keys, or sorting of the file.

For each update operation or update transaction, the following information

would be needed:

1. The files that will be updated.

2. The type of operation on each file (insert, update, or delete).

3. The attributes on which selection conditions for a delete or update are spec-

ified.

4. The attributes whose values will be changed by an update operation.

Again, the attributes listed in item 3 are candidates for access structures on the files,

because they would be used to locate the records that will be updated or deleted. On

the other hand, the attributes listed in item 4 are candidates for avoiding an access

structure, since modifying them will require updating the access structures.

B. Analyzing the Expected Frequency of Invocation of Queries and Transactions. Besides identifying the characteristics of expected retrieval queries and update transactions, we must consider their expected rates of invocation. This frequency information, along with the attribute information collected on each query and transaction, is used to compile a cumulative list of the expected frequency of use for all queries and transactions. This is expressed as the expected frequency of using each attribute in each file as a selection attribute or a join attribute, over all the queries and transactions. Generally, for large volumes of processing, the informal 80–20 rule can be used: approximately 80 percent of the processing is accounted for by only 20 percent of the queries and transactions. Therefore, in practical situations, it is rarely necessary to collect exhaustive statistics and invocation rates on all the queries and transactions; it is sufficient to determine the 20 percent or so most important ones.

Footnote: For simplicity we use the term files here, but this can also mean tables or relations.
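The cumulative frequency list and the 80–20 rule described in item B can be illustrated with a short Python sketch. The workload below (query names, daily invocation counts, and the attributes each query uses for selection or join) is invented for illustration, and the helper names attribute_usage and dominant_queries are hypothetical.

```python
from collections import Counter

# Hypothetical job mix: (query name, invocations per day, selection/join attributes).
WORKLOAD = [
    ("Q1", 900, ["EMPLOYEE.Ssn"]),
    ("Q2", 500, ["EMPLOYEE.Dno", "DEPARTMENT.Dnumber"]),
    ("Q3", 40, ["EMPLOYEE.Salary"]),
    ("Q4", 30, ["PROJECT.Pnumber"]),
    ("Q5", 20, ["EMPLOYEE.Lname"]),
    ("Q6", 10, ["EMPLOYEE.Salary", "EMPLOYEE.Dno"]),
]


def attribute_usage(workload):
    """Cumulative frequency with which each attribute is used for selection or join."""
    usage = Counter()
    for _name, frequency, attributes in workload:
        for attribute in attributes:
            usage[attribute] += frequency
    return usage


def dominant_queries(workload, share=0.8):
    """Smallest set of most frequent queries covering `share` of all invocations."""
    total = sum(frequency for _name, frequency, _attrs in workload)
    chosen, covered = [], 0
    for name, frequency, _attrs in sorted(workload, key=lambda q: q[1], reverse=True):
        if covered >= share * total:
            break
        chosen.append(name)
        covered += frequency
    return chosen


print(attribute_usage(WORKLOAD).most_common(3))
print(dominant_queries(WORKLOAD))  # ['Q1', 'Q2']
```

In this invented workload, two of the six queries account for more than 80 percent of the invocations, so their selection and join attributes are the first candidates for access structures.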

C. Analyzing the Time Constraints of Queries and Transactions. Some

queries and transactions may have stringent performance constraints. For example,

a transaction may have the constraint that it should terminate within 5 seconds on

95 percent of the occasions when it is invoked, and that it should never take more

than 20 seconds. Such timing constraints place further priorities on the attributes

that are candidates for access paths. The selection attributes used by queries and

transactions with time constraints become higher-priority candidates for primary

access structures for the files, because the primary access structures are generally the

most efficient for locating records in a file.

D. Analyzing the Expected Frequencies of Update Operations. A minimum

number of access paths should be specified for a file that is frequently updated,

because updating the access paths themselves slows down the update operations. For

example, if a file that has frequent record insertions has 10 indexes on 10 different

attributes, each of these indexes must be updated whenever a new record is inserted.

The overhead for updating 10 indexes can slow down the insert operations.

E. Analyzing the Uniqueness Constraints on Attributes. Access paths should

be specified on all candidate key attributes—or sets of attributes—that are either the

primary key of a file or unique attributes. The existence of an index (or other access

path) makes it sufficient to only search the index when checking this uniqueness

constraint, since all values of the attribute will exist in the leaf nodes of the index.

For example, when inserting a new record, if a key attribute value of the new record

already exists in the index, the insertion of the new record should be rejected, since it

would violate the uniqueness constraint on the attribute.
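The uniqueness check described in item E can be observed with a minimal Python sketch, using SQLite through the sqlite3 module as a stand-in for the DBMS. The index name SsnIndex, the sample rows, and the helper try_insert are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (Ssn TEXT, Lname TEXT)")
# The unique index is the access path consulted to enforce the constraint.
conn.execute("CREATE UNIQUE INDEX SsnIndex ON EMPLOYEE (Ssn)")
conn.execute("INSERT INTO EMPLOYEE VALUES ('123456789', 'Smith')")


def try_insert(ssn, lname):
    """Attempt an insertion; return False if it violates the uniqueness constraint."""
    try:
        conn.execute("INSERT INTO EMPLOYEE VALUES (?, ?)", (ssn, lname))
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("123456789", "Wong"))   # False: the key value already exists in the index
print(try_insert("333445555", "Wong"))   # True
print(conn.execute("SELECT COUNT(*) FROM EMPLOYEE").fetchone()[0])  # 2
```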

Once the preceding information is compiled, it is possible to address the physical

database design decisions, which consist mainly of deciding on the storage struc-

tures and access paths for the database files.

20.1.2 Physical Database Design Decisions

Most relational systems represent each base relation as a physical database file. The access path options include specifying the type of primary file organization for each relation and the attributes on which indexes should be defined. At most one of the indexes on each file may be a primary or a clustering index. Any number of additional secondary indexes can be created.

Footnote: The reader should review the various types of indexes described in Section 18.1. For a clearer understanding of this discussion, it is also helpful to be familiar with the algorithms for query processing discussed in Chapter 19.

Design Decisions about Indexing. The attributes that require access paths, such as indexes, are those whose values are required in equality or range conditions (selection operation), those that are keys, and those that participate in join conditions (join operation).

The performance of queries largely depends upon what indexes or hashing schemes

exist to expedite the processing of selections and joins. On the other hand, during

insert, delete, or update operations, the existence of indexes adds to the overhead.

This overhead must be justified in terms of the gain in efficiency by expediting

queries and transactions.

The physical design decisions for indexing fall into the following categories:

1. Whether to index an attribute. The general rules for creating an index on an

attribute are that the attribute must either be a key (unique), or there must

be some query that uses that attribute either in a selection condition (equal-

ity or range of values) or in a join condition. One reason for creating multi-

ple indexes is that some operations can be processed by just scanning the

indexes, without having to access the actual data file (see Section 19.5).

2. What attribute or attributes to index on. An index can be constructed on a single attribute, or on more than one attribute if it is a composite index. If multiple attributes from one relation are involved together in several queries (for example, (Garment_style_#, Color) in a garment inventory database), a multiattribute (composite) index is warranted. The ordering of attributes within a multiattribute index must correspond to the queries. For instance, the above index assumes that queries would be based on an ordering of colors within a Garment_style_# rather than vice versa.

3. Whether to set up a clustered index. At most one index per table can be a primary or clustering index, because this requires that the file be physically ordered on that attribute. In most RDBMSs, this is specified by the keyword CLUSTER. (If the attribute is a key, a primary index is created, whereas a clustering index is created if the attribute is not a key; see Section 18.1.) If a table requires several indexes, the decision about which one should be the primary or clustering index depends upon whether keeping the table ordered on that attribute is needed. Range queries benefit a great deal from clustering. If several attributes require range queries, relative benefits must be evaluated before deciding which attribute to cluster on. If a query is to be answered by doing an index search only (without retrieving data records), the corresponding index should not be clustered, since the main benefit of clustering is achieved when retrieving the records themselves. A clustering index may be set up as a multiattribute index if range retrieval by that composite key is useful in report creation (for example, an index on Zip_code, Store_id, and Product_id may be a clustering index for sales data).

4. Whether to use a hash index over a tree index. In general, RDBMSs use B+-trees for indexing. However, ISAM and hash indexes are also provided in some systems (see Chapter 18). B+-trees support both equality and range queries on the attribute used as the search key. Hash indexes work well with equality conditions, particularly during joins to find a matching record(s), but they do not support range queries.

5. Whether to use dynamic hashing for the file. For files that are very

volatile—that is, those that grow and shrink continuously—one of the

dynamic hashing schemes discussed in Section 17.9 would be suitable.

Currently, they are not offered by many commercial RDBMSs.
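The contrast drawn in item 4 between hash and tree indexes can be sketched in Python. Here a dict stands in for a hash index and a sorted list searched with bisect stands in for the ordered leaf level of a tree index; both are simplifications for illustration, not actual index implementations, and the sample records are invented.

```python
import bisect

# Hypothetical (Salary, Lname) records.
RECORDS = [(30000, "Smith"), (40000, "Wong"), (25000, "English"),
           (43000, "Wallace"), (38000, "Narayan")]

# Hash index stand-in: key -> matching records, with no ordering among keys.
hash_index = {}
for salary, lname in RECORDS:
    hash_index.setdefault(salary, []).append(lname)

# Ordered index stand-in: entries kept sorted on the search key.
ordered_index = sorted(RECORDS)
ordered_keys = [salary for salary, _ in ordered_index]


def equality_lookup_hash(key):
    """Equality condition: a single probe of the hash structure."""
    return hash_index.get(key, [])


def range_lookup_ordered(low, high):
    """Range condition: locate both ends by binary search, then read the run between."""
    start = bisect.bisect_left(ordered_keys, low)
    end = bisect.bisect_right(ordered_keys, high)
    return [lname for _, lname in ordered_index[start:end]]


def range_lookup_hash(low, high):
    """Range condition on a hash structure: every key must be examined."""
    return sorted(lname for key, names in hash_index.items()
                  if low <= key <= high for lname in names)


print(equality_lookup_hash(40000))         # ['Wong']
print(range_lookup_ordered(30000, 40000))  # ['Smith', 'Narayan', 'Wong']
print(range_lookup_hash(30000, 40000))     # ['Narayan', 'Smith', 'Wong']
```

The hash structure answers the range condition only by examining all of its keys, which is the sense in which hash indexes do not support range queries.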

How to Create an Index. Many RDBMSs have a similar type of command for

creating an index, although it is not part of the SQL standard. The general form of

this command is:

CREATE [ UNIQUE ] INDEX <index name>
ON <table name> ( <column name> [ <order> ] { , <column name> [ <order> ] } )
[ CLUSTER ] ;

The keywords UNIQUE and CLUSTER are optional. The keyword CLUSTER is used when the index to be created should also sort the data file records on the indexing attribute. Thus, specifying CLUSTER on a key (unique) attribute would create some variation of a primary index, whereas specifying CLUSTER on a nonkey (nonunique) attribute would create some variation of a clustering index. The value for <order> can be either ASC (ascending) or DESC (descending), and specifies whether the data file should be ordered in ascending or descending values of the indexing attribute. The default is ASC. For example, the following would create a clustering (ascending) index on the nonkey attribute Dno of the EMPLOYEE file:

CREATE INDEX DnoIndex
ON EMPLOYEE (Dno)
CLUSTER ;
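As a runnable illustration of the index-creation command and of the attribute-ordering point made for composite indexes, the following Python sketch uses SQLite through the sqlite3 module. Several assumptions apply: SQLite's CREATE INDEX has no CLUSTER clause, so that keyword is omitted; the GARMENT table and its column names Style, Color, and Quantity are hypothetical stand-ins for the garment inventory example; and the exact wording of the query plan text varies between SQLite versions, so only the index name is inspected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (Ssn TEXT, Lname TEXT, Dno INTEGER);
    CREATE INDEX DnoIndex ON EMPLOYEE (Dno ASC);

    CREATE TABLE GARMENT (Style INTEGER, Color TEXT, Quantity INTEGER);
    CREATE INDEX StyleColorIndex ON GARMENT (Style ASC, Color ASC);
""")


def plan(sql):
    """Return the textual query plan that SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # the last column holds the description


# Equality selection on the indexed nonkey attribute.
print(plan("SELECT * FROM EMPLOYEE WHERE Dno = 5"))

# The composite index matches a condition that starts with its leading attribute...
print(plan("SELECT * FROM GARMENT WHERE Style = 7 AND Color = 'red'"))

# ...but not a condition on the trailing attribute alone.
print(plan("SELECT * FROM GARMENT WHERE Color = 'red'"))
```

The first two plans name DnoIndex and StyleColorIndex respectively, while the third falls back to scanning the table, which reflects the requirement that the attribute order within a composite index correspond to the queries.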

Denormalization as a Design Decision for Speeding Up Queries. The ulti-

mate goal during normalization (see Chapters 15 and 16) is to separate attributes

into tables to minimize redundancy, and thereby avoid the update anomalies that

lead to an extra processing overhead to maintain consistency in the database. The

ideals that are typically followed are the third or Boyce-Codd normal forms (see

Chapter 15).

The above ideals are sometimes sacrificed in favor of faster execution of frequently

occurring queries and transactions. This process of storing the logical database

design (which may be in BCNF or 4NF) in a weaker normal form, say 2NF or 1NF,

is called denormalization. Typically, the designer includes certain attributes from a

table S into another table R. The reason is that the attributes from S that are

included in R are frequently needed—along with other attributes in R—for answer-

ing queries or producing reports. By including these attributes, a join of R with S is

avoided for these frequently occurring queries and reports. This reintroduces

redundancy in the base tables by including the same attributes in both tables R and

S. A partial functional dependency or a transitive dependency now exists in the table

R, thereby creating the associated redundancy problems (see Chapter 15). A tradeoff

exists between the additional updating needed for maintaining consistency of