Shen S., Tuszynski J.A. Theory and Mathematical Methods for Bioinformatics


1.3 Dynamic Programming-Based Algorithms for Pairwise Alignment 17

For the alignment of protein sequences, we usually adopt a scoring matrix. Since the gene sequence of a protein is complex, the PAM and BLOSUM matrices are used to obtain the required scoring matrix. We will discuss this in the corresponding chapters on protein sequence alignment.

We should note that the optimal alignment may not be unique for a given sequence pair (A, B) under a given penalty matrix W. This is demonstrated in Example 2.

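Example 2. If

$$A = (000132), \qquad B = (00132),$$

then

$$A'_1 = (000132), \quad B'_1 = (400132),$$
$$A'_2 = (000132), \quad B'_2 = (040132),$$
$$A'_3 = (000132), \quad B'_3 = (004132)$$

are the alignments of (A, B) with the minimum penalty score, where 4 denotes the virtual symbol. Their penalty scores are the same,

$$d(A'_1, B'_1) = d(A'_2, B'_2) = d(A'_3, B'_3) = 1,$$

and, because B is one symbol shorter than A, every alignment of (A, B) must contain at least one penalized virtual symbol. Hence we cannot find an alignment with a smaller penalty score.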

For simplicity, in the following, the term alignment always refers to the minimum penalty-based alignment unless otherwise specified. Obviously, the corresponding conclusions also apply to the maximum scoring-based alignment.
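Example 2 can be checked with a short script. The sketch below assumes the Hamming penalty (1 for every position where the two aligned symbols differ, including positions holding the virtual symbol 4) and, for brevity, only enumerates alignments that insert a single virtual symbol into B. The function name `hamming_penalty` is illustrative and not taken from the text.

```python
def hamming_penalty(x, y):
    """Number of positions at which two equal-length sequences differ."""
    assert len(x) == len(y)
    return sum(1 for p, q in zip(x, y) if p != q)

A = "000132"
B = "00132"
GAP = "4"  # the virtual symbol, written as 4 in Example 2

# Insert one virtual symbol into B at every possible position.
candidates = [B[:k] + GAP + B[k:] for k in range(len(B) + 1)]
penalties = {b: hamming_penalty(A, b) for b in candidates}

best = min(penalties.values())
optimal = [b for b in candidates if penalties[b] == best]
print(best, optimal)  # 1 ['400132', '040132', '004132']
```

Under these assumptions exactly three of the six candidates reach the minimum penalty 1, which matches the three alignments listed in the example.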

1.3 Dynamic Programming-Based Algorithms for Pairwise Alignment

1.3.1 Introduction to Dynamic Programming-Based Algorithms

Dynamic programming-based algorithms are the standard method for solving optimization problems and are broadly applied in many fields. The validity of a dynamic programming-based algorithm depends on whether or not the problem to be solved has an optimal substructure, that is, on whether or not the problem satisfies the principle of optimality. A problem has an optimal substructure if every optimal solution of the entire problem, restricted to a substructure, is also an optimal solution of that substructure. The optimal alignment problem has this property. For example, let

$$A' = (a'_1, \cdots, a'_{n'}), \qquad B' = (b'_1, \cdots, b'_{n'})$$

18 1 Introduction

be the optimal alignment of the given sequence pair

$$A = (a_1, \cdots, a_{n_a}), \qquad B = (b_1, \cdots, b_{n_b}).$$

Then the penalty score (see (1.8))

$$w(A', B') = \sum_{i=1}^{n'} w(a'_i, b'_i)$$

is a minimum, where $w(a'_i, b'_i)$ is the penalty score of $a'_i$ and $b'_i$ given by the penalty matrix. For any fixed position $n_0$, the penalty splits as

$$w(A', B') = \sum_{i=1}^{n_0} w(a'_i, b'_i) + \sum_{i=n_0+1}^{n'} w(a'_i, b'_i).$$

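Therefore, the pair of subsequences

$$A'_{(0,n_0)} = (a'_1, \cdots, a'_{n_0}), \qquad B'_{(0,n_0)} = (b'_1, \cdots, b'_{n_0})$$

also represents an optimal alignment of the parts of A and B that it covers. Otherwise, replacing it by a better alignment of those parts would lower the total penalty and contradict the optimality of $A'$ and $B'$. Thus, we may use a dynamic programming-based algorithm to search for the optimal alignment.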

Dynamic programming-based algorithms have long been used successfully in bioinformatics to perform alignment. In 1970, Needleman and Wunsch proposed the global alignment algorithm [69]. In 1981, Smith and Waterman gave the mathematical proof [95] and adapted the algorithm to local alignment. The time complexity and space complexity of both are $O(n^2)$. Although the time complexity still cannot be reduced, many improved algorithms have been proposed [16–19] that may greatly reduce the space complexity, from $O(n^2)$ to $O(n)$.
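One simple way to see where a linear-space bound can come from is that each row of the dynamic programming table depends only on the previous row. The sketch below is not necessarily the method of [16–19]; it assumes the global alignment recurrence (1.11) of Sect. 1.3.2 and, as default parameters, the scores used in Example 3 (5 for a match, −3 for a mismatch, d = 7). It returns only the optimal score; recovering the alignment itself in linear space needs additional work that is not shown here.

```python
def nw_score_linear_space(a, b, match=5, mismatch=-3, d=7):
    """Global alignment score, keeping only two rows of the table in memory."""
    prev = [-j * d for j in range(len(b) + 1)]       # row for i = 0
    for i in range(1, len(a) + 1):
        curr = [-i * d] + [0] * len(b)               # margin entry, then placeholders
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + pair,        # align a_i with b_j
                          prev[j] - d,               # a_i against a virtual symbol
                          curr[j - 1] - d)           # b_j against a virtual symbol
        prev = curr
    return prev[-1]

print(nw_score_linear_space("aaattagc", "gtatatact"))  # 1
```

The memory used is proportional to the length of the second sequence, while the running time remains proportional to the product of the two lengths.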

1.3.2 The Needleman–Wunsch Algorithm, the Global Alignment Algorithm

The Needleman–Wunsch algorithm is a global alignment algorithm for a pair

of sequences. Its procedure is as follows:

Arrange the Two Sequences in a Two-Dimensional Table

If the sequences are

$$A = (a_1, \cdots, a_n), \qquad B = (b_1, \cdots, b_m),$$

then the two-dimensional table is constructed as in Table 1.1, in which the element $s(i, j)$ is calculated in step 2.


Table 1.1. Two-dimensional table of sequences A, B

                a_1      a_2      ...   a_n
       s(0,0)   s(1,0)   s(2,0)   ...   s(n,0)
b_1    s(0,1)   s(1,1)   s(2,1)   ...   s(n,1)
b_2    s(0,2)   s(1,2)   s(2,2)   ...   s(n,2)
...    ...      ...      ...      ...   ...
b_m    s(0,m)   s(1,m)   s(2,m)   ...   s(n,m)

Calculate the Elements s(i, j) of the Two-Dimensional Table

Each element $s(i, j)$ of the two-dimensional table is determined by three elements: $s(i-1, j-1)$ in the upper left corner, $s(i-1, j)$ on the left side and $s(i, j-1)$ on top. First of all, we determine the marginal scores $s(i, 0)$ and $s(0, j)$. For simplicity, we assume that the penalty score of a string of virtual symbols is $d \times |\text{virtual symbol}|$ if the penalty score of a virtual symbol is $d$, where $|\text{virtual symbol}|$ is the length of the string of virtual symbols. Thus, letting $s(0, 0) = 0$,

$$s(0, j) = -j \times d, \qquad s(i, 0) = -i \times d.$$

Then, we calculate $s(i, j)$ using the formula

$$s(i, j) = \max\{\, s(i-1, j-1) + s(a_i, b_j),\; s(i-1, j) - d,\; s(i, j-1) - d \,\}. \tag{1.11}$$

Figure 1.1 illustrates the computation of s(i, j).

While calculating $s(i, j)$, we should also record which of the three neighbors of $s(i, j)$ produced the maximum. This record will be used to produce the backward pathway of the traceback algorithm in the next step.
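The table-filling step can be sketched as follows. The function name `nw_table` and the convention that `s[i][j]` stores $s(i, j)$ (first index along A, second along B) are choices made for this illustration; the pair score is passed in as a function so that any scoring matrix can be plugged in.

```python
def nw_table(a, b, score, d):
    """Fill the table of (1.11); s[i][j] holds s(i, j) for 0 <= i <= n, 0 <= j <= m."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # marginal scores s(i, 0) = -i * d
        s[i][0] = -i * d
    for j in range(1, m + 1):          # marginal scores s(0, j) = -j * d
        s[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # upper left corner
                s[i - 1][j] - d,                              # left side
                s[i][j - 1] - d,                              # top
            )
    return s

# Illustrative scores: 5 for a match, -3 for a mismatch, d = 7.
pair_score = lambda x, y: 5 if x == y else -3
table = nw_table("aaattagc", "gtatatact", pair_score, d=7)
print(table[1][1], table[8][9])  # -3 1
```

With the sequences and scores of Example 3 this reproduces the values quoted there, $s(1, 1) = -3$ and a final entry equal to 1.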

Traceback Algorithm

Fig. 1.1. Calculation of s(i, j)

The last value $s(n, m)$ is the maximum score of the sequences (A, B) after being aligned, and it is the starting point of the backward pathway. For each $s(i, j)$, the backward direction is recorded in the process of calculating Table 1.1. For example, if $s(i, j) = s(i-1, j-1) + s(a_i, b_j)$, then the backward pathway is $(i, j) \to (i-1, j-1)$. Proceeding from $s(n, m)$ back to the end point $s(0, 0)$, we find the backward pathway. We may then recover the alignment of the sequences according to the backward pathway as follows. For the element $s(i, j)$ on the backward pathway:

1. Record the corresponding pair of nucleic acids $(a_i, b_j)$ if the backward direction is from $(i, j)$ to its upper left corner.
2. Insert a virtual symbol in the vertical sequence and record $(a_i, -)$ if the direction is horizontal.
3. Insert a virtual symbol in the horizontal sequence and record $(-, b_j)$ if the direction is vertical.
4. Finally, we obtain the optimal alignment of the two sequences.

The reader should note that the backward pathway may not be unique, since more than one of the three candidates in (1.11) may attain the maximum. In that case there are several optimal alignments with the same optimal score.

Example 3. Consider the sequences

$$A = \text{aaattagc}, \qquad B = \text{gtatatact}.$$

We will use the dynamic programming-based algorithm to obtain the alignment. If the score is 5 for matching, −3 for not matching, and −7 for inserting a virtual symbol, that is,

$$s(a_i, b_j) = \begin{cases} 5, & \text{if } a_i = b_j, \\ -3, & \text{otherwise}, \end{cases}$$

and $d = 7$, then:

1. Build a two-dimensional table and calculate the value of each element. The values of the elements in the first row are defined by $s(i, 0) = -i \times d$ and the values of the elements in the first column are defined by $s(0, j) = -j \times d$. Following the steps for calculating $s(i, j)$, we obtain the values of all $s(i, j)$ and record the backward directions. For example, $s(1, 1) = \max(0 - 3, -7 - 7, -7 - 7) = -3$ and the backward direction is $(1, 1) \to (0, 0)$. The results are shown in Table 1.2. According to Table 1.2, the value of the last element is 1, so the score of the optimal alignment of sequences A, B is 1.
2. Traceback: We go backward from $s(8, 9)$. As the value $s(8, 9) = 1$ is obtained from its top left element $s(7, 8)$, that is, $s(8, 9) = s(7, 8) + s(c, t) = 4 - 3 = 1$, we first backtrack to the top left element: $(8, 9) \to (7, 8)$. This is repeated until the backtracking path reaches $s(0, 0)$.


Table 1.2. Two-dimensional table formed by the sequence A, B

Fig. 1.2. Backtracking path

The backward pathway is

$$(8, 9) \to (7, 8) \to (6, 7) \to (5, 6) \to (4, 5) \to (4, 4) \to (3, 3) \to (2, 2) \to (1, 1) \to (0, 0).$$

Figure 1.2 shows a schematic representation of the backtracking path. According to the backward pathway, we can recover the result of the alignment as follows:

$$A' = (\text{aaat-tagc}), \qquad B' = (\text{gtatatact}).$$
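The steps of Example 3 can be reproduced with a minimal Needleman–Wunsch sketch. The function name `needleman_wunsch`, the `back` table of stored predecessors, and the tie-breaking rule (the diagonal is preferred when several candidates attain the maximum) are assumptions of this illustration rather than details given in the text.

```python
def needleman_wunsch(a, b, match=5, mismatch=-3, d=7):
    """Global alignment by (1.11); returns (score, aligned a, aligned b, pathway)."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]   # predecessor of each cell
    for i in range(1, n + 1):
        s[i][0], back[i][0] = -i * d, (i - 1, 0)
    for j in range(1, m + 1):
        s[0][j], back[0][j] = -j * d, (0, j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            options = [
                (s[i - 1][j - 1] + pair, (i - 1, j - 1)),  # upper left corner
                (s[i - 1][j] - d, (i - 1, j)),             # a_i against "-"
                (s[i][j - 1] - d, (i, j - 1)),             # "-" against b_j
            ]
            # max returns the first maximal option, so ties prefer the diagonal
            s[i][j], back[i][j] = max(options, key=lambda o: o[0])

    # Traceback from s(n, m) to s(0, 0)
    i, j = n, m
    path, top, bottom = [(i, j)], [], []
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        top.append(a[i - 1] if pi < i else "-")
        bottom.append(b[j - 1] if pj < j else "-")
        i, j = pi, pj
        path.append((i, j))
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom)), path

score, top, bottom, path = needleman_wunsch("aaattagc", "gtatatact")
print(score)   # 1
print(top)     # aaat-tagc
print(bottom)  # gtatatact
```

For these inputs the returned pathway coincides with the one listed above, including the single vertical step $(4, 5) \to (4, 4)$ that inserts the virtual symbol into A.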

1.3.3 The Smith–Waterman Algorithm

In bioinformatics, the role played by global alignment is limited because of the complexity of biological sequences. Since global optimizing algorithms always ignore local properties, we are sometimes concerned not with global properties but with whether or not the two sequences have similar conserved regions. For example, two sequences with low global similarity may have domains which are highly homologous. Therefore, finding alignment algorithms that target these "domains" with a minimal penalty score would be more useful in practice.

The Smith–Waterman algorithm is a local alignment algorithm. Although it may seem to be merely a modification of the dynamic programming-based algorithm that adapts it to local alignment, it is widely used in bioinformatics. For example, BLAST, a well-known software package, has been developed based on this algorithm. The two aspects in which the Smith–Waterman algorithm differs from the global alignment algorithm are as follows.

Calculation of the Values in a Two-Dimensional Table

The Smith–Waterman algorithm adds a 0 to the candidates when calculating $s(i, j)$. Thus, a negative score will never occur in the Smith–Waterman algorithm. The advantage of this will become clear when constructing the backward pathway.

$$s(i, j) = \max \begin{cases} 0, \\ s(i-1, j-1) + s(a_i, b_j), \\ s(i-1, j) - d, \\ s(i, j-1) - d. \end{cases} \tag{1.12}$$
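A minimal sketch of the table defined by (1.12) is given below. It assumes zero marginal scores, $s(i, 0) = s(0, j) = 0$, which is the usual choice for local alignment but is not stated explicitly in the text, and it uses illustrative default scores (5 for a match, −3 for a mismatch, d = 7). The names `sw_table` and `argmax_cell` are hypothetical.

```python
def sw_table(a, b, match=5, mismatch=-3, d=7):
    """Fill the local alignment table of (1.12); s[i][j] holds s(i, j)."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]   # margins stay 0 (assumption)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(0,                        # the added candidate
                          s[i - 1][j - 1] + pair,
                          s[i - 1][j] - d,
                          s[i][j - 1] - d)
    return s

def argmax_cell(s):
    """Return (value, i, j) for a cell holding the largest score."""
    return max((s[i][j], i, j) for i in range(len(s)) for j in range(len(s[0])))

print(sw_table("ac", "gc"))  # [[0, 0, 0], [0, 0, 0], [0, 0, 5]]
```

Because 0 is always among the candidates, every entry is non-negative, and a run of mismatches simply resets the score to 0 instead of dragging later entries down.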

Traceback Algorithm

The start and end points of the traceback of the Smith–Waterman algorithm are different from those of the global alignment algorithm. The starting point can in theory be chosen arbitrarily, and we usually choose elements with a high score. The end point is the first element with the value 0 met during the traceback. If the purpose of the alignment is to find the optimal local alignment of two sequences, the Smith–Waterman traceback should start from the element with the maximum score and should end at the first element where the value 0 appears, rather than at $s(0, 0)$. Starting at the element with the maximal score guarantees the maximal score of the local sequence alignment, and ending at the first element with the value 0 ensures that the aligned segment is not extended further. The segment corresponding to the backward pathway is then the segment with the maximum score.

We use the same example

$$A = \text{aaattagc}, \qquad B = \text{gtatatact},$$

to find the optimal alignment of subsequences. The score is 5 for matching and −3 for mismatching, and $d = 7$. The construction of the two-dimensional table and the calculation of the alignment of the two sequences are shown in Fig. 1.3.


Fig. 1.3. Backtracking path

The maximum score in this table is 13. Thus, we begin at the corresponding element $s(6, 7)$ and stop at $s(2, 2)$, which is the first element with the value 0. We then obtain the backward pathway as follows:

$$(6, 7) \to (5, 6) \to (4, 5) \to (4, 4) \to (3, 3) \to (2, 2).$$

According to this backward pathway, we obtain the following alignment of segments with the maximal score:

at-ta
atata
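The local alignment example can be reproduced with the sketch below. It assumes zero marginal scores and the parameters 5 for a match, −3 for a mismatch and d = 7; the value d = 7 is the one consistent with the maximum score 13 and the pathway quoted above. The function name `smith_waterman` and the order in which the three traceback candidates are tested are choices made for this illustration.

```python
def smith_waterman(a, b, match=5, mismatch=-3, d=7):
    """Local alignment by (1.12); returns (score, segment of a, segment of b, pathway)."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]    # zero margins (assumption)
    best, start = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(0, s[i - 1][j - 1] + pair, s[i - 1][j] - d, s[i][j - 1] - d)
            if s[i][j] > best:                   # remember the maximal element
                best, start = s[i][j], (i, j)

    # Traceback from the maximal element to the first element with value 0
    i, j = start
    path, top, bottom = [(i, j)], [], []
    while s[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if s[i][j] == s[i - 1][j - 1] + pair:    # came from the upper left corner
            top.append(a[i - 1]); bottom.append(b[j - 1]); i, j = i - 1, j - 1
        elif s[i][j] == s[i - 1][j] - d:         # a_i against a virtual symbol
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:                                    # virtual symbol against b_j
            top.append("-"); bottom.append(b[j - 1]); j -= 1
        path.append((i, j))
    return best, "".join(reversed(top)), "".join(reversed(bottom)), path

best, seg_a, seg_b, path = smith_waterman("aaattagc", "gtatatact")
print(best)    # 13
print(seg_a)   # at-ta
print(seg_b)   # atata
```

The pathway returned for these inputs starts at $(6, 7)$ and ends at $(2, 2)$, the first element with the value 0.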

Discussion of Dynamic Programming-Based Algorithms

Some notes about the dynamic programming-based algorithms are given below:

1. Different penalty matrices produce different alignments, so the choice of an appropriate penalty matrix is very important to the dynamic programming algorithm. Some penalty matrices are appropriate for global alignment and some for local alignment. In the extreme case where there is no negative penalty and the virtual symbol also gives no penalty (choosing the Hamming penalty matrix), the result of the local alignment algorithm is almost the same as that of the global alignment algorithm.
2. For a pair of sequences whose lengths are n and m, respectively, the space complexity and time complexity are both $O(nm)$. If the lengths of the two sequences are approximately equal, then these complexities are close to $O(n^2)$. The computation is therefore feasible for short sequences. However, for longer sequences, such as genome sequences, this computational complexity makes the problem computationally intractable for present-day computers. If many pairwise alignments must be performed while doing multiple alignments, the scope of applications of the dynamic programming-based algorithm is restricted. The fact that the time complexity cannot be reduced is a major disadvantage of the dynamic programming-based algorithm, although many improved algorithms [16–19] may reduce the space complexity to as low as $O(n)$.
3. One of the purposes of this book is to show how to create an alignment algorithm using stochastic analysis, so that the time complexity may be reduced to as little as $O(n)$ for pairwise alignment. Therefore, we will not discuss dynamic programming-based algorithms further. The reader is referred to the relevant literature for further insights.
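To put the $O(nm)$ bound of item 2 in perspective, the following back-of-the-envelope estimate assumes two sequences of one million symbols each and 4 bytes per table entry. Both figures are illustrative assumptions, not values from the text.

$$n = m = 10^{6} \;\Rightarrow\; nm = 10^{12} \text{ entries} \approx 4 \times 10^{12} \text{ bytes} \approx 4 \text{ TB},$$
$$2(m + 1) \approx 2 \times 10^{6} \text{ entries} \approx 8 \times 10^{6} \text{ bytes} \approx 8 \text{ MB}.$$

The second line corresponds to keeping only two rows of the table, which is enough to compute the optimal score but not, by itself, to carry out the traceback. The number of elementary steps remains about $10^{12}$ in both cases.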

1.4 Other Notations

There are some notations that arise frequently in this book when discussing alignment, and we will address them specifically now.

1.4.1 Correlation Functions of Local Sequences

Local Sequences

Let A, B, C be the three sequences given in (1.1), and let $n_a, n_b, n_c$ be their lengths, respectively. Let $N_a = \{1, 2, \cdots, n_a\}$ be the set of the subscripts of A. An element $i \in N_a$ is a subscript (or position for short) of sequence A. Subsets of $N_a$ are represented by $\alpha, \beta$. If

$$\alpha = \{i_1, \cdots, i_k\} \tag{1.13}$$

is a subset of $N_a$ arranged from the smallest to the largest number, $1 \le i_1 < \cdots < i_k \le n_a$, then

$$a_\alpha = \{a_{i_1}, \cdots, a_{i_k}\} \tag{1.14}$$

is a subsequence of A in the region $\alpha$.

If $N_a$ and $\alpha$ are both given, then we denote by $\alpha^c = N_a - \alpha$ the complement of the set $\alpha$, which is again a subset of $N_a$. Thus, $a_{\alpha^c}$ is a subsequence of A, and $A = (a_\alpha, a_{\alpha^c})$ is referred to as the decomposition of A. As a special case, let $\alpha = [i, j]$, $(i, j)$, $[i, j)$ or $(i, j]$, where

$$[i, j] = (i, i+1, \cdots, j), \qquad (i, j) = (i+1, i+2, \cdots, j-1),$$
$$[i, j) = (i, i+1, \cdots, j-1), \qquad (i, j] = (i+1, i+2, \cdots, j).$$


These are the closed interval, open interval and half-open intervals of $N_a$, respectively. The corresponding vectors are then

$$a_{[i,j]} = (a_i, a_{i+1}, \cdots, a_j), \qquad a_{(i,j)} = (a_{i+1}, \cdots, a_{j-1}),$$
$$a_{[i,j)} = (a_i, a_{i+1}, \cdots, a_{j-1}), \qquad a_{(i,j]} = (a_{i+1}, \cdots, a_j). \tag{1.15}$$

The subsequences of (1.15) are referred to as the local vectors defined on the intervals $[i, j]$, $(i, j)$, $[i, j)$ or $(i, j]$. We use the following notation for the local vectors:

$$\bar{a} = a^{(k)} = (a_{i+1}, a_{i+2}, \cdots, a_{i+k}), \tag{1.16}$$

where i denotes the first position of vector $\bar{a}$ and k denotes the length of $\bar{a}$. For simplicity, we consider the three symbols $\bar{a}$, $a$ and $a^{(k)}$ to be equivalent. The length of vectors $\bar{a}$ and $a$ is always k unless otherwise specified.
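The index conventions above are easy to get wrong by one position, so a small sketch may help. It represents a sequence as a Python string, uses 1-based positions as in the text, and introduces three hypothetical helper names: `subsequence` for $a_\alpha$ in (1.14), `interval` for the four interval types, and `local_vector` for (1.16).

```python
def subsequence(a, alpha):
    """a_alpha of (1.14): the symbols of a at the 1-based positions in alpha."""
    return "".join(a[i - 1] for i in sorted(alpha))

def interval(i, j, kind="[]"):
    """Positions of [i, j], (i, j), [i, j) or (i, j]; kind is '[]', '()', '[)' or '(]'."""
    lo = i if kind[0] == "[" else i + 1
    hi = j if kind[1] == "]" else j - 1
    return range(lo, hi + 1)

def local_vector(a, i, k):
    """(a_{i+1}, ..., a_{i+k}) as in (1.16)."""
    return a[i:i + k]

A = "aaattagc"
N_a = set(range(1, len(A) + 1))
alpha = {1, 4, 8}

print(subsequence(A, interval(2, 4, "[]")))  # aat
print(subsequence(A, interval(2, 4, "()")))  # a
print(subsequence(A, alpha))                 # atc
print(subsequence(A, N_a - alpha))           # aatag
print(local_vector(A, 3, 3))                 # tta
```

The pair printed for `alpha` and its complement `N_a - alpha` is the decomposition of A described above.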

Correlation Functions

The local correlation function of sequences A, B based on a penalty matrix w is defined as follows:

$$w(A, B; i, j, n) = \sum_{k=1}^{n} w(a_{i+k}, b_{j+k}), \qquad i + n \le n_a, \quad j + n \le n_b. \tag{1.17}$$

In the case B = A, the local correlation function in (1.17) becomes the local autocorrelation function of A.
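A direct transcription of (1.17) is sketched below. The function name `local_correlation` is hypothetical, and the Hamming penalty used in the usage lines (0 for equal symbols, 1 otherwise) is an assumption chosen only to keep the numbers easy to check by hand.

```python
def local_correlation(a, b, i, j, n, w):
    """w(A, B; i, j, n) of (1.17): sum of w(a_{i+k}, b_{j+k}) for k = 1, ..., n."""
    if i + n > len(a) or j + n > len(b):
        raise ValueError("requires i + n <= n_a and j + n <= n_b")
    # a_{i+k} is a[i + k - 1] with 0-based Python indexing
    return sum(w(a[i + k - 1], b[j + k - 1]) for k in range(1, n + 1))

hamming = lambda x, y: 0 if x == y else 1   # illustrative penalty matrix

A = "aaattagc"
B = "gtatatact"
print(local_correlation(A, B, 2, 2, 2, hamming))  # 0
print(local_correlation(A, B, 0, 0, 8, hamming))  # 5
print(local_correlation(A, A, 0, 1, 3, hamming))  # 1
```

The last call compares A with a shifted copy of itself and is therefore a value of the local autocorrelation function.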

1.4.2 Pairwise Alignment Matrices Among Multiple Sequences

We have mentioned above that minimum penalty multiple sequence alignment is an unsolved problem in bioinformatics, although a fast algorithm for pairwise alignment has been determined. Therefore, we discuss the pairwise alignment within multiple sequences before moving on to multiple alignments. Let

$$\mathcal{B} = \{B_{s,t},\; s, t = 1, 2, \cdots, m\} = (B_{s,t})_{s,t=1,2,\cdots,m} \tag{1.18}$$

be the sequence matrix, in which each $B_{s,t} = (b_{s,t;1}, b_{s,t;2}, \cdots, b_{s,t;n_{s,t}})$ is a five-dimensional vector. That is, for any s, t, j, there is a $b_{s,t;j} \in V_5$.

Definition 4.

1. The matrix $\mathcal{B}$ in (1.18) is referred to as the pairwise expansion of the multiple sequence $\mathcal{A}$ if $(B_{s,t}, B_{t,s})$ is the expansion of the sequence pair $(A_s, A_t)$ for each s, t.
2. The matrix $\mathcal{B} = (B_{s,t})_{s,t=1,2,\cdots,m}$ of (1.18) is referred to as the pairwise minimum penalty alignment matrix for the multiple sequence $\mathcal{A}$ if $\mathcal{B}$ is the pairwise expansion matrix of $\mathcal{A}$ and $(B_{s,t}, B_{t,s})$ is the minimum penalty alignment of $(A_s, A_t)$ for all s, t. Here,

$$w(B_{s,t}, B_{t,s}) = \min\{\, w(B'_s, B'_t) : (B'_s, B'_t) \text{ is an alignment of } (A_s, A_t) \,\} \tag{1.19}$$

holds, and

$$w(B_{s,t}, B_{t,s}) = \sum_{j=1}^{n_{s,t}} w(b_{s,t;j}, b_{t,s;j}). \tag{1.20}$$

Following the fast pairwise alignment algorithm, we can determine the pairwise minimum penalty alignment matrix $\mathcal{B}$ for the multiple sequence $\mathcal{A}$. One of the purposes of this book is to demonstrate how to use $\mathcal{B}$ to construct the minimum penalty alignment of $\mathcal{A}$.
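The structure of the matrix in Definition 4 can be illustrated with a small sketch. It does not use the fast pairwise algorithm referred to above; instead it swaps in a plain minimum penalty dynamic programming alignment, with a Hamming penalty and a penalty d = 1 per virtual symbol as illustrative assumptions. The virtual symbol is written "-" here, and the names `min_penalty_alignment` and `pairwise_alignment_matrix` are hypothetical.

```python
GAP = "-"  # virtual symbol

def min_penalty_alignment(a, b, w, d=1):
    """Return (expansion of a, expansion of b, penalty) with minimum total penalty."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * d
    for j in range(1, m + 1):
        cost[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                             cost[i - 1][j] + d,
                             cost[i][j - 1] + d)
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:   # traceback to (0, 0)
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + d:
            top.append(a[i - 1]); bottom.append(GAP); i -= 1
        else:
            top.append(GAP); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), cost[n][m]

def pairwise_alignment_matrix(seqs, w, d=1):
    """B[s][t] is the expansion of seqs[s] aligned against seqs[t]; W holds the penalties."""
    m = len(seqs)
    B = [[None] * m for _ in range(m)]
    W = [[0] * m for _ in range(m)]
    for s in range(m):
        B[s][s] = seqs[s]
        for t in range(s + 1, m):
            B[s][t], B[t][s], W[s][t] = min_penalty_alignment(seqs[s], seqs[t], w, d)
            W[t][s] = W[s][t]
    return B, W

hamming = lambda x, y: 0 if x == y else 1
seqs = ["000132", "00132", "00032"]
B, W = pairwise_alignment_matrix(seqs, hamming)
print(W)  # [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
```

Each pair $(B_{s,t}, B_{t,s})$ has equal length, and deleting the virtual symbols from $B_{s,t}$ returns the original sequence $A_s$, which is what makes the matrix a pairwise expansion in the sense of Definition 4.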

1.5 Remarks

The mathematical methods introduced in this book are suited to students of bioinformatics and computational biology. Additionally, this book refers to some important databases, such as GenBank [10], PDB [13], PDB-Select [41] and Swiss-Prot [8, 33]. The reader is assumed to be familiar with these databases in the following respects:

1. Know the Web sites that provide the databases and how up to date these databases are.
2. Know the content of these databases. For example, representations of primary structure, secondary structure, and 3D structure may be found in the PDB database; representations of genes, introns, and exons are in GenBank, etc.
3. Know how to obtain the required data when using computers for analysis, e.g., know how to use the computer to obtain the primary structure, secondary structure, and space coordinates of a given atom based on the PDB database.
4. Know how to use the corresponding databases for other requirements [99].

Besides databases, the reader should also know how to use some popular software packages, for example, BLAST [3], FASTA [73–75] and other specialized software packages (such as the software package for multiple alignment) that will be referred to later. For visualization, we recommend RASWIN [83], which may be used to view the 3D configurations of proteins. It is superior to other packages in some respects, such as rotating, moving and marking objects. It is available as a free download from its Web site [76].