Shen S., Tuszynski J.A. Theory and Mathematical Methods for Bioinformatics


1.1 Mutation and Alignment 7

number of biological phenomena which cannot be explained at this time. For

example, the mechanism of mutation within biological sequences is yet to be

fully explored. Mutations can lead to the growth and death of cells, and may

also lead to disease. Sequence alignment is an important tool in the analysis

of the positions and types of mutations hidden in biological sequences, and

allows an exact comparison. The earliest evidence that mutations may cause

tumor growth was found in 1983, when it was shown that cancer is the re-

sult of uncontrolled cell growth in an organism and that this growth is often

due to a mutation. Sequence alignment is also important in that it can be

used to research genetic diseases and epidemics. For example, it is possible

to determine the origin, variety, variance, diﬀusion, and development of an

epidemic, and then ﬁnd the viruses and bacteria responsible and appropri-

ate medication. Thus, sequence alignment is very important in the ﬁelds of

both bioinformatics and biomedicine due to its powerful predictive function.

In order to obtain better high-level alignment algorithms, more mathematical

theories are required.

Sequence alignment has many applications in bioinformatics besides its

direct use in comparing the structures of proteins. Some typical applications

are listed as follows:

Gene Positioning and Gene Searching

For a given gene in a certain organism, we must consider whether that same

gene (or a similar gene) may be found in another organism. These are the

basic problems of gene searching and gene positioning based on the GenBank

database. Gene positioning and gene searching are the basis of gene analysis.

A better method of gene searching would allow the development of a more ac-

curate alignment algorithm. Many alignment software packages have been de-

veloped based on the principles of gene searching and gene positioning, and are

used frequently in bioinformatics, such as BLAST [3,67] and FASTA [73–75].

According to some reports, BLAST is visited by more than 100,000 visitors

per day, lending credence to the statement that sequence alignment is widely

used in the study of bioinformatics.

Gene Repeating and Gene Crossing and Splicing

Gene repeating and gene splicing frequently occur within the same organ-

ism’s genome, which has become obvious as the genomes of various organisms

(including humans) are sequenced. Gene repeating refers to the repetition of

a long DNA segment within the same genome. The length of some segments

may be in the millions of base pairs (bp), and the number of repetitions may

also be in the millions. These repetitions may not be identical, but are typi-

cally similar. They may therefore be found through alignment algorithms.

A gene usually is composed of segments of several diﬀerent genes. These

segments are called exons, and the intervals are called introns. When a gene

8 1 Introduction

translates into a protein, the introns are cut oﬀ, and the exons array in re-

verse order. This phenomenon is known as gene crossing, and its analysis also

depends on alignment algorithms.

Genome Splicing

In the process of gene sequencing, a long sequence of a chromosome is ﬁrst

cut into pieces, the individual DNA segments are then sequenced indepen-

dently, and the segments will be assembled together. That is, the nucleotides

of the entire chromosome are not sequenced simultaneously. To assemble the

segments properly, many copies of a genome are cut into random segments,

which are then sequenced independently. The common information (found

through alignment) is then used to assemble a complete genome.

Other Applications

It is diﬃcult to identify and search the introns and exons of a eukaryote while

identifying and searching the regulation genes. As a result, many identiﬁca-

tion methods have emerged. Among these, alignment-based methods are very

signiﬁcant.

In summary, it is important to note that mutations and alignments are not

used simply to study biological evolution but may also be used to study the

relationships among genes, proteins and biological macromolecular structure

in living systems.

1.1.3 Progress on Alignment Algorithms and Problems to Be Solved

Researchers today are performing sequence alignments and database searches

billions of times per day. Due to their importance, alignment methods should

be familiar to all biologists and researchers in the ﬁeld of bioinformatics. Align-

ment methods must also be continually updated to address new challenges as

they arise in the life sciences. We now brieﬂy review the progress made in

sequence alignment algorithms, as well as the challenges involved.

Progress in Pairwise Sequence Alignment

The method of dynamic programming-based alignment was first proposed by Needleman and Wunsch [69]. It involves drawing a grid with one sequence written across the page and the other down the left-hand side, scoring matches between the sequences (or penalizing mismatches), and bridging gaps with an inserted virtual symbol. The alignment with the highest possible score (or the lowest possible penalty) is defined as the optimal alignment. In 1981, Smith and Waterman [95] developed an important modification of the dynamic programming-based algorithm, referred to as the local optimal alignment or the Smith–Waterman algorithm. Segments of this local optimal alignment can be determined independently, and a global optimal sequence alignment is obtained by connecting the segments. Although both approaches are dynamic programming-based algorithms, the Smith–Waterman algorithm greatly simplifies the Needleman–Wunsch algorithm.
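The grid-filling idea described above can be sketched in a few lines of Python. This is only a minimal illustration of global (Needleman–Wunsch style) alignment; the function name `needleman_wunsch` and the scores `match=1`, `mismatch=-1`, `gap=-1` are assumptions chosen for the example and are not taken from the text.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    The scoring values are illustrative assumptions, not values from the text.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for placing a[i-1] opposite b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # match or mismatch
                score[i - 1][j] + gap,            # gap inserted in b
                score[i][j - 1] + gap,            # gap inserted in a
            )

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For instance, aligning "acgt" with "agt" under these assumed scores gives three matches and one gap, so the total is 2 and the second sequence is returned as "a-gt".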

Following the development of the Smith–Waterman algorithm, sequence

alignment became a topic of great importance in bioinformatics. Many papers

were published, which not only improved the Smith–Waterman algorithm,

but also adapted it to apply to protein sequences. As a result, many types of

applied software based on the alignment algorithm were developed, and exist

today as powerful bioinformatics tools. The alignment of protein sequences is

more complex than the alignment of gene sequences because it is much more

diﬃcult to develop scoring matrices (which quantify the similarities between

the sequences) for protein sequences. Researchers have proposed many types of

scoring systems to produce these scoring matrices, such as the PAM system

and the BLOSUM system. For the scoring matrix of the PAM system, the

probability of mutations based on the evolution time of the homologous family

is determined ﬁrst, followed by the development of the scoring matrix. The

scoring matrix of the BLOSUM system ﬁnds the probability of mutations

based on the conservative regions of the homologous family, then develops

a scoring matrix. Therefore, depending on their requirements, users can choose

their scoring matrix based on either the PAM system or the BLOSUM system,

then combine the scoring matrix with a dynamic programming algorithm

to calculate the highest scoring functions. We will go into more detail later

regarding the scoring matrices of the PAM system and the BLOSUM system.

The reader is also referred to the literature [24, 40] for more information on

this topic.

Besides being adapted for use with proteins, there are many other applica-

tions of alignment algorithms. Nowadays, over ten types of software packages

exist for the purpose of database searching. Among them, BLAST and FASTA

are the most popular of those available as free downloads.

A dynamic programming-based algorithm places one sequence along the vertical axis and the other along the horizontal axis. One must first assign the penalty scores (or matching scores) at every entry where a row and a column intersect, and then optimize the links between the entries. Therefore, the computational complexity of this algorithm cannot be less than O(n²) (where n is the length of the aligned sequence). For longer sequences, alignment and

searching are diﬃcult tasks, although they are easily realized using the meth-

ods of computational science. For example, these alignment algorithms cannot

currently be performed on a PC if the length of the sequence exceeds 100 kbp.

For lengths exceeding 10 Mbp, the alignment algorithms cannot be performed

by any computers currently in existence. In 2002, a probability and statistics-

based alignment algorithm was proposed by the Nankai University group,

called the super pairwise alignment algorithm (SPA algorithm for short) [90].

For homologous sequences, the computational complexity of SPA is only O(n),


that is, linearly proportional to the length of sequence. This makes the algo-

rithm run much faster, and makes possible the alignment and searching of

super-long sequences.
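To get a rough feeling for the quadratic cost of the dynamic programming table mentioned above, one can simply count its entries. The sketch below is a back-of-the-envelope estimate only; the helper names and the figure of 4 bytes per table entry are assumptions made for illustration.

```python
def dp_table_cells(n, m=None):
    """Number of entries in a full dynamic programming table for lengths n and m."""
    m = n if m is None else m
    return (n + 1) * (m + 1)


def dp_table_gib(n, bytes_per_cell=4):
    """Approximate memory in GiB for two sequences of length n (assumed 4 bytes per entry)."""
    return dp_table_cells(n) * bytes_per_cell / 2**30
```

With these assumptions, two sequences of 100 kbp already need a table of about 10^10 entries, whereas an O(n) method touches only on the order of 10^5 positions.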

It may seem as if the problems inherent in the method of pairwise align-

ment have all been addressed by dynamic programming-based algorithms and

statistical decision-based algorithms. However, there is much room for im-

provement, and for more applications to be developed. For example, the SPA

algorithm is a suboptimal algorithm, although it is able to deal with super-long sequences. It can still be improved, because its accuracy falls short of that of the optimal solution by 0.1–1%. Additionally, in order to process super-long sequences (i.e., when the length exceeds 100 Mbp), an “anchor” must be incorporated into the algorithm.

Multiple Alignment Algorithms

Compared to pairwise alignment, multiple alignment is much more diﬃcult.

The optimal solution of this problem was regarded as one of the unsolved

problems of computational biology and bioinformatics during the period be-

tween 2000 and 2002. It is sometimes referred to in the literature as the “NP-

complete problem” or the “NP-hard problem” [15,36,46,104,106]. The impor-

tance of multiple alignment has driven the development of software packages

that are able to handle multiple alignment algorithms. These software pack-

ages do not search for the optimal solution theoretically; rather, they make

comparisons based on some speciﬁc indices. In Chap. 5, we will examine the

following indices of multiple alignment:

1. The scope of multiple alignment. The same type of sequences can be

aligned using multiple alignments, i.e., nucleotide sequences are compared

with other nucleotide sequences, and amino acid sequences are compared

with other amino acid sequences. It is generally expected that there is

some further similarity among the sequences to be compared, as multiple

alignments are used to compare homologous sequences.

2. The scale of the multiple alignment. Let (m, n) denote the number and length of the sequences to be aligned. The maximum size (m, n) permitted by the multiple alignments is then referred to as the scale of the multiple alignment. We will list several software packages concerned with the scale of multiple alignment in Chap. 5.

3. The computational speed. The time required to determine the multiple

alignment of sequences of scale (m, n) is referred to as the computational

speed.

4. The optimizing indices. In the literature, some optimizing indices for

multiple alignments such as the “SP-scoring function” and “entropy func-

tion” are frequently mentioned. In Chap. 5, we introduce two new indices:

“similarity” and “rate of insertion.” We will discuss these indices in more

detail at that time.


We will also introduce the super multiple alignment (SMA) [89] method in

Chap. 5. The computational complexity of SMA is just O(m · n), and its

scale, speed, and indices are superior to those of HMMER [78]. Based on

HIV-1, we compare the SMA method with the HMMER method

for all indices and show the ﬁnal results in Tables 5.1 and 5.2. HIV-1 is the

known genome of the AIDS virus, and according to GenBank, its scale is

(m, n) = (706, 10,000). It takes 40 min to perform multiple alignment on a PC

using SMA, which is better and much faster than HMMER.

Analysis and Application of Multiple Alignment Results

As the results of a pairwise alignment or multiple alignment are produced, the

central problem both in theory and practice becomes the analysis and use of

those results. The most pressing problem is the analysis of multiple alignment

results, which we explain here.

We have mentioned that the purpose of multiple alignment is to determine

the relationships among mutational sequences. Thus, it is ﬁrst necessary to

ﬁnd common conservation regions, which is simpliﬁed by the use of multi-

ple alignment. Correlation among the mutations in diﬀerent sequences then

becomes the key problem. The traditional method of analysis for determin-

ing the mutual relationship is called clustering analysis. The most classical

method is the “system evolutionary tree” or “minimum distance tree” (which

will be discussed in Chap. 6). The “system evolutionary tree” or “minimum

distance tree” method is a clustering relation established by the mutation dis-

tance determined by diﬀerent aligned sequences. Thus, its structure is mea-

surable, and partly reﬂects the degree of being “far” or “near.” However, it

is not comprehensive, and some useful information is missed if the analysis

is based only on the “system evolutionary tree” and the “minimum distance

tree.”

Currently, in order to analyze the results of multiple alignment, we pro-

pose the “multiple sequence mutation network theory” (“mutation network”

for short). The theory is that we can replace the “topologically metric struc-

ture” by a “modulus structure” based on multiple alignment. This is an

eﬀective method to describe the mutations, and is introduced in Chap. 3.

Determination of the modulus structure involves a series of algebraic opera-

tions. The “mutation network” is a combination of the “topological graph”

and “modulus structure,” and as such, it comprises a complete description

of the alignment. We can then endow the mutation network with opera-

tional laws such as decomposition, combination and so on. As a result, the

mutation network theory is an important tool for analyzing alignment re-

sults.


1.1.4 Mathematical Problems Driven by Alignment and Structural Analysis

With respect to modeling, computation, and analysis, pairwise alignment and

multiple alignment can be seen as typical mathematical problems, rather than

biological problems. Many mathematical theories and methods are involved,

some of which are listed below:

Stochastic Analysis

Stochastic analysis is the basis of the probability and statistics analysis and

stochastic processes used in alignment modeling, the creation of fast algo-

rithms, and analysis of the results. Following from the mechanism of mutation,

we know that mutations at each site in a sequence obey a Poisson ﬂow. Thus,

the structure of diﬀerent types of mutations should be a renewal process. We

can also say that type-I mutated structures obey a Poisson ﬂow based on

observations of many sequences.
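As a small illustration of what a Poisson flow of mutation sites looks like, the following sketch draws mutated positions along a sequence with independent exponential gaps between consecutive sites. The function name, the `rate` parameter (expected mutations per site) and the fixed random seed are assumptions made for the example; this is not the authors' model.

```python
import random


def poisson_mutation_sites(length, rate, seed=0):
    """Sample mutated positions along a sequence of the given length.

    Gaps between consecutive sites are independent exponential variables,
    so the sites form a Poisson flow (a special case of a renewal process).
    """
    rng = random.Random(seed)
    sites = []
    pos = 0.0
    while True:
        pos += rng.expovariate(rate)  # waiting distance to the next mutation
        if pos >= length:
            return sites
        sites.append(int(pos))        # map the continuous position to a site index
```

With `rate=0.01` and `length=100_000`, roughly 1,000 mutated sites are expected, and the returned positions are in increasing order.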

Stochastic analysis is the basis of mutation structure analysis, and it al-

lows us to understand the overall data character, based on all given sequences.

It also plays an important role in the development of the alignment algorithm

and computation of the indices (such as complexity estimation, error estima-

tion and the values of optimizing indices).

Algebraic Structure

To describe the structural character of mutation and sequence alignments,

in Chap. 3 we propose algebraic operations for the modulus structure. This

theory deﬁnes the types of various modulus structures, the equivalent repre-

sentation and the algebraic operations. The algebraic operations of modulus

structures are key in the development of fast multiple alignment algorithms

and analysis of the results.

Combinatorial Graph Theory

Combinatorial graph theory is an important tool, both when building fast

multiple alignment algorithms and when analyzing the alignment output. Us-

ing combinatorial graph theory, cluster analysis is made possible. This is the

basis for analyzing multiple alignment outputs to construct the systemic tree

and minimum distance tree, and also to construct the mutation networks.

Alignment Space Theory

Alignment space is a data space theory based on mutation error, and is a new

concept proposed by the Nankai University group. Alignment space is a non-

linear metric space, and is the theoretical basis for alignment. As a result, its

applications will be far-reaching.


The mathematical problems mentioned above are essential for the con-

struction of alignment algorithms and analysis of the output. This combina-

tion of mathematics and biology is still in its infancy, and many problems

await deeper discussion. There is obviously much space for future develop-

ment.

1.2 Basic Concepts in Alignment and Mathematical Models

For the sake of simplicity, we confine our discussion to DNA (or RNA) sequences unless otherwise specified. For protein sequences, we need only replace $V_4$ by $V_{20}$ and then follow a similar argument. Let us begin by introducing the basic problems and mathematical models that will be used in alignment and the analysis of mutations.

1.2.1 Mutation and Alignment

Classiﬁcation of Mutations

In molecular biology, the mutation of a few small molecules within a sequence A causes A to change into a sequence B. Sequence B is then referred to as the mutation (sequence) of A. The mutation of DNA sequences can be classified into four types as follows:

Type I: a mutation caused by a nucleotide changing from one into another, e.g., “a” changing into “g.”

Type II: a mutation caused by a nucleotide segment permuting its position, e.g., the segment “accgu” permutes into the segment “guacc.”

Type III: a mutation caused by inserting a new segment into an existing sequence, e.g., inserting “aa” into the middle position of segment “gguugg” so that it becomes a new segment “gguaaugg.”

Type IV: a mutation caused by a segment of nucleotides being deleted from an existing sequence, e.g., deleting the nucleotides “ag” from the segment “acaguua” leaves the segment “acuua.”
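The four mutation types can be mimicked with simple string operations. The sketch below is only illustrative: the function names and index conventions (0-based positions, half-open slices) are assumptions, while the sample strings are the ones used in the list above.

```python
def substitute(seq, i, new):
    """Type I: replace the nucleotide at position i by `new`."""
    return seq[:i] + new + seq[i + 1:]


def permute(seq, i, j, k):
    """Type II: swap the adjacent segments seq[i:j] and seq[j:k]."""
    return seq[:i] + seq[j:k] + seq[i:j] + seq[k:]


def insert(seq, i, segment):
    """Type III: insert `segment` in front of position i."""
    return seq[:i] + segment + seq[i:]


def delete(seq, i, length):
    """Type IV: delete `length` nucleotides starting at position i."""
    return seq[:i] + seq[i + length:]
```

For example, `permute("accgu", 0, 3, 5)` yields "guacc", `insert("gguugg", 3, "aa")` yields "gguaaugg", and `delete("acaguua", 2, 2)` yields "acuua". The first two operations keep the length of the sequence, while the last two change it.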

Since types I and II do not shift the positions of the other nucleotides, these mutations are called substitution mutations. Types III and IV shift the positions of all the nucleotides that follow the mutated site, and so these mutations are called displacement mutations. The basic problem of alignment is to search for the mutated sites and decide which regions are conserved and which have been changed. The evolutionary relationship and the changes of both structure and function in the evolution process can then be determined. Alignment is obviously an important tool in this bioinformatics process.

Definition 1. If sequence B is a mutation sequence of sequence A, and they have the same biological meaning, then they are homologous sequences.


In sequence analysis, if we know that B is a mutation sequence of A but we do not know whether or not they have the same biological meaning (e.g., the differences may be caused by a measurement error), then we say that the two sequences are mutually similar. The terms “homologous sequences” and “similar sequences” are frequently used in the discussion of sequence analysis, and note should be taken of the distinction.

Deﬁnition of Alignment

To confirm the relationship between the mutations, a common approach is to compare the differences within a family of sequences, which can be viewed as operations in the mathematical sense. This is referred to as sequence alignment or alignment for short. The key to sequence alignment is deciding on the displacement mutation. Let $A, B$ be the two sequences defined in (1.1). Inserting the virtual symbol “−” into $A, B$ so that they become two new sequences $A', B'$, the elements of $A', B'$ are then in the range of $V_5 = \{0, 1, 2, 3, 4\} = \{a, c, g, t, -\}$, where $V_4$ and $V_5$ are called the quaternary set and the five-element set, respectively.

Definition 2.

1. Sequence $A'$ is the virtual expansion sequence (expansion, for short) of sequence $A$, if $A'$ is just the old sequence $A$ with the insertion symbol “−” added.

2. Sequences $A' = (a'_1, \cdots, a'_{n'_a})$, $B' = (b'_1, \cdots, b'_{n'_b})$ are called the expansions of double sequences $A, B$, if $A', B'$ are the expansions of $A, B$, respectively.

3. Sequence group $\mathcal{A}' = \{A'_s, s = 1, 2, \cdots, M\}$ is called the expansion of multiple sequence $\mathcal{A}$, if each $A'_s$ of $\mathcal{A}'$ is the expansion of $A_s$.

4. $\mathcal{A}$ is called the original sequence of $\mathcal{A}'$ if the multiple sequence $\mathcal{A}'$ is an expansion of $\mathcal{A}$. We then denote the sequence in $\mathcal{A}'$ by

$$A'_s = (a'_{s,1}, a'_{s,2}, \cdots, a'_{s,n'_s}) , \quad a'_{s,j} \in V_5 . \tag{1.4}$$

In the expansion $\mathcal{A}'$ of $\mathcal{A}$, the data value 4 corresponds to the virtual insertion data (or symbol) and the data values 0, 1, 2, 3 correspond to the nucleotide data appearing in sequence $\mathcal{A}$.

5. Sequence group $\mathcal{A}'$ is called the alignment sequence group (alignment for short) of $\mathcal{A}$ if the lengths of the sequences in $\mathcal{A}'$ are the same, if they are expansions of $\mathcal{A}$ and if $a'_{1,j}, a'_{2,j}, \cdots, a'_{M,j}$ do not simultaneously equal 4 at each position $j$.

The deﬁnitions of decompression and compression of sequences will be given

in Sect. 3.1.
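The conditions in Definition 2 are easy to check mechanically. The following sketch encodes nucleotides with the values 0 to 3 and the virtual symbol with 4, as in the text; the helper names `encode`, `strip_gaps`, `is_expansion` and `is_alignment` are assumptions introduced for this illustration.

```python
GAP = 4  # data value of the virtual symbol "-"
CODE = {"a": 0, "c": 1, "g": 2, "t": 3, "-": GAP}


def encode(text):
    """Turn a string over {a, c, g, t, -} into a list of values in {0, ..., 4}."""
    return [CODE[ch] for ch in text]


def strip_gaps(seq):
    """Remove every virtual symbol from an encoded sequence."""
    return [x for x in seq if x != GAP]


def is_expansion(expanded, original):
    """True if `expanded` is `original` with virtual symbols inserted."""
    return strip_gaps(expanded) == list(original)


def is_alignment(expanded_group, original_group):
    """Check item 5: equal lengths, expansions, and no column made only of gaps."""
    same_length = len({len(s) for s in expanded_group}) == 1
    expansions = len(expanded_group) == len(original_group) and all(
        is_expansion(e, o) for e, o in zip(expanded_group, original_group)
    )
    no_gap_column = all(
        any(x != GAP for x in column) for column in zip(*expanded_group)
    )
    return same_length and expansions and no_gap_column
```

For example, the pair "a-cg", "acc-" is an alignment of "acg", "acc", whereas "a-cg", "a-cc" is not, because both sequences carry the virtual symbol in the second column.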


Optimizing Principles of Alignment

The aim of sequence alignment is to find the expansion $\mathcal{A}'$ of a given group $\mathcal{A}$ so that all sequences in $\mathcal{A}'$ have lower “difference” or higher “similarity.” In bioinformatics, “difference” is usually quantified using a “penalty matrix” or “scoring matrix.”

The basis of the penalty function is the penalty matrix. It stands for the degree of difference of each molecular unit (such as a nucleotide or amino acid) in a biological sequence. It is usually expressed in matrix form as follows:

$$D = (d(a, b))_{a,b \in V_5} . \tag{1.5}$$

In bioinformatics, the penalty matrix of DNA sequence alignment is usually fixed by the Hamming matrix or the WT-matrix. The Hamming matrix on $V_5$ is defined as follows:

$$d_H(a, b) = \begin{cases} 0 , & \text{if } a = b \in V_5 , \\ 1 , & \text{otherwise} , \end{cases} \tag{1.6}$$

while the WT-matrix is

$$D_{WT} = [d_{WT}(a, b)]_{a,b \in V_5} = \begin{pmatrix} 0 & 0.77 & 0.45 & 0.77 & 1 \\ 0.77 & 0 & 0.77 & 0.45 & 1 \\ 0.45 & 0.77 & 0 & 0.77 & 1 \\ 0.77 & 0.45 & 0.77 & 0 & 1 \\ 1 & 1 & 1 & 1 & 0 \end{pmatrix} . \tag{1.7}$$

The value of the scoring matrix is a maximum if $a = b$. Generally, the scoring matrix is denoted by $G = [g(a, b)]_{a,b \in V_5}$. The entries in the scoring matrix are opposite in value to the corresponding values in the penalty matrix. For example, the scoring matrix of the Hamming matrix is $g(a, b) = 1 - d_H(a, b)$, $a, b \in V_5$.

The penalty matrix (or scoring matrix) is used to optimize the results of the alignment. Thus, both matrices are referred to as the optimizing matrix and are denoted by $W = [w(a, b)]_{a,b \in V_5}$.
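For readers who prefer to see the matrices as data, the following sketch writes the Hamming matrix, the WT-matrix and the Hamming-based scoring matrix as nested Python lists indexed by the values 0 to 4. The variable names are assumptions; treating the pair (4, 4) as penalty 0 in the Hamming matrix is also an assumption, chosen to agree with the last diagonal entry of the WT-matrix.

```python
SYMBOLS = range(5)  # 0, 1, 2, 3 = a, c, g, t and 4 = virtual symbol

# Hamming penalty: 0 on the diagonal, 1 everywhere else.
HAMMING = [[0 if a == b else 1 for b in SYMBOLS] for a in SYMBOLS]

# WT penalty matrix, copied row by row from (1.7).
WT = [
    [0.00, 0.77, 0.45, 0.77, 1.0],
    [0.77, 0.00, 0.77, 0.45, 1.0],
    [0.45, 0.77, 0.00, 0.77, 1.0],
    [0.77, 0.45, 0.77, 0.00, 1.0],
    [1.00, 1.00, 1.00, 1.00, 0.0],
]

# Scoring matrix derived from the Hamming matrix: g(a, b) = 1 - d_H(a, b).
HAMMING_SCORE = [[1 - d for d in row] for row in HAMMING]
```

In the WT-matrix, the pairs a/g and c/t receive the smaller penalty 0.45, while every pairing with the virtual symbol costs 1.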

The optimizing function measures the optimal value of the two sequences. If

$$A' = (a'_1, \cdots, a'_{n'}) , \quad B' = (b'_1, \cdots, b'_{n'}) ,$$

are two sequences on $V_5$, then the optimizing function is defined as

$$w(A', B') = \sum_{j=1}^{n'} w(a'_j, b'_j) , \tag{1.8}$$

where $\bar{w}(A', B') = \frac{1}{n'} w(A', B')$ is the average optimal rate of $(A', B')$. In what follows, we will not distinguish between the optimizing function and the average optimal rate, and the reader is expected to discern which one is implied according to the context.

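The most frequently used optimizing function in multiple alignment is the SP-function. If $\mathcal{A}$ are multiple sequences given by (1.2), and $\mathcal{A}'$ is the expansion of $\mathcal{A}$ given in Definition 2, then the SP-function is defined by:

$$w_{SP}(\mathcal{A}') = \sum_{s=1}^{m-1} \sum_{t>s} w(A'_s, A'_t) = \sum_{s=1}^{m-1} \sum_{t>s} \sum_{j=1}^{n'} w(a'_{s,j}, a'_{t,j}) . \tag{1.9}$$

Here $m$ is the number of sequences in $\mathcal{A}'$ (denoted $M$ in Definition 2) and $n'$ is their common length. Then, $w_{SP}(\mathcal{A}')$ denotes the optimizing function, or optimizing measurement, to align the multiple sequences $\mathcal{A}'$.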
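The pairwise optimizing function and the SP-function translate directly into code. The sketch below is a minimal illustration under the assumption that sequences are lists of values in {0, ..., 4} of equal length and that the optimizing matrix is supplied as a function `w(a, b)`; the names `pair_value`, `average_rate`, `sp_value` and `hamming` are hypothetical.

```python
from itertools import combinations


def pair_value(x, y, w):
    """Sum of w over the columns of two equally long sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(w(a, b) for a, b in zip(x, y))


def average_rate(x, y, w):
    """Pairwise value divided by the common length."""
    return pair_value(x, y, w) / len(x)


def sp_value(group, w):
    """Sum of the pairwise values over all pairs s < t in the group."""
    return sum(pair_value(s, t, w) for s, t in combinations(group, 2))


def hamming(a, b):
    """Hamming penalty: 0 for equal symbols, 1 otherwise."""
    return 0 if a == b else 1
```

As a usage example, for the three aligned sequences `[0, 1, 4, 2]`, `[0, 4, 1, 2]` and `[0, 1, 1, 2]` the pairwise Hamming penalties are 2, 1 and 1, so `sp_value` returns 4.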



Definition 3. Optimal alignment of multiple sequences is the situation where, for given multiple sequences $\mathcal{A}$, the expansion $\mathcal{A}'$ attains the optimal value of the SP-function in (1.9). That is, if we find the expansion $\mathcal{A}'_0$ of $\mathcal{A}$ such that

$$\begin{cases} w_{SP}(\mathcal{A}'_0) = \min\{ w_{SP}(\mathcal{A}') : \mathcal{A}' \text{ is an alignment of } \mathcal{A} \} & \text{while } W \text{ is the penalty matrix} , \\ w_{SP}(\mathcal{A}'_0) = \max\{ w_{SP}(\mathcal{A}') : \mathcal{A}' \text{ is an alignment of } \mathcal{A} \} & \text{while } W \text{ is the scoring matrix} , \end{cases} \tag{1.10}$$

then $\mathcal{A}'_0$ is called the optimal alignment of $\mathcal{A}$.

The optimal alignment $\mathcal{A}'_0$ determined by the SP-function is called the SP-optimal solution or the SP-function-based optimal solution. The process to find this SP-optimal solution is known as the SP-method.

Pairwise alignment is the simplest case of multiple alignment. We will

discuss the optimal criteria for multiple alignment in Chap. 7.

Example 1. We discuss the following sequences:

$$\begin{cases} A = (00132310322) , \\ B = (1323210322) , \\ A' = (400441323103422) , \\ B' = (144323421044322) , \\ A'' = (001323410322) , \\ B'' = (441323310322) . \end{cases}$$

We can see that $B$ is a mutated sequence of $A$, and $A'$, $B'$ are expansions of sequences $A$, $B$, respectively. Then

$$d_H(A', B') = 12 > d_H(A'', B'') = 3 .$$
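The two Hamming penalties quoted in Example 1 can be checked with a few lines of Python. The digit strings below are copied from the example; the variable names `FIRST_A`, `FIRST_B`, `SECOND_A`, `SECOND_B` and the helper `hamming_penalty` are assumptions used only for this check.

```python
def hamming_penalty(x, y):
    """Count the columns in which two equally long digit strings differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)


# First pair of expanded sequences from Example 1 (4 = virtual symbol).
FIRST_A = "400441323103422"
FIRST_B = "144323421044322"

# Second pair of expanded sequences from Example 1.
SECOND_A = "001323410322"
SECOND_B = "441323310322"

print(hamming_penalty(FIRST_A, FIRST_B))    # 12
print(hamming_penalty(SECOND_A, SECOND_B))  # 3
```

Removing the digit 4 from the first pair, for example with `FIRST_A.replace("4", "")`, recovers the original sequences 00132310322 and 1323210322.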

Therefore, the penalty of (A



) is smaller than that of (A

