Shen S., Tuszynski J.A. Theory and Mathematical Methods for Bioinformatics


414 14 Semantic Analysis for Protein Primary Structure

Table 14.8. The eigenvalues of the relative entropy density function of the third rank

Function  Mean     Variance  Standard   Maximum  Minimum   μ + 3.5σ  μ − 3.5σ  Number of    Number of
type      (μ)      (σ²)      deviation                                         local words  local words
                             (σ)                                               of type I    of type II
          0.02239  0.06933   0.26331    2.82880  −0.98377  0.94399   −0.89920  50           1
          0.01590  0.04803   0.21916    2.18053  −0.92031  0.78295   −0.75114  53           6
          0.01652  0.04983   0.22323    2.23581  −1.13892  0.79782   −0.76478  49           9
          0.01592  0.04823   0.21962    2.18089  −0.90762  0.78458   −0.75274  54           2
          0.01778  0.05152   0.22697    1.37851  −1.37851  0.81218   −0.77663  36           48
          0.01466  0.04256   0.20630    1.58496  −1.58496  0.73673   −0.70740  48           57
          0.01726  0.05002   0.22364    1.50198  −1.50198  0.80001   −0.76549  37           40
          0.01400  0.04054   0.20134    1.70828  −1.58496  0.71869   −0.69069  31           46
          0.01398  0.04040   0.20100    1.58496  −1.70828  0.71748   −0.68951  39           39
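As a quick check on how the columns of Table 14.8 relate to one another, the following sketch recomputes the standard deviation and the two thresholds μ ± 3.5σ from the tabulated mean and variance of the first row. The helper name `thresholds` is hypothetical and is not notation from the text.

```python
import math

def thresholds(mean, variance, tau=3.5):
    """Return (sigma, mean + tau*sigma, mean - tau*sigma)."""
    sigma = math.sqrt(variance)
    return sigma, mean + tau * sigma, mean - tau * sigma

# Mean and variance of the first row of Table 14.8.
sigma, upper, lower = thresholds(0.02239, 0.06933)
print(round(sigma, 3), round(upper, 3), round(lower, 3))
```

The results agree with the tabulated values 0.26331, 0.94399 and −0.89920 to about three decimal places; the small differences come from rounding in the table.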

14.2 Permutation and Combination Methods 415

Table 14.9. Vocabulary of local words in tripeptide chains obtained by function k where τ = 3.5

Tripeptide  Function k   Tripeptide  Function k   Tripeptide  Function k   Tripeptide  Function k
AAR          0.8800      AWY          0.9401      RRN          1.0088      RCQ          0.8608
NND          1.1113      DWY          1.0809      CCN          0.9025      CCQ          1.1099
CCH          0.9238      CCI          0.9867      CCP          1.1000      CCS          0.9419
CCT          1.1079      CCY          1.0179      CCV          0.9288      CGM          0.9367
CHW          0.8036      CWF          0.9154      CWY          0.9028      QQE          2.1651
QQI          0.9102      QWY          0.8563      EEG          1.0708      HCQ          1.1058
HQE          0.8404      HHE          1.0121      HHI          2.2358      HHF          0.8031
HHS          1.2020      HPG          0.8862      HPY          0.8606      HWV          0.9085
HYQ          0.8992      LCQ          0.8312      PPH          0.8159      PPS          1.3695
SCQ          1.1518      SST          1.0226      SWY          0.9231      TCQ          0.8391
TWY          0.9172      WCQ          1.0100      WHI          1.0138      WHF          1.3881
WWD          1.7609      WWV          0.9166      WYP          0.8767      WYY          1.0541
YWY          1.0033      CMQ         −1.1389      CMS         −0.7809      EPF         −0.8306
HEI         −0.7943      HES         −0.7871      FKY         −0.8506      WPQ         −0.8086
WPM         −0.7852      YPQ         −0.7881


databases, such as PROSITE, Pratt, EMOTIF, etc. [43, 45, 47, 48, 94]. They use different methods; for example, Pratt and EMOTIF are index databases analyzing the structures and functions of the polypeptide chains directly, and PROSITE is an index database of homologous protein classes obtained by the alignment of homologous proteins using PSI-BLAST [4]. Therefore, it is meaningful to compare the relationships between these different types of databases.

The core of the combinatorial graph theory method is the vector segment of the sequence data. When its length reaches a certain level, the data structures of long sequences may form a recursive relation. This characteristic is widely used in the analysis of shift register sequences and codes. Theoretically, the method can be applied by means of the complexity theory of sequence data and the theory of Boolean functions or the de Bruijn–Good graph of data structures. Therefore, many theories and tools used in the analysis of shift register sequences and codes can be brought in. However, the purpose of research on biological sequences differs from that of code analysis. The former aims to find the relationship between words and the language of biological sequences, while the latter aims to construct sequences that are pseudorandom. An in-depth discussion of combinatorial graph theory can be found in the literature [35]. Combinatorial graph theory methods can also be used to discuss the complexity, classification, cutting and regulation of databases. In this book, we only discuss the use of core words for the classification and prediction of homologous proteins.

14.2.1 Notation Used in Combinatorial Graph Theory

The mathematical model, definitions and notation used for protein primary structures can readily be found in the literature [89], and hence they will not be repeated in detail. The theory of Boolean functions and that of the de Bruijn–Good graph can also be found in the literature [35].

Combination Space and Database

Let $V_q = \{1, 2, \cdots, q\}$ be a set of integers, which represents an alphabet for biological sequences. In the database of protein primary structures, we take $q = 20$ to denote the 20 commonly occurring amino acids. For the sake of convenience, in this book we set $q = 23$, and take 21 to be the zero element. Here $V_q$ is a finite field, in which addition and multiplication are integer operations modulo 23.

Let $V_q^{(k)}$ be the $k$th ranked product space of $V_q$, whose element $b^{(k)} = (b_1, \cdots, b_k) \in V_q^{(k)}$ is the $k$th ranked vector on $V_q$. $V_q^{(k)}$ is also called the $k$th ranked combination space of $V_q$.

As mentioned in Sect. 14.1.2, $\Omega$ is a database of protein primary structures, which is composed of $M$ proteins. Here

$$\Omega = \{C_s,\ s = 1, 2, \cdots, M\}, \tag{14.7}$$

where $C_s = (c_{s,1}, c_{s,2}, \cdots, c_{s,n_s})$ is the primary structure sequence of a single protein and $n_s$ is the length of the $s$th protein sequence, whose components $c_{s,i} \in V_q$ are the commonly occurring amino acids. If $C_s$ is the primary structure sequence of a protein, then we denote

$$c_s^{[i,j]} = (c_{s,i}, c_{s,i+1}, \cdots, c_{s,j}), \quad 1 \le i \le j \le n_s, \tag{14.8}$$

to represent a polypeptide chain of protein $C_s$, whose length is $k = j - i + 1$. Here $c_s^{[i,j]} \in V_q^{(k)}$ is a vector in the $k$th ranked combination space.
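The segment notation of formula (14.8) can be sketched in a few lines of Python. The integer encoding below (the order of the string `AMINO_ACIDS`) is an assumption made for illustration, since the text does not fix which integer stands for which amino acid; `encode` and `segment` are hypothetical helper names.

```python
# Assumed ordering of the 20 common amino acids; the text does not specify one.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
CODE = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # values 1..20

def encode(protein):
    """Map a one-letter protein string to a tuple of integers in the alphabet."""
    return tuple(CODE[aa] for aa in protein)

def segment(c, i, j):
    """Return (c_i, ..., c_j) using the 1-based indices of formula (14.8)."""
    if not 1 <= i <= j <= len(c):
        raise ValueError("need 1 <= i <= j <= n")
    return c[i - 1:j]

c = encode("MIRFLVFS")
print(segment(c, 3, 5))  # the tripeptide RFL as integers, length k = 5 - 3 + 1
```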

Boolean Functions on a Combination Space

If $V_q^{(k)}$ is a combination space and $f(b^{(k)})$ is a single-valued mapping $V_q^{(k)} \to V_q$, then we say $f$ is a $k$th ranked Boolean function with $q$ elements on $V_q$. In mathematics, Boolean functions can have several representations, such as the listing representation, combination representation, function representation, graph representation, etc. They are described in detail in the literature [35], and here we only introduce the related notation.

1. Combination representation. The listing representation is what we are familiar with. The combination representation can be represented by a group of subsets of $V_q^{(k)}$,

$$A_f = \{A_{f,1}, A_{f,2}, \cdots, A_{f,q}\}, \tag{14.9}$$

where

$$A_{f,j} = \left\{ b^{(k)} \in V_q^{(k)} : f\left(b^{(k)}\right) = j \right\}, \quad j \in V_q, \tag{14.10}$$

then $A_f$ is called the combination representation of the Boolean function. Here $A_f$ is a division (partition) of $V_q^{(k)}$.

2. Function representation. If $f$ is a mapping $V_q^{(k)} \to V_q$, then $f$ is a function whose domain is $V_q^{(k)}$ and which takes values in $V_q$. If $V_q$ is a finite field, the Boolean function can be calculated by addition and multiplication operations on a finite field. The formula is

$$f\left(b^{(k)}\right) = \sum_{j_1, \cdots, j_k = 0}^{q-1} \alpha_{j_1, \cdots, j_k} \prod_{i=1}^{k} b_i^{j_i}, \tag{14.11}$$

where $\alpha_{j_1, \cdots, j_k} \in V_q$, and the addition, multiplication and power operations in formula (14.11) are operations on the field $V_q$.

3. Graph representation. The definition of a graph is given in [12, 35]. In this section, we denote it by $G = \{A, V\}$, where $A$ is a vertex set and $V$ is the dual point set of $A$, which is called the edge set in graph theory. In the following, we denote the vertices in $A$ as $a, b, c$, etc., and the edges in $V$ as $(a, b), (a, c), (b, c)$, etc. The definitions of graphs fall into the categories of finite graphs, directed graphs, undirected graphs, subgraphs, supergraphs, plot graphs, etc. In this section we only discuss finite graphs and directed graphs.
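The listing and combination representations can be illustrated on a toy case. The sketch below is only an illustration under stated assumptions: it uses $q = 3$ and $k = 2$, writes the alphabet as the residues 0, ..., q−1 rather than the labels 1, ..., q, and picks an arbitrary polynomial for `f`.

```python
from itertools import product

q, k = 3, 2          # small toy sizes, chosen for illustration
V = range(q)         # residues 0..q-1 stand in for the alphabet

def f(b):
    """A toy Boolean function on the combination space, given as a polynomial mod q."""
    return (b[0] + 2 * b[1] * b[1]) % q

# Listing representation: the full table of values of f.
listing = {b: f(b) for b in product(V, repeat=k)}

# Combination representation: the classes of vectors b with f(b) = j.
classes = {j: [b for b, v in listing.items() if v == j] for j in V}

for j, members in classes.items():
    print(j, members)
```

Every vector of the combination space lands in exactly one class, which is the partition property stated for the combination representation.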

The Theory of the de Bruijn–Good Graph

One of the important graphs is the de Bruijn–Good graph (which is called the DG graph for short). We denote a $k$th ranked DG graph by $G_{q,k} = \{B_{q,k}, V_{q,k}\}$, where

$$B_{q,k} = V_q^{(k)}, \quad V_{q,k} = V_q^{(k+1)}. \tag{14.12}$$

Here the elements in $B_{q,k}$, $V_{q,k}$ can be denoted respectively by

$$b^{(k)} = (b_1, \cdots, b_k), \quad b^{(k+1)} = ((b_1, \cdots, b_k), (b_2, \cdots, b_{k+1})). \tag{14.13}$$

Then, $b^{(k+1)}$ can be considered to be dual points in $B_{q,k}$, or edges in $G_{q,k}$. When $q$ is fixed, we denote $G_{q,k} = \{B_{q,k}, V_{q,k}\}$ as $G_k = \{B_k, V_k\}$ for short. In the following, the subgraph of a DG graph is also called a DG graph.

The Boolean graph is an important DG graph. If $f$ is the Boolean function $V_q^{(k)} \to V_q$, then we call $G_f = \{B, V_f\}$ the Boolean graph determined by $f$, where

$$B = V_q^{(k)}, \quad V_f = \left\{ \left(b^{(k)}, f(b^{(k)})\right),\ b^{(k)} \in B \right\}. \tag{14.14}$$

Here $G_f$ is a subgraph of graph $G_k$. Boolean graphs and Boolean functions determine each other. In graph theory, DG graphs can have several specific representations, which will not be introduced here.
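The construction of formulas (14.12) to (14.14) can be sketched for a small alphabet. The code below is an illustration only: it uses residues 0, ..., q−1 as the alphabet, and the function names `dg_graph` and `boolean_graph` as well as the toy function passed in are assumptions, not notation from the text.

```python
from itertools import product

def dg_graph(q, k):
    """Vertices (all k-vectors) and edges (all (k+1)-vectors) of the kth ranked DG graph."""
    vertices = list(product(range(q), repeat=k))
    # Each (k+1)-vector joins its first k entries to its last k entries.
    edges = [(w[:-1], w[1:]) for w in product(range(q), repeat=k + 1)]
    return vertices, edges

def boolean_graph(q, k, f):
    """Keep, for every vertex b, only the edge (b, f(b)) selected by f."""
    return [(b, b[1:] + (f(b),)) for b in product(range(q), repeat=k)]

vertices, edges = dg_graph(2, 2)
sub = boolean_graph(2, 2, lambda b: (b[0] + b[1]) % 2)
print(len(vertices), len(edges), len(sub))
```

For $q = 2$, $k = 2$ the DG graph has 4 vertices and 8 edges, and the Boolean graph keeps exactly one outgoing edge per vertex, so it is a subgraph with 4 edges.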

Properties of the Boolean Graph

The definitions of edge, path and tree in graph theory have been given in Chap. 6. Detailed properties of the Boolean graph in the DG graph can be found in the literature [35]. In this section, we only introduce the basic properties. We know from the definition of a Boolean graph that a DG graph is a Boolean graph if and only if there is at most one outgoing edge from each vertex. From this we arrive at the following conclusions:

1. There must be several cycles in a Boolean graph. We call a cycle with only one vertex a trivial cycle. Different cycles have no common vertex. We denote all the cycles in a Boolean graph $G_f$ by $\{O_1, \cdots, O_m\}$, where each cycle is

$$O_s = \{b_{s,0} \to b_{s,1} \to \cdots \to b_{s,k_s-1} \to b_{s,k_s}\}, \quad s = 1, 2, \cdots, m, \tag{14.15}$$

where $b_{s,0} = b_{s,k_s}$.

2. In the vertex set $B$ of a Boolean graph, each vertex $b$ arrives at a cycle $O_s$ in the end. In $B$, we denote all vertices arriving at cycle $O_s$ by $W_s$.

3. Sets $W_1, \cdots, W_m$ are mutually disjoint, and their union is set $B$.
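The three conclusions above can be checked on a toy Boolean graph. The sketch below assumes a Boolean function defined on every vertex, so that each vertex has exactly one successor; the function `f`, the sizes $q = 2$, $k = 3$ and the helper names are illustrative assumptions.

```python
from itertools import product

def cycles_and_basins(vertices, succ):
    """Split a graph with one successor per vertex into its cycles and the vertex sets reaching them."""
    cycle_of = {}                      # vertex -> index of the cycle it reaches
    cycles = []
    for start in vertices:
        path, seen = [], {}
        v = start
        # Walk forward until we meet a classified vertex or close a new loop.
        while v not in cycle_of and v not in seen:
            seen[v] = len(path)
            path.append(v)
            v = succ(v)
        if v in cycle_of:
            idx = cycle_of[v]
        else:
            idx = len(cycles)
            cycles.append(path[seen[v]:])
        for u in path:
            cycle_of[u] = idx
    basins = [[v for v in vertices if cycle_of[v] == s] for s in range(len(cycles))]
    return cycles, basins

q, k = 2, 3

def f(b):
    """Toy Boolean function, chosen only for illustration."""
    return (b[0] * b[1]) % q

def successor(b):
    """Follow the single edge leaving vertex b."""
    return b[1:] + (f(b),)

verts = list(product(range(q), repeat=k))
cycles, basins = cycles_and_basins(verts, successor)
print(cycles)
print([len(w) for w in basins])
```

For this toy function the graph has the two trivial cycles (0, 0, 0) and (1, 1, 1), reached by 7 vertices and 1 vertex respectively, and the two vertex sets together cover all 8 vertices.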

DG Boolean Graph Generated by Sequences

Let $C = (c_1, \cdots, c_n)$ be a sequence of length $n$ in $V_q$, where $c_i \in V_q$. The subvector of sequence $C$ is denoted as

$$c_i^{(k)} = (c_i, c_{i+1}, \cdots, c_{i+k-1}), \quad 1 \le i \le n - k + 1. \tag{14.16}$$

Let

$$B_k(C) = \left\{ c_i^{(k)} = (c_i, c_{i+1}, \cdots, c_{i+k-1}),\ i = 1, 2, \cdots, n - k + 1 \right\}, \tag{14.17}$$

then we call $B_k(C)$ the $k$th ranked vector family determined by sequence $C$.

Definition 55. If $C$ is a sequence in $V_q$, for any positive integer $k$, we define the $k$th ranked DG Boolean graph determined by sequence $C$ as follows:

$$G_k(C) = \{B_k(C), B_{k+1}(C)\}, \tag{14.18}$$

where the elements in $B_{k+1}(C)$ are

$$c_i^{(k+1)} = (c_i, c_{i+1}, \cdots, c_{i+k}) = ((c_i, \cdots, c_{i+k-1}), (c_{i+1}, \cdots, c_{i+k})),$$

which are dual points in $B_k(C)$. Thus $G_k(C)$ is determined by $C$.
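A minimal sketch of formulas (14.16) to (14.18), working directly on a string of one-letter amino acid codes rather than on integers. The helper names `vector_family` and `sequence_graph` are hypothetical, and the input is only the first ten residues of RISA-CHLPN from Fig. 14.1.

```python
def vector_family(c, k):
    """The set of distinct length-k windows of sequence c."""
    return {tuple(c[i:i + k]) for i in range(len(c) - k + 1)}

def sequence_graph(c, k):
    """Vertices are the k-windows; each (k+1)-window is an edge written as (tail, head)."""
    vertices = vector_family(c, k)
    edges = {(w[:-1], w[1:]) for w in vector_family(c, k + 1)}
    return vertices, edges

vertices, edges = sequence_graph("MIRFLVFSLL", 2)
print(sorted("".join(v) for v in vertices))
print(len(vertices), len(edges))
```

Every edge joins two vertices of the same graph, because both the first $k$ and the last $k$ letters of a $(k+1)$-window are themselves $k$-windows of the sequence.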

Graph $G_k(C)$ has the following properties:

1. Sequence $C$ is a path in graph $G_k(C)$, and its terminus is $c^{(k)}_{n-k+1}$.

2. For the ﬁxed sequence C, if vertices in B

other, we denote C



=(c

, ··· ,c

), where vector (c

, ··· ,c

)is

given in C. Then, the graph G



) is a Boolean graph, and sequence



comprises a maximum cycle of G



), which traverses each vertex

in B

3. If sequence $C$ can generate a Boolean function, then we take

$$f(c_i, c_{i+1}, \cdots, c_{i+k-1}) = c_{i+k}, \quad i = 1, 2, \cdots, n - k + 1, \tag{14.19}$$

to be the Boolean function which generates sequence $C$. This will be called the generating function of sequence $C$ for short in the following. In general, the solution of formula (14.19) is not unique. We denote by $F^{(k)}(C)$ all the solutions that hold for formula (14.19), which are called the Boolean function family that generates $C$.


4. If there is a vertex occurring in A

(C)

contains a cycle. If $c_i^{(k)} = c_j^{(k)}$, $i < j$, then

$$c_i^{(k)} \to c_{i+1}^{(k)} \to \cdots \to c_{j-1}^{(k)} \to c_j^{(k)} = c_i^{(k)} \tag{14.20}$$

comprise a cycle. To have graph G

must be increased. This is to be solved by nonlinear complexity theory.
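The generating function of formula (14.19) can be tabulated directly from a sequence, and doing so shows why the rank $k$ sometimes has to be increased. The sketch below is an illustration under assumptions: it only uses windows that have a successor inside the sequence, and the names `generating_function` and `smallest_boolean_rank` are hypothetical.

```python
def generating_function(c, k):
    """Return the table window -> next symbol, or None if some window has two different successors."""
    table = {}
    for i in range(len(c) - k):
        window, nxt = tuple(c[i:i + k]), c[i + k]
        if table.setdefault(window, nxt) != nxt:
            return None   # same vertex with two outgoing edges: not a Boolean graph
    return table

def smallest_boolean_rank(c):
    """Smallest k for which every k-window of c has at most one successor."""
    for k in range(1, len(c)):
        if generating_function(c, k) is not None:
            return k
    return len(c)

print(smallest_boolean_rank("ABABAB"), smallest_boolean_rank("AABAAB"))  # 1 2
```

In the second toy sequence the window A is followed by both A and B, so no generating function of rank 1 exists, while rank 2 suffices. In shift register theory the smallest such rank is usually what is meant by the nonlinear complexity of a sequence.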

14.2.2 The Complexity of Databases

To discuss under what condition G

the problem of the sequence complexity in cryptography.

The Complexity of a Sequence

In sequence analysis used in cryptography, complexity can be associated with three different definitions: linear complexity, nonlinear complexity and nonsingular complexity. These concepts are frequently cited in the combinatorial analysis of semantics, and we begin with these definitions.

Definition 56. If $C$ is a given sequence, several definitions of complexity can be formulated as follows:

1. We call k = K

a Boolean graph while G

k−1

2. We call k = K

graph while its generating function $f$ is a linear function $V_q^{(k)} \to V_q$.

3. We call k = K

function $f$ of $C$ is a nonsingular function $V_q^{(k)} \to V_q$.

Nonsingular functions are an important class of Boolean functions. We define them as follows: in formula (14.19), when $(c_{i+1}, \cdots, c_{i+k-1})$ is fixed, each $c_i$ corresponds to $c_{i+k}$ one-to-one.

In the literature [87], a series of properties and formulas for these three

complexities are given. For instance, they follow formula K

The Complexity of a Database

The definitions of the graph generated by a sequence in Definition 55 and of the complexity can be extended to those of a database. In the database of protein primary structures given in Sect. 14.1.2, the graphs generated by each protein primary structure sequence $C_s$ are defined by formula (14.18) to be

$$G_k(C_s) = \{B_k(C_s), V_k(C_s)\}, \quad s = 1, 2, \cdots, M. \tag{14.21}$$

We define $G_k(\Omega) = \{A_k(\Omega), V_k(\Omega)\}$ as the graph generated by database $\Omega$, where

$$A_k(\Omega) = \bigcup_{s=1}^{M} B_k(C_s), \quad V_k(\Omega) = \bigcup_{s=1}^{M} V_k(C_s).$$

From graph $G_k(\Omega)$, the linear complexity, nonlinear complexity and nonsingular complexity of database $\Omega$ can be defined similarly. They follow from Definition 56, and are not repeated here.
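The graph generated by a database can be sketched as the union of the per-sequence graphs. The code below is an illustration on a made-up two-sequence database, not on real data; edges are stored as $(k+1)$-letter strings, and the helper names are assumptions.

```python
def sequence_graph(c, k):
    """Vertex set (k-windows) and edge set ((k+1)-windows) of one sequence."""
    vertices = {c[i:i + k] for i in range(len(c) - k + 1)}
    edges = {c[i:i + k + 1] for i in range(len(c) - k)}
    return vertices, edges

def database_graph(omega, k):
    """Union of the per-sequence graphs over all sequences in the database."""
    all_vertices, all_edges = set(), set()
    for c in omega:
        vertices, edges = sequence_graph(c, k)
        all_vertices |= vertices
        all_edges |= edges
    return all_vertices, all_edges

def is_boolean(edges):
    """True if no vertex (the first k letters of an edge) has two different successors."""
    tails = [e[:-1] for e in edges]
    return len(tails) == len(set(tails))

omega = ["FLVFSLL", "FLVLSLL"]   # toy two-sequence database, not real data
for k in (2, 3, 4):
    vertices, edges = database_graph(omega, k)
    print(k, len(vertices), len(edges), is_boolean(edges))
```

Here the first sequence alone already gives a graph with one successor per vertex at $k = 2$, whereas the union over both sequences only does so at $k = 4$, because the two near-identical sequences share windows that continue differently. This is the same effect, on a toy scale, as the jump in complexity reported for a database of homologous sequences.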

The Biological Significance of Complexity

The essential significance of sequence complexity is that it describes under what conditions segments of different vectors in the same sequence lead to recursive relations, and how these recursive relations express themselves and change. Thus, it is closely related to the concepts of the regulation and splicing of biological sequences.

We find in the calculation of biological sequences that the computation of sequence complexity is effective for single protein sequences, but not as effective for the analysis of databases $\Omega$.

Example 30. Trichosanthin is a pharmaceutical protein extracted from Chinese herbs. It was used as an abortion-inducing drug [105], and in recent years it has been found to have an inhibitory effect on several types of cancer and on AIDS, which has attracted much attention. In the Swiss-Prot database, two homologous proteins, RISA-CHLPN and RISA-CHLTR, have the primary structures given in Fig. 14.1.

We denote these two sequences by C, D. Their nonlinear complexity and

nonsingular complexity are found to be, respectively,

(C)=K

(D)=3,K

(C, D)=94,

(C)=K

(D)=4,K

(C, D) ≥ 95 .

(14.22)

This shows that the nonlinear complexity of database $\Omega$ can become very high (in the Swiss-Prot database, the nonlinear complexity can be higher than 3000). The reason for this increase is that there are many mutually homologous sequences and self-homologous sequences in database $\Omega$. Therefore, complexity analysis is useful in database searches, mutation predictions and the general analysis of homologous sequences.

14.2.3 Key Words and Core Words in a Database

The Definitions of Key Words and Core Words

In order to analyze the database of protein primary structures efficiently using a combinatorial method, we set up the theory of key words and core words in a database.


RISA-CHLPN:

MIRFLVFSLLILTLFLTAPAVEGDVSFRLSGATSSSYGVFISNLRKALPYERKLYDIPLLRSTLPGSQRYALIHLTNYADETISVAIDVTNVYVMG

YRAGDTSYFFNEASATEAAKYVFKDAKRKVTLPYSGNYERLQIAAGKIRENIPLGLPALDSAITTLFYYNANSAASALMVLIQSTSEAARYKFIEQ

QIGKRVDKTFLPSLAIISLENSWSALSKQIQIASTNNGQFETPVVLINAQNQRVTITNVDAGVVTSNIALLLNRNNMAAIDDDVPMAQSFGCGSYAI

RISA-CHLTR:

MIRFLVLSLLILTLFLTTPAVEGDVSFRLSGATSSSYGVFISNLRKALPNERKLYDIPLLRSSLPGSQRYALIHLTNYADETISVAIDVTNVYIMG

YRAGDTSYFFNEASATEAAKYVFKDAMRKVTLPYSGNYERLQTAAGKIRENIPLGLPALDSAITTLFYYNANSAASALMVLIQSTSEAARYKFIEQ

QIGKRVDKTFLPSLAIISLENSWSALSKQIQIASTNNGQFESPVVLINAQNQRVTITNVDAGVVTSNIALLLNRNNMAAMDDDVPMTQSFGCGSYAI

Fig. 14.1. Primary structures of RISA-CHLPN and RISA-CHLTR


Definition 57.

1. We call vector $b^{(k)}$ the $\tau$th ranked key word in database $\Omega$ if the frequency number $n(b^{(k)})$ of $b^{(k)}$ occurring in $\Omega$ follows $n(b^{(k)}) = \tau$, where $n(b^{(k)})$ denotes the number of times vector $b^{(k)}$ occurs in database $\Omega$.

2. We call $b^{(k)}$ the $\tau$th ranked core word in database $\Omega$ if $b^{(k)}$ is the $\tau$th ranked key word in $\Omega$, and $n(b_1, \cdots, b_{k-1}) > \tau$, $n(b_2, \cdots, b_k) > \tau$ both hold, where $(b_1, \cdots, b_{k-1})$ and $(b_2, \cdots, b_k)$ are the subvectors with $(k-1)$ elements at the beginning and at the end of $b^{(k)}$, respectively. The first ranked key word and core word are called for short the key word and core word, respectively.
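Definition 57 translates directly into word counting. The sketch below follows the definition as worded above on a made-up three-sequence database; the helper names and the toy data are assumptions, and the code is not claimed to be the procedure used to produce the examples in this chapter.

```python
from collections import Counter

def word_counts(omega, k):
    """Count every length-k word over all sequences of the database."""
    counts = Counter()
    for c in omega:
        counts.update(c[i:i + k] for i in range(len(c) - k + 1))
    return counts

def key_words(omega, k, tau=1):
    """Words of length k that occur exactly tau times in the database."""
    return {w for w, n in word_counts(omega, k).items() if n == tau}

def core_words(omega, k, tau=1):
    """Key words whose leading and trailing (k-1)-subwords both occur more than tau times."""
    shorter = word_counts(omega, k - 1)
    return {w for w in key_words(omega, k, tau)
            if shorter[w[:-1]] > tau and shorter[w[1:]] > tau}

omega = ["ABC", "ABD", "EBC"]   # toy database, not real protein data
print(sorted(key_words(omega, 3)), sorted(core_words(omega, 3)))
```

In this toy database all three words of length 3 are key words, but only ABC is a core word: its subwords AB and BC each occur twice, whereas ABD and EBC each contain a subword (BD, EB) that occurs only once.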

Key words and core words are "labels" for protein primary structure sequences. That is, if $b^{(k)}$ is a core word in $\Omega$, then there is one and only one sequence $C_s$ in $\Omega$ that contains this vector.

Key words and core words can also serve as a "classification" method for proteins. If $b^{(k)}$ is the $\tau$th ranked key word in $\Omega$, contained in proteins $s_i$, $i = 1, 2, \cdots, k$, then proteins $s_1, \cdots, s_k$ contain the same key word $b^{(k)}$. They can be considered to be homologous (or locally homologous) proteins.

A protein $C_s$ may contain several core words; thus protein primary structure sequences have multiple "labels" or characteristics.
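The use of words as labels and as a classification device can be sketched with an index from words to the proteins that contain them. The code below is an illustration only: the three short strings are toy fragments, the helper names are hypothetical, and "found in exactly one protein" is used as a simple stand-in for the label property described above.

```python
from collections import defaultdict

def proteins_by_word(omega, k):
    """Map each length-k word to the set of protein indices (1-based) that contain it."""
    index = defaultdict(set)
    for s, c in enumerate(omega, start=1):
        for i in range(len(c) - k + 1):
            index[c[i:i + k]].add(s)
    return index

def labels(omega, k):
    """Words found in exactly one protein, grouped by that protein."""
    out = defaultdict(list)
    for word, owners in proteins_by_word(omega, k).items():
        if len(owners) == 1:
            out[next(iter(owners))].append(word)
    return dict(out)

omega = ["MIRFLVF", "MIRFLVL", "GQFETPV"]   # toy fragments for illustration only
index = proteins_by_word(omega, 4)
print(sorted(index["MIRF"]))
print({s: sorted(ws) for s, ws in labels(omega, 4).items()})
```

The word MIRF groups proteins 1 and 2 together, while FLVF and FLVL each single out one of them, so the same index supports both the classification and the labeling reading.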

Example 31. For the trichosanthin sequences RISA-CHLPN and RISA-CHLTR given in Example 30, the core words of length 6 are

C: RFLVFS, PYERKL, YERKLY, TLPGSQ, TNVYVM, NVYVMG, VYVMGY,

YVMGYR, VMGYRA, NGQFET, GQFETP, FETPVV, NMAAID, MAAIDD,

AAIDDD, DVPMAQ, VPMAQS, PMAQSF, MAQSFG, AQSFGC.

D: IRFLVL, TLFLTT, LPNERK, YIMGYR, VFKDAM, FKDAMR, KDAMRK,

NNMAAM, NMAAMD, MAAMDD, VPMTQS, PMTQSF, MTQSFG, TQSFGC.

Properties of Key Words and Core Words

If $b^{(k)}$, $c^{(k')}$ are two vectors, and there exist $1 \le i < j \le k'$ with $j - i + 1 = k$ such that $b^{(k)} = c^{[i,j]}$ holds, then we say that vector $c^{(k')}$ contains $b^{(k)}$, and $c^{(k')}$ is an extension of $b^{(k)}$, while vector $b^{(k)}$ is a contraction of $c^{(k')}$. Key words and core words have the following properties:

1. If $b^{(k)}$ is a key word in the protein sequence $C_s$, then any extension $b^{(k')}$ in protein $C_s$ is a key word and its frequency number recursively declines such that $n(b^{(k)}) \ge n(b^{(k')})$