Shen S., Tuszynski J.A. Theory and Mathematical Methods for Bioformatics



characteristic structure needed for the analysis of receptors is relatively complicated. It requires knowledge of voids and structural motifs inside the protein, the types and sizes of cavities and grooves on the surface of the protein, etc. From these characteristic structures, the potential for a ligand-receptor interaction can be further evaluated.

These characteristic structural analyses involve a series of problems in geometric calculation, which will be discussed in detail in the following chapters.

8.3 Analysis and Exploration

Exercise 39. Describe the similarities and differences between the formation of bonds between different amino acids in in vitro experiments in the chemistry laboratory, and in vivo within living organisms.

Exercise 40. There is a relationship between crystallized proteins and proteins in living organisms (proteins in water or a buffer solution). The protein three-dimensional data in the PDB database refer to crystallized proteins. Does this have an impact on the analysis of protein function?

Exercise 41. The four-dimensional structure of a protein refers to its dynamic (or changing) three-dimensional structure when the protein is under different conditions. How can four-dimensional protein structural data be collected and studied?

Exercise 42. The genetic code table demonstrates how amino acids are encoded by triples of nucleotides. It shows a connection between two types of biological molecules. The biological process involved in producing amino acids requires a series of functions involving mRNA, tRNA, and rRNA, and as such is very complicated. Explain this process from the point of view of molecular movement.

Exercise 43. Protein three-dimensional structure analysis is the first step in investigating protein functions. The interactions between different proteins depend first on whether or not their configurations match, and also on whether different amino acids in the matched configurations can give rise to biochemical bonds and how strongly they may react. Discuss the effects that molecules of a drug and viruses may have on the function of proteins.

Informational and Statistical Iterative Analysis of Protein Secondary Structure Prediction

9.1 Protein Secondary Structure Prediction and Informational and Statistical Iterative Analysis

9.1.1 Protein Secondary Structure Prediction

The Anfinsen Principle of Protein 3D Structure Prediction

The Anfinsen (1972, 1973) principle [6] is the foundation of protein 3D structure prediction. It states that all the information about a protein's 3D structure is contained in its primary structure (sequence). The main experimental basis for this is that heating proteins in solution deforms their 3D structures and causes them to lose activity. However, when the temperature is lowered to its original value, the primary structure has remained intact and the 3D structure returns to its original configuration. This experiment indicates that the primary structure of a protein determines its 3D structure.

In recent years, the Anfinsen principle for protein 3D structure has been challenged [27, 84, 113]. Some experiments have shown that the same primary structure may form diverse 3D structures under different conditions [27, 84, 113]. In other words, the same primary structure may lead to different 3D phase structures.

The Basic Problem of Protein Secondary Structure Prediction

As early as 1951, Pauling et al. proposed that partial segments of a protein can form special α-helix and β-sheet structures, which are examples of protein secondary structures. Thus the basic problem of protein secondary structure prediction (PSSP) is to estimate from the primary structure which partial segments can form these α-helix and β-sheet structures. Since then, it has been discovered that these secondary structures not only exist in proteins in great amounts, but are also closely related to protein function. Thus, prediction of the secondary structure was an important topic in protein structure investigation in the last two decades of the twentieth century. As large amounts of data on protein 3D structure have been compiled, the study of this problem has diversified and has led to a great deal of research (see the literature review in [79]). In this book we do not discuss these results again, but analyze the informational and statistical characteristics of this problem, to illustrate its current status.

General Status of the Protein 3D Structure Prediction

Protein 3D structure prediction is classified into several types: secondary structure prediction, secondary structure component prediction, and 3D structure prediction.

Secondary structure component prediction refers to the prediction of the α-helix and β-sheet proportions within a protein.

Protein 3D structure prediction aims to predict the 3D structure from the protein's primary structure. The key is the long-distance folding problem, and the commonly used methods are folding pattern classification, molecular dynamics calculations, partial peptide domain structure analysis, etc. Folding pattern classification aims to classify the types of protein 3D folding patterns, and to make predictions based on these folding types. Molecular dynamics calculations determine the interactions between the atoms within proteins and build their potential of mean force; the stable 3D configuration is the state corresponding to the energy minimum of the potential field. The domain structure method builds corresponding databases and weight coefficients for longer peptide configurations (containing ten amino acids or more), and then uses these coefficients to predict the protein 3D configuration.

In recent years, although many papers on protein structure prediction have been published and much progress has been made, the overall results are not ideal. Over the past 20 years, many methods of PSSP have been developed, but the accuracy rates have been around 70% at best. The accuracy rate for 3D structure prediction has been even worse. For instance, in the most commonly used 3D structure classification, for which there is no single preferred calculation method, the variance in 3D folding pattern estimation is huge. Because protein molecular structures are complex, the atoms contained in a protein range in number from hundreds to tens of thousands, and the forces between them are of many types (van der Waals forces, electrostatic forces, hydrogen bonds, etc.), the computational complexity of the interactions between atoms grows exponentially. Even a massive supercomputer cannot successfully deal with even the smallest peptide (such as one containing only 20 amino acids). Therefore, protein 3D structure prediction still has a long way to go before it can be considered reliable. In this chapter, we first analyze the statistical and informational characteristics of protein primary and secondary structure. The prediction algorithm is also covered.


Table 9.1. Primary and secondary structure sequences of protein 12E8H

EVQLQQSGAE VVRSGASVKL SCTASGFNIK DYYIHWVKQR PEKGLEWIGW IDPEIGDTEY VPKFQGKATM TADTSSNTAY

ooSSSSoooS SSSooooSSS SSSSSoooHH HSSSSSSSSo ooooSSSSSS SSoooooSSS oHHHoooSSS SSSooooSSS

4211110011 2113211111 1112331123 3111111110 3210111111 1133121111 1332333112 1113312111

LQLSSLTSED TAVYYCNAGH DYDRGRFPYW GQGTLVTVSA AKTTPPSVYP LAPGSAAQTN SMVTLGCLVK GYFPEPVTVT

SSSooooHHH oSSSSSSSSS oooooooooo oooSSSSSSo ooooooSSSS Sooooooooo ooSSSSSSSS SSoooooSSS

1113111332 2011121201 3210212321 1111111212 1011111111 2111212121 3112111123 2130321121

WNSGSLSSGV HTFPAVLQSD LYTLSSSVTV PSSTWPSETV TCNVAHPASS TKVDKKIVPR D

SHHHoooooS SSoooSSSoo SSSSSSSSSS oHHHoooooo SSSSSSHHHo SSSSSSoooo o

0002323321 1111111002 1121111120 1332333112 1111113312 1111112034 4

9.1.2 Data Source and Informational Statistical Model of PSSP

Data Source of Secondary Structure Prediction

The data source is the foundation of structure prediction. Protein structure databases are commonly used in secondary structure prediction. One example is the PDB database, which gives the primary and secondary structure information for every classified protein, along with the 3D coordinates of each atom comprising the protein. Using this information, models and algorithms of PSSP can be established.

Data on more than 20,000 proteins and 70,000 peptide chains are contained in the 2005 version of the PDB database. Because the PDB database contains large numbers of homologous proteins, it is not suitable for direct statistical analysis. Statistical analysis usually uses the PDB-Select database instead, in which excess homologous sequences in the PDB database are deleted and 3265 sequences are kept. Hence, it is a simplification of the PDB database.

The PDB database gives a clear indication of the primary and secondary structure of every protein, which we express in an alternative manner in dual-sequence form, as detailed below. For instance, for protein 12E8H (which has 221 amino acid residues), the primary and secondary structure sequences are shown in Table 9.1.

In the second line of each group in Table 9.1, the letters H, S, and o denote α-helix, β-sheet, and other structures, respectively. The third line expresses the torsion angle value of the protein backbone triangle, which will be further discussed later.
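The dual-sequence form can be illustrated with a short Python sketch that pairs each residue with its secondary structure letter. The integer codes are assumptions made only for this illustration: amino acids are numbered 0 to 19 in alphabetical order of their one-letter codes, and H, S, o are numbered 0, 1, 2. The text does not specify which numbering the authors use. The input is the first 20 residues of 12E8H from Table 9.1.

```python
# Hypothetical numbering: 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: n for n, aa in enumerate(AMINO_ACIDS)}
# Hypothetical numbering of the secondary structure letters used in Table 9.1.
SS_INDEX = {"H": 0, "S": 1, "o": 2}


def dual_sequence(primary, secondary):
    """Return a list of (residue code, structure code) pairs, one per site."""
    primary = primary.replace(" ", "")      # Table 9.1 prints blocks of ten
    secondary = secondary.replace(" ", "")
    if len(primary) != len(secondary):
        raise ValueError("primary and secondary sequences differ in length")
    return [(AA_INDEX[a], SS_INDEX[b]) for a, b in zip(primary, secondary)]


# First two blocks of protein 12E8H as printed in Table 9.1.
pairs = dual_sequence("EVQLQQSGAE VVRSGASVKL", "ooSSSSoooS SSSooooSSS")
print(pairs[:3])
```

Keeping the two sequences as a list of pairs makes it easy to slide a window of length three over both at once, which is what the tripeptide statistics later in the chapter require.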

The Random Model of Secondary Structure Prediction

Secondary structure prediction is the prediction of the secondary structure status of each amino acid from the primary structure of the protein. In many protein databases, the relation between the primary and secondary structure is random, so we give their random relationship model as follows:

Let $\xi^{(n)} = (\xi_1, \xi_2, \cdots, \xi_n)$ be the primary structure sequence of a protein, where each $\xi_\tau$, $\tau = 1, 2, \cdots, n$, represents the primary structure status, viz. the name of the $\tau$th amino acid. Thus, the value of $\xi_\tau$ is in the set $V_{20} = \{0, 1, \cdots, 19\}$, and $n$ is the length of the protein.


Similarly, we set $\eta^{(n)} = (\eta_1, \eta_2, \cdots, \eta_n)$ to be the secondary structure of the protein, where the value of $\eta_\tau$ is in the set $\{0, 1, 2\} = \{H, S, o\}$. We then call

$$(\xi^{(n)}, \eta^{(n)}) = ((\xi_1, \eta_1), (\xi_2, \eta_2), \cdots, (\xi_n, \eta_n)) \tag{9.1}$$

the protein primary–secondary joint structure random model (or joint structure random sequence), where $\tau = 1, 2, \cdots, n$ is the position in the sequence and $\xi_\tau, \eta_\tau$ represent the primary and secondary structure status, respectively, at site $\tau$ of the protein. Hence the value of $(\xi_\tau, \eta_\tau)$ is in $V_{20} \otimes V_3$, where $V_3 = \{0, 1, 2\} = \{H, S, o\}$ is the set of protein secondary structure statuses. For ease of discussion, we introduce the following notations and terminologies:

1. In the protein primary–secondary structure sequence $(\xi^{(n)}, \eta^{(n)})$, we call

$$(\xi_\tau^{(3)}, \eta_\tau^{(3)}) = ((\xi_\tau, \eta_\tau), (\xi_{\tau+1}, \eta_{\tau+1}), (\xi_{\tau+2}, \eta_{\tau+2})) \tag{9.2}$$

the tripeptide chain that begins at site $\tau$, where $\tau$ takes values in $\{1, 2, \cdots, n-2\}$.

2. Let $a^{(3)} = (a_1, a_2, a_3)$ and $b^{(3)} = (b_1, b_2, b_3)$ be the primary and secondary structure status vectors of a protein segment of length 3, with $a_\tau \in V_{20}$, $b_\tau \in V_3$, $\tau = 1, 2, 3$. The corresponding dual-vector is denoted by

$$(a^{(3)}, b^{(3)}) = ((a_1, a_2, a_3), (b_1, b_2, b_3)) = (s, t, r; i, j, k), \quad s, t, r \in V_{20}, \quad i, j, k \in V_3. \tag{9.3}$$

We call this the primary–secondary structure status vector of the tripeptide chain.

3. The primary–secondary structure status vector $(a^{(3)}, b^{(3)})$ of a tripeptide chain can be considered to be a sample of a random vector $(\xi^{(3)}, \eta^{(3)})$; thus, we can define its probability distribution as follows:

$$p(s, t, r; i, j, k) = P\{(\xi^{(3)}, \eta^{(3)}) = (s, t, r; i, j, k)\}, \quad s, t, r \in V_{20}, \quad i, j, k \in V_3. \tag{9.4}$$

These probability distributions can be obtained from the PDB or PDB-Select databases.

4. From the joint probability distribution in (9.4), we can obtain the conditional probability distribution, the marginal (boundary) distribution, and the conditional marginal distribution. For instance, the marginal distributions are

$$p(s, t, r) = \sum_{i,j,k=0}^{2} p(s, t, r; i, j, k), \quad p(i, j, k) = \sum_{s,t,r=0}^{19} p(s, t, r; i, j, k). \tag{9.5}$$

The corresponding conditional probability distributions are

$$p[(i, j, k) \mid (s, t, r)] = p(s, t, r; i, j, k) / p(s, t, r), \tag{9.6}$$

etc.

5. From these probability distributions, all types of Shannon entropies and interaction information can be obtained. For instance, the joint Shannon entropy of $(\xi^{(3)}, \eta^{(3)})$ is

$$H(\xi^{(3)}, \eta^{(3)}) = -\sum_{i,j,k=0}^{2} \sum_{s,t,r=0}^{19} p(s, t, r; i, j, k) \log p(s, t, r; i, j, k). \tag{9.7}$$

The conditional entropy of $\eta^{(3)}$ on $\xi^{(3)}$ is

$$H(\eta^{(3)} \mid \xi^{(3)}) = -\sum_{i,j,k=0}^{2} \sum_{s,t,r=0}^{19} p(s, t, r; i, j, k) \log p[(i, j, k) \mid (s, t, r)]. \tag{9.8}$$

The conditional mutual information of $(\eta_1, \eta_3)$ on $(\xi^{(3)}, \eta_2)$ is

$$I(\eta_1; \eta_3 \mid \xi^{(3)}, \eta_2) = \sum_{s,t,r=0}^{19} \sum_{i,j,k=0}^{2} p(s, t, r; i, j, k) \log \frac{p(i, k \mid s, t, r, j)}{p(i \mid s, t, r; j)\, p(k \mid s, t, r; j)}, \tag{9.9}$$

where $p(i, k \mid s, t, r, j)$, $p(i \mid s, t, r; j)$, $p(k \mid s, t, r; j)$ are conditional probabilities derived from $p(s, t, r; i, j, k)$.
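The quantities in (9.4) to (9.9) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the joint distribution is estimated by simple relative frequencies over all length-3 windows, and base-2 logarithms are assumed because the text does not state the base. Positions 0, 1, 2 of each state stand for (s, t, r) and positions 3, 4, 5 for (i, j, k).

```python
from collections import Counter
from math import log2


def tripeptide_counts(proteins):
    """Count states (s, t, r, i, j, k) over every length-3 window."""
    counts = Counter()
    for xi, eta in proteins:                 # integer-coded sequences
        for tau in range(len(xi) - 2):
            counts[tuple(xi[tau:tau + 3]) + tuple(eta[tau:tau + 3])] += 1
    return counts


def marginal(joint, keep):
    """Sum the joint distribution over all positions not listed in keep."""
    out = Counter()
    for state, p in joint.items():
        out[tuple(state[n] for n in keep)] += p
    return out


def entropy(dist):
    """Shannon entropy (base 2 assumed) of a distribution given as a dict."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def conditional_entropy(joint, target, given):
    """H(target | given) = H(target, given) - H(given)."""
    return entropy(marginal(joint, target + given)) - entropy(marginal(joint, given))


def conditional_mutual_information(joint, x, y, given):
    """I(x; y | given) = H(x | given) - H(x | y, given)."""
    return (conditional_entropy(joint, x, given)
            - conditional_entropy(joint, x, y + given))


# Toy data (made up): one protein of length 5 in integer codes.
toy = [([0, 1, 2, 3, 4], [0, 0, 1, 1, 2])]
counts = tripeptide_counts(toy)
total = sum(counts.values())
joint = {state: c / total for state, c in counts.items()}

h_eta_given_xi = conditional_entropy(joint, (3, 4, 5), (0, 1, 2))            # as in (9.8)
cmi = conditional_mutual_information(joint, (3,), (5,), (0, 1, 2, 4))        # as in (9.9)
print(h_eta_given_xi, cmi)
```

Writing every quantity as a difference of entropies of marginals keeps the code short; with a real database one would replace `toy` by the encoded PDB-Select sequences.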

9.1.3 Informational and Statistical Characteristic Analysis on Protein Secondary Structure

Informational Characteristic Calculation on Protein Primary and Secondary Structure

We aim to predict a protein's secondary structure from its primary structure, so we first analyze the conditional informational characteristics of the secondary structures on the primary structures. Our results are shown in Tables 9.2 and 9.3.

The data in Table 9.2 are conditional entropies. In each block, the row label gives the primary structure variables that are conditioned on, and the column label gives the secondary structure variables. For instance, the entry in the first row and first column is $H(\eta_3 \mid \xi^{(3)}, \eta_1, \eta_2) = 0.5798$, and the entry in the second row and first column is $H(\eta_3 \mid \xi_1, \xi_2, \eta_1, \eta_2) = 0.6807$.

The data in Table 9.3 are conditional mutual information values. For instance, the entry in the first row and first column is $I(\eta_1; \eta_2 \mid \xi^{(3)}, \eta_3) = 0.31831$, while the entry in the second row and first column is $I(\eta_1; \eta_2 \mid \xi_1, \xi_2, \eta_3) = 0.31782$.


Table 9.2. Conditional entropy of protein primary and secondary structures

H            k|(i,j)   j|(i,k)   i|(j,k)   (j,k)|i   (i,k)|j   (i,j)|k
(s, t, r)    0.5798    0.2564    0.5451    1.1617    1.1712    1.1197
(s, t)       0.6807    0.3314    0.6293    1.2239    1.2559    1.2785
(s, r)       0.6636    0.3383    0.6363    1.2501    1.2965    1.3153
(t, r)       0.6564    0.3250    0.6670    1.2773    1.3042    1.3264

H            j|i       k|i       i|j       k|j       i|k       j|k
(s, t, r)    0.5819    0.9054    0.5914    0.6261    0.8634    0.5746
(s, t)       0.6845    1.0338    0.6659    0.7101    0.9877    0.6899
(s, r)       0.6692    0.9944    0.6444    0.6910    0.9311    0.6732
(t, r)       0.6482    0.9796    0.6214    0.6850    0.9318    0.6664

H            i         j         k         (i,j,k)
(s, t, r)    1.2917    1.2822    1.3336    2.4534
(s, t)       1.3667    1.3925    1.4534    2.7319
(s, r)       1.3945    1.4000    1.4120    2.7273
(t, r)       1.4146    1.3671    1.3927    2.7192

Table 9.3. Conditional mutual information of protein primary and secondary structures

I            (i;j)|k   (i;k)|j   (j;k)|i   (i;j)     (i;k)     (j;k)
(s, t, r)    0.31831   0.04626   0.32556   0.70031   0.42826   0.70757
(s, t)       0.31782   0.02939   0.35310   0.70805   0.41962   0.74333
(s, r)       0.34064   0.02736   0.33083   0.73081   0.41754   0.72101
(t, r)       0.33446   0.02866   0.32325   0.71890   0.41310   0.70769
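As a consistency check on the two tables, the standard identities H(i, j, k) = H(i) + H(j|i) + H(k|i, j) and I(x; y) = H(y) − H(y|x) can be applied to the (s, t, r) rows. The sketch below assumes that every entry in those rows is conditioned on the full primary tripeptide (s, t, r); under that reading the printed values agree to within rounding. The dictionary keys are ad hoc labels introduced for this illustration.

```python
# Values copied from the (s, t, r) rows of Table 9.2 (entropies) ...
H = {"i": 1.2917, "j": 1.2822, "k": 1.3336, "ijk": 2.4534,
     "j|i": 0.5819, "k|i": 0.9054, "k|j": 0.6261, "k|ij": 0.5798}
# ... and of Table 9.3 (mutual information).
I = {"i;j": 0.70031, "i;k": 0.42826, "j;k": 0.70757}

# Each check pairs a value computed from Table 9.2 with the printed value.
checks = {
    "chain rule H(i,j,k)": (H["i"] + H["j|i"] + H["k|ij"], H["ijk"]),
    "I(i;j) = H(j) - H(j|i)": (H["j"] - H["j|i"], I["i;j"]),
    "I(i;k) = H(k) - H(k|i)": (H["k"] - H["k|i"], I["i;k"]),
    "I(j;k) = H(k) - H(k|j)": (H["k"] - H["k|j"], I["j;k"]),
}

for name, (computed, printed) in checks.items():
    print(f"{name}: computed {computed:.4f}, printed {printed:.4f}")
```

Agreement of this kind also confirms how the row and column labels of the flattened tables should be read.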

Informational Characteristic Analysis on Protein Primary and Secondary Structure

Based on the above, we analyze the information transfer characteristics of each variable in protein primary and secondary structure as follows:

1. The hidden Markov property holds for the tripeptide chain sequences. We define tripeptide chain sequences as

$$\zeta_\tau = (\xi_\tau, \xi_{\tau+1}, \xi_{\tau+2}), \quad \tau = 1, 2, \cdots, n-2, \tag{9.10}$$

where $(\xi_1, \xi_2, \cdots, \xi_n)$ is the primary structure sequence of the protein. For its conditional mutual information,

I(ζ

; ζ

|ζ

)=I[ξ

; ξ

|(ξ

,ξ

)] = 0.0087 ≈ 0 .

This indicates that when the conditioning tripeptide is fixed, the other two tripeptides are nearly independent of each other. Thus the hidden Markov property holds for the tripeptide chain sequences.


2. The hidden Markov property holds for the tripeptide chain secondary structure. From the results in Table 9.3, we have

$$I[i; k \mid (s, t, r, j)] = I[\eta_1; \eta_3 \mid (\xi^{(3)}, \eta_2)] = 0.04626.$$

This indicates that when $(\xi^{(3)}, \eta_2)$ is fixed, $\eta_1$ and $\eta_3$ are nearly independent of each other. Thus, the hidden Markov property holds.

3. We see from the conditional entropies in Table 9.2 that

$$H[\eta_1 \mid \xi^{(3)}] = 1.2917, \quad H[\eta_2 \mid \xi^{(3)}] = 1.2822, \quad H[\eta_3 \mid \xi^{(3)}] = 1.3336.$$

Hence, when the secondary structures $\eta_1$, $\eta_2$, $\eta_3$ are predicted separately from the primary structure $\xi^{(3)}$ of the tripeptide chain, the best result is obtained for $\eta_2$, which has the smallest conditional entropy.

4. From

$$H[\eta_2 \mid (\xi_2, \xi_3)] = 1.3671, \quad H[\eta_2 \mid (\xi_1, \xi_2, \xi_3)] = 1.2822,$$

we see that the result of the prediction can be improved by increasing the number of primary structures, but the effect will be minimal. From

$$H[\eta_2 \mid \xi^{(3)}] = 1.2822, \quad H[\eta_2 \mid (\xi^{(3)}, \eta_3)] = 0.5746, \quad H[\eta_2 \mid (\xi^{(3)}, \eta_1, \eta_3)] = 0.2564,$$

we see that the conditional entropy decreases sharply if we use more secondary structure information; specifically, the prediction of $\eta_2$ from $(\xi^{(3)}, \eta_1, \eta_3)$ is bound to be much better than a prediction that uses less secondary structure information.
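The size of each reduction can be written out from the Table 9.2 entries for the middle residue $\eta_2$. The identification of the last two differences with entries of Table 9.3 is an assumption that the (s, t, r) rows of both tables are conditioned on the full primary tripeptide; the small discrepancies are consistent with rounding.

```latex
\begin{align*}
H[\eta_2 \mid (\xi_2, \xi_3)] - H[\eta_2 \mid \xi^{(3)}]
  &= 1.3671 - 1.2822 = 0.0849, \\
H[\eta_2 \mid \xi^{(3)}] - H[\eta_2 \mid (\xi^{(3)}, \eta_3)]
  &= 1.2822 - 0.5746 = 0.7076 \approx I(\eta_2; \eta_3 \mid \xi^{(3)}) = 0.70757, \\
H[\eta_2 \mid (\xi^{(3)}, \eta_3)] - H[\eta_2 \mid (\xi^{(3)}, \eta_1, \eta_3)]
  &= 0.5746 - 0.2564 = 0.3182 \approx I(\eta_1; \eta_2 \mid \xi^{(3)}, \eta_3) = 0.31831.
\end{align*}
```

Adding one more primary structure variable therefore removes less than 0.1 of the uncertainty, whereas each neighbouring secondary structure variable removes several times as much.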

9.2 Informational and Statistical Calculation Algorithms for PSSP

9.2.1 Informational and Statistical Calculation for PSSP

To establish informational and statistical calculation algorithms for PSSP, we must first classify the data of protein secondary structure and build the corresponding statistical calculation tables of prediction information. Related discussions follow.

Data Classification

We base our discussion of the prediction problem on the PDB-Select database. We denote its set of protein sequences by $\Omega$. This set can be divided into two subsets, $\Omega_1$ and $\Omega_2$, called the training set and the validation set, respectively. Their protein primary–secondary structures are denoted respectively by

$$\Omega_1 = \{(A_1, B_1), (A_2, B_2), \cdots, (A_{m_1}, B_{m_1})\}, \quad \Omega_2 = \{(C_1, D_1), (C_2, D_2), \cdots, (C_{m_2}, D_{m_2})\}, \tag{9.11}$$

where $A_s$, $C_t$ are the primary structure sequences of proteins $s$ and $t$ in databases $\Omega_1$ and $\Omega_2$, respectively, and $B_s$, $D_t$ are the secondary structure sequences of the same two proteins. We then denote

$$Z_s = (z_{s,1}, z_{s,2}, \cdots, z_{s,n_s}), \quad Z = A, B, C, D, \quad z = a, b, c, d,$$

for their sequence expression. Using the PDB-Select database, we take $m_1 = 2765$, $m_2 = 500$.

Table of Conditional Probability Distribution

From the training set $\Omega_1$, we calculate its conditional probability distribution tables, the types of which are

$$\begin{cases}
\text{Model I:} & p[i \mid (s, t, r)],\ p[j \mid (s, t, r)],\ p[k \mid (s, t, r)], \\
\text{Model II:} & p[i \mid (s, t, r, j)],\ p[j \mid (s, t, r, i)],\ p[j \mid (s, t, r, k)],\ p[k \mid (s, t, r, j)], \\
\text{Model III:} & p[i \mid (s, t, r, j, k)],\ p[j \mid (s, t, r, i, k)],\ p[k \mid (s, t, r, i, j)],
\end{cases} \tag{9.12}$$

where the tables of Model I are conditional probability distributions of a secondary structure given the primary structures, while the tables of Models II and III are conditional probability distributions of a secondary structure given the primary structures and some of the other secondary structures. The sizes of Models I, II, and III are 8000 × 3, 24,000 × 4, and 72,000 × 3 matrices, respectively. When $\Omega_1$ is given, the joint probability distribution $p(s, t, r; i, j, k)$ is determined, and all of these conditional probability distributions can be derived from it.
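A minimal Python sketch of how such tables could be derived from tripeptide counts is given below. The function name, the position constants, and the toy counts are all hypothetical; the tables are plain relative frequencies, and contexts that never occur in the training data are simply absent from the result.

```python
from collections import Counter, defaultdict

# Positions inside a state (s, t, r, i, j, k).
S, T, R, I, J, K = range(6)


def conditional_table(counts, target, given):
    """Return {context: {value: p[value | context]}} from tripeptide counts."""
    pair_counts = Counter()
    context_counts = Counter()
    for state, c in counts.items():
        context = tuple(state[n] for n in given)
        pair_counts[(context, state[target])] += c
        context_counts[context] += c
    table = defaultdict(dict)
    for (context, value), c in pair_counts.items():
        table[context][value] = c / context_counts[context]
    return dict(table)


# Toy counts (made up): the primary tripeptide (0, 1, 2) observed four times.
counts = Counter({(0, 1, 2, 0, 0, 1): 3, (0, 1, 2, 0, 1, 1): 1})

model_1_j = conditional_table(counts, J, (S, T, R))        # p[j | (s, t, r)]
model_2_j = conditional_table(counts, J, (S, T, R, I))     # p[j | (s, t, r, i)]
print(model_1_j[(0, 1, 2)])
```

The same function covers all ten tables in (9.12) by changing `target` and `given`, which mirrors the statement that every conditional distribution follows from the single joint distribution.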

Maximum Likelihood Estimate Prediction

1. Maximum likelihood estimate (MLE) prediction uses the tables of Model I. For instance, for $p[i \mid (s, t, r)]$ and every fixed $(s, t, r) \in V_{20}^{(3)}$, find the value of $i \in \{0, 1, 2\}$ that maximizes $p[i \mid (s, t, r)]$, and denote it by $i(s, t, r)$. Then

$$p[i(s, t, r) \mid (s, t, r)] = \max\{p[0 \mid (s, t, r)],\ p[1 \mid (s, t, r)],\ p[2 \mid (s, t, r)]\}. \tag{9.13}$$

If the primary structure of the protein is $A = (e_1, e_2, \cdots, e_n)$, then its predicted secondary structure is

$$\hat{f} = (\hat{f}_1, \hat{f}_2, \cdots, \hat{f}_{n-2}) = (i(e_1 e_2 e_3),\ i(e_2 e_3 e_4),\ \cdots,\ i(e_{n-3} e_{n-2} e_{n-1}),\ i(e_{n-2} e_{n-1} e_n)), \tag{9.14}$$

where $(e_1, \hat{f}_1), (e_2, \hat{f}_2), \cdots, (e_{n-2}, \hat{f}_{n-2})$ are the protein primary–secondary structures and $\hat{f}_\tau$ is the prediction result of $f_\tau$.

2. The most significant disadvantage of MLE prediction is that the joint information of primary and secondary structure in Table 9.2 is not used comprehensively. For example, if the conditional distributions in (9.13) take the values

$$p[0 \mid (s, t, r)] = 0.3, \quad p[1 \mid (s, t, r)] = 0.3, \quad p[2 \mid (s, t, r)] = 0.4,$$

then the result predicted from (9.13) is $i(s, t, r) = 2$, even though this state has a probability of only 0.4. This type of prediction leads to large errors.

Using only the table of Model I in MLE prediction, the correct rate will not exceed 55%.
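The rule in (9.13) and (9.14) can be sketched in Python as follows. The function name and the toy table are hypothetical, states are coded 0, 1, 2, and returning `None` for a tripeptide that is absent from the table is an assumption of this sketch, since the text does not say how unseen tripeptides are handled.

```python
def mle_predict(primary, table):
    """Predict one state per tripeptide window by taking the most probable state.

    primary: integer-coded primary structure (e_1, ..., e_n).
    table:   {(s, t, r): {state: probability}}, e.g. the Model I table p[i | (s, t, r)].
    Returns n - 2 predictions, one per window start.
    """
    predicted = []
    for tau in range(len(primary) - 2):
        dist = table.get(tuple(primary[tau:tau + 3]))
        # Unseen window: no prediction (assumption of this sketch).
        predicted.append(max(dist, key=dist.get) if dist else None)
    return predicted


# Toy table (made up); the first row reproduces the 0.3 / 0.3 / 0.4 example.
table = {
    (0, 1, 2): {0: 0.3, 1: 0.3, 2: 0.4},
    (1, 2, 3): {0: 0.9, 1: 0.05, 2: 0.05},
}
prediction = mle_predict([0, 1, 2, 3], table)
print(prediction)
```

For the first window the rule commits to state 2 although its probability is only 0.4, which illustrates the weakness described in item 2.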

Threshold Series Prediction

In order to overcome the disadvantages of MLE prediction, we can adapt threshold series prediction from statistics. Its essentials are listed below:

1. Choose the parameters $\theta_1$, $\theta_2$ properly. A prediction is made only if the corresponding conditional probability in Models I, II, or III is greater than the respective parameter.

2. Using threshold series prediction, it is impossible to predict all the secondary structures at one time. Therefore, we need to apply threshold series prediction to the conditional probability distributions in Models I, II, and III repeatedly, to reach the goal of predicting all the secondary structures. We present the algorithm in the next section.
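A single thresholded pass can be sketched in Python as below. This covers only a first pass with one Model I table and one threshold; the function name, the toy table, and the use of `None` for undecided sites are assumptions of this illustration, and the repeated passes with Models II and III are not shown.

```python
def threshold_pass(primary, table, theta):
    """Return one prediction per tripeptide window, or None where undecided.

    A state is accepted only if its conditional probability exceeds theta.
    """
    predicted = [None] * max(len(primary) - 2, 0)
    for tau in range(len(primary) - 2):
        dist = table.get(tuple(primary[tau:tau + 3]), {})
        if not dist:
            continue                       # unseen window stays undecided
        state, prob = max(dist.items(), key=lambda item: item[1])
        if prob > theta:
            predicted[tau] = state
    return predicted


# Toy Model I table (made up).
table = {
    (0, 1, 2): {0: 0.3, 1: 0.3, 2: 0.4},
    (1, 2, 3): {0: 0.9, 1: 0.05, 2: 0.05},
}
first_pass = threshold_pass([0, 1, 2, 3], table, theta=0.5)
print(first_pass)
```

With a threshold above 0.5 at most one state can qualify per window, and the ambiguous 0.3 / 0.3 / 0.4 window is left open for a later pass instead of being forced to a low-confidence answer.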

9.2.2 Informational and Statistical Calculation Algorithm for PSSP

If $E = (e_1, e_2, \cdots, e_n)$ is the primary structure sequence of a protein, then we perform a recursive calculation using the table of conditional probability distributions (9.12) and threshold series prediction. We denote the secondary structure by $F = (f_1, f_2, \cdots, f_n)$, and the corresponding recursive algorithm is as follows:

Step 9.2.1 Choose parameters $\theta_1, \theta_2 > 0.5$, and predict the secondary structure $F$ from the primary structure $E$ for the first time using the table of conditional probability distributions in (9.12). The main steps are:

1. For the fixed $(e_p, e_{p+1}, e_{p+2})$, calculate

$$p[f_\tau \mid (e_p, e_{p+1}, e_{p+2})], \quad f_\tau = 0, 1, 2, \quad \tau = p, p+1, p+2.$$

2. If there exist $\tau \in \{p, p+1, p+2\}$ and $f_\tau \in \{0, 1, 2\}$ such that $p[f_\tau \mid (e_p, e_{p+1}, e_{p+2})] > \theta_1$, then $f_\tau$ is the secondary structure prediction result of $(e_p, e_{p+1}, e_{p+2})$ on the site $\tau$.