412 14 Semantic Analysis for Protein Primary Structure
Table 14.6. The eigenvalues of the relative entropy density functions $k_0$ and $k_1$

       Mean (μ)   Variance (σ²)   Standard deviation (σ)   Maximum (μ_M)   Minimum (μ_m)   μ − 2σ    μ + 2σ
k_0    −0.00633   0.01879         0.13708                  0.71381         −0.41384        0.28049   −0.26783
k_1     0.00441   0.01274         0.11289                  0.56669         −0.56669        0.23019   −0.22137
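
As an illustration of how the column quantities of such a table are related, the following minimal sketch computes the mean, variance, standard deviation, extremes, and the μ ± 2σ band for a sample of values. The function name `summarize` and the toy data are assumptions made for this example only and are not taken from the text; the population variance (division by the sample size) is also an assumption, since the text does not say which estimator was used.

```python
import statistics


def summarize(values):
    """Return the summary quantities used as table columns.

    Assumption: population variance (divide by N), not sample variance.
    """
    mu = statistics.fmean(values)
    var = statistics.pvariance(values, mu)
    sigma = var ** 0.5
    return {
        "mean": mu,
        "variance": var,
        "std": sigma,
        "max": max(values),
        "min": min(values),
        "mu_minus_2sigma": mu - 2 * sigma,
        "mu_plus_2sigma": mu + 2 * sigma,
    }


if __name__ == "__main__":
    # Toy data, chosen only so that the arithmetic is easy to check by hand.
    toy = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    for name, value in summarize(toy).items():
        print(f"{name:>16}: {value:.5f}")
```

For the toy data the mean is 5 and the standard deviation is 2, so the μ ± 2σ band runs from 1 to 9.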
Determination of Fourth and Higher Ranked Local Words
The search for fourth ranked local words is similar to that for the second
and third ranked ones, although their relative entropy density functions are
more complex. For example, there are 11 + 18 = 29 relative entropy density
functions of fourth rank. Local words can again be found by choosing a proper
value of τ. Because the amount of available statistical data is limited, local
words with a rank higher than fourth cannot be processed by the statistical
methods used for the second and third ranked ones. They must instead be
processed by permutation and combination methods, or by combining lower
ranked words. Permutation and combination methods are discussed in detail in
the next section.
14.2 Permutation and Combination Methods for Semantic Structures
In the previous section, informational and statistical analysis methods for the
semantic structure of protein primary structure were given. These methods rest
on the computation of frequency numbers and probabilities for polypeptide
chains. However, as the lengths of the polypeptide chain vectors increase, the
number of combinations of the polypeptide chain vector $b^{(k)}$ ($20^n$ for
vectors of length $n$) increases rapidly. For example, $20^6 = 6.4 \times 10^7$.
This exceeds the total number of amino acids in the Swiss-Prot database
(version 2000); hence the informational and statistical methods no longer work.
Therefore, other methods must be used to analyze the semantic structures of
higher ranked words.
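
The growth argument above can be checked with a few lines of arithmetic. The sketch below tabulates $20^n$ and finds the largest length whose number of possible vectors does not exceed a given residue count. The helper names `num_vectors` and `max_countable_length` are hypothetical, and the residue total in the usage section is a placeholder chosen for illustration, not the actual size of any Swiss-Prot release.

```python
ALPHABET_SIZE = 20  # number of standard amino acids


def num_vectors(n):
    """Number of distinct polypeptide chain vectors of length n."""
    return ALPHABET_SIZE ** n


def max_countable_length(total_residues):
    """Largest n such that 20**n does not exceed total_residues.

    Beyond this length there are more possible vectors than residues in
    the database, so frequency estimates cannot cover every vector.
    """
    n = 0
    while num_vectors(n + 1) <= total_residues:
        n += 1
    return n


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, num_vectors(n))
    # Placeholder residue count (an assumption, not a real database figure).
    placeholder_total = 30_000_000
    print("largest countable length:", max_countable_length(placeholder_total))
```

With any residue total below $6.4 \times 10^7$, the cut-off falls at length five or shorter, which matches the observation that vectors of length six already outnumber the available data.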
In this section, we continue to analyze the semantic structure of protein
primary structures using combinatorial graph theory methods, and we define
the key words and core words, together with their characteristic parameters,
for protein primary structures in the Swiss-Prot database. Key words and core
words are special types of polypeptide chains that occur uniquely in a protein
database (i.e., in nature). Hence the key words and core words are actually
special kinds of biological signatures [91].
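
To make the idea of a uniquely occurring polypeptide chain concrete, the following minimal sketch lists the length-k segments that occur exactly once in a small set of sequences. It assumes that "exist uniquely" means "occurs exactly once across the whole database", which is only one possible reading of the informal description given here. The function names `kmer_counts` and `unique_kmers` and the toy sequences are hypothetical and are not taken from the text.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every length-k segment over all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def unique_kmers(sequences, k):
    """Length-k segments that occur exactly once in the whole collection.

    Assumption: uniqueness is taken over all positions in all sequences.
    """
    counts = kmer_counts(sequences, k)
    return sorted(word for word, c in counts.items() if c == 1)


if __name__ == "__main__":
    toy_database = ["ACDA", "CDE"]  # toy sequences, not real proteins
    print(unique_kmers(toy_database, 2))
```

In the toy database the segment CD appears in both sequences and is therefore excluded, while AC, DA, and DE each occur once.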
The concept of a biological signature occurs commonly in biological semantic
analysis. Besides the small-molecule library and conformation theory
mentioned above, many research institutions are building their own annotated