398 14 Semantic Analysis for Protein Primary Structure
are called biological words. In protein primary structure sequences, we
will discuss what kinds of combinations of amino acids can represent
words, what their structures and meanings are, and how to determine
their properties.
3. Local words. In English, words have not only meanings but also special
properties in the way their symbols are arranged. For instance, in English
words the letter q is almost always followed by u, and the three letters
a, n, d occur in the order “and” far more often than in the orders adn,
nad, nda, dan, or dna. Vectors with such special statistical properties
are called local words.
Local words and biological words are different concepts, but many
biological words may be, or may contain, local words. For instance, “qu”
has a special mathematical structure, so it is a local word, but it is not
an English word; it is only a component of many English words, such as
“queen”. By contrast, “and” is both a local word and an English word.
Local words can be found by mathematical means, while biological words
should be given definite biological content, which is the fundamental
purpose of semantic analysis. Our point is that the analysis of and search
for local words will promote the search for and discovery of biological
words.
4. Phrases. They are composed of several words arranged in a certain order.
They may be the superposition of several words or a new word. In math-
ematics, a phrase can be considered to be a compound vector composed
of several vectors. Normally, idioms can be regarded as special words or
phrases.
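The statistical idea behind local words in item 3 can be illustrated with a small sketch. It is not the authors' procedure: the function names `kgram_counts` and `permutation_profile` and the sample sentence are hypothetical, and the sketch only compares how often one ordering of a set of letters occurs relative to the other orderings.

```python
from collections import Counter
from itertools import permutations


def kgram_counts(text, k):
    """Count every run of k consecutive letters inside the words of text."""
    counts = Counter()
    for word in text.lower().split():
        # Keep letters only, so punctuation does not create spurious k-grams.
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - k + 1):
            counts[letters[i:i + k]] += 1
    return counts


def permutation_profile(counts, gram):
    """Return the count of each distinct ordering of the letters in gram."""
    orderings = sorted({"".join(p) for p in permutations(gram)})
    return {o: counts[o] for o in orderings}


# Made-up sample text, used only to exercise the functions.
sample = "the queen and the band stand on dry land and sand"
profile = permutation_profile(kgram_counts(sample, 3), "and")
for ordering, count in profile.items():
    print(ordering, count)
```

On this toy sentence the ordering "and" dominates its five rearrangements. On a real corpus the same comparison would be made on normalized frequencies rather than raw counts.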
Databases and the Statistical Distribution of Vectors
Lexical analysis on biological sequences begins with statistical computation
on a database of protein primary structures. We construct the following math-
ematical model for this purpose:
1. Mathematical description of the database. We denote by Ω a database
of protein primary structures, such as the Swiss-Prot database. Here
Ω consists of $M$ proteins. For instance, in the Swiss-Prot database version
2000, $M$ = 107,618. Thus, Ω can be denoted by a multiple sequence:
$\Omega = \{A_s,\ s = 1, 2, \cdots, M\}$, where
$A_s = (a_{s,1}, a_{s,2}, \cdots, a_{s,n_s})$ is the primary structure
sequence of a single protein, its components $a_{s,i} \in V_q$ are amino
acids, and $n_s$ is the length of the protein sequence.
2. Frequency numbers and frequencies determined by a database. If Ω is
given, the frequency numbers and frequencies of different vectors occurring
in this database can be obtained.
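As a rough illustration of this counting step, the sketch below tallies how often each vector of a given length occurs in a toy database and then divides by the total to obtain frequencies. Several things here are assumptions rather than statements from the text: a vector of rank k is taken to be a run of k consecutive amino acids, overlapping occurrences are all counted, windows never span two proteins, and the names `frequency_numbers` and `frequencies` and the three sequences are made up.

```python
from collections import Counter


def frequency_numbers(database, k):
    """Return the occurrence count of every length-k vector in the database."""
    counts = Counter()
    for seq in database:                    # one protein sequence at a time
        # Assumption: overlapping windows count, and none spans two proteins.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def frequencies(counts):
    """Divide each count by the total count of all length-k vectors."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {vec: c / total for vec, c in counts.items()}


# Toy stand-in for the database: three invented sequences, not real entries.
omega = ["MKVLA", "MKV", "AKVLM"]
n = frequency_numbers(omega, 2)
p = frequencies(n)
print(n.most_common(3))
print(p["KV"])
```

A `Counter` keeps the sketch short. For a database the size of Swiss-Prot and larger k, a different storage scheme might be needed, since the number of possible vectors grows as 20 to the power k.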
In the following, we denote by $b^{(k)}$ the fixed vector of rank $k$ in
$V_q^{(k)}$. The number of times it occurs in the database Ω is the
frequency number, denoted by $n(b^{(k)})$. Denote by $n_0$ the sum of the
frequency numbers of all the vectors of rank $k$. Then
$p(b^{(k)}) = \frac{n(b^{(k)})}{n_0}$ will be the normalized frequency or probability