416 14 Semantic Analysis for Protein Primary Structure
databases, such as PROSITE, Pratt, EMOTIF, etc. [43, 45, 47, 48, 94]. They
use different methods; for example, Pratt and EMOTIF are index databases
analyzing the structures and functions of the polypeptide chains directly, and
PROSITE is an index database of homologous protein classes obtained by the
alignment of homologous proteins using PSI-BLAST [4]. Therefore, it is meaningful
to compare the relationships between these different types of databases.
The core of the combinatorial graph theory method is the vector segment
of the sequence data. When the segment length reaches a certain level, the data
structure of a long sequence may form a recursive relation. This characteristic
is widely used in the analysis of shift register sequences and codes. In theory,
it can be studied by means of the complexity theory of sequence data, the theory
of Boolean functions, or the de Bruijn–Good graph of the data structures, so many
theories and tools used in the analysis of shift register sequences and codes can
be brought in. However, the purpose of research on biological sequences differs
from that of code analysis: the former aims to find the relationship between
words and the language of biological sequences, while the latter aims to
construct sequences with pseudorandom properties. An in-depth discussion of
combinatorial graph theory can be found in the literature [35]. Combinatorial
graph theory methods can also be used to discuss the complexity, classification,
cutting, and regulation of databases. In this book, we only discuss the use of
core words for the classification and prediction of homologous proteins.
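To make the idea of vector segments and the de Bruijn–Good graph concrete, the following minimal Python sketch splits a sequence into overlapping length-k segments and links each segment's prefix to its suffix. The function names (`segments`, `de_bruijn_edges`, `determines_next`) are hypothetical, and reading the "recursive relation" as "every node has at most one successor" is an assumption made for illustration, not the construction used in [35].

```python
from collections import defaultdict


def segments(seq, k):
    """Return all overlapping length-k segments (windows) of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_edges(seq, k):
    """Map each length-(k-1) node to the set of nodes that follow it in seq.

    Every length-k segment contributes one edge: its prefix -> its suffix.
    """
    graph = defaultdict(set)
    for seg in segments(seq, k):
        graph[seg[:-1]].add(seg[1:])
    return dict(graph)


def determines_next(seq, k):
    """True if every node observed in seq has at most one successor.

    Assumption: this is used here as a stand-in for the 'recursive relation',
    i.e. the previous k-1 symbols fix the next symbol, as in a shift register.
    """
    return all(len(succ) <= 1 for succ in de_bruijn_edges(seq, k).values())
```

For the toy sequence `"AABAB"`, `determines_next("AABAB", 2)` is false because the symbol A is followed by both A and B, while `determines_next("AABAB", 3)` is true: once the segment is long enough, each context has a unique continuation.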
14.2.1 Notation Used in Combinatorial Graph Theory
The mathematical model, definitions, and notation used for protein primary
structures can readily be found in the literature [89], and hence they will
not be repeated in detail. The theory of Boolean functions and that of the
de Bruijn–Good graph can also be found in the literature [35].
Combination Space and Database
Let $V_q = \{1, 2, \ldots, q\}$ be a set of integers, which represents an alphabet for
biological sequences. In the database of protein primary structures, we take
$q = 20$ to denote the 20 commonly occurring amino acids. For the sake of
convenience, in this book we set $q = 23$, and take 21 to be the zero element. Here
$V_q$ is a finite field, in which addition and multiplication are integer operations
modulo 23.
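Because 23 is prime, arithmetic modulo 23 gives a field: every nonzero residue has a multiplicative inverse. The sketch below illustrates this under one assumption: residues are represented as 0, 1, ..., 22, which need not match the labeling of amino acids or of the zero element used in the text. The helper names are hypothetical.

```python
Q = 23  # alphabet size from the text; 23 is prime


def add(a, b):
    """Addition modulo Q."""
    return (a + b) % Q


def mul(a, b):
    """Multiplication modulo Q."""
    return (a * b) % Q


def inverse(a):
    """Multiplicative inverse modulo Q via Fermat's little theorem.

    For prime Q and a not divisible by Q, a**(Q-2) * a == 1 (mod Q).
    """
    if a % Q == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    return pow(a, Q - 2, Q)
```

For example, `inverse(5)` is 14, since 5 × 14 = 70 = 3 × 23 + 1.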
Let $V_q^{(k)}$ be the $k$th ranked product space of $V_q$, whose element
$b^{(k)} = (b_1, b_2, \ldots, b_k) \in V_q^{(k)}$ is the $k$th ranked vector on $V_q$.
$V_q^{(k)}$ is also called the $k$th ranked combination space of $V_q$.
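As a small illustration of the kth ranked combination space, the sketch below enumerates all vectors of length k over the alphabet {1, ..., q} and assigns each one a unique integer index. There are q^k such vectors, for example 20^3 = 8000 when q = 20 and k = 3. The function names and the base-q indexing scheme are assumptions introduced here, not notation from the text.

```python
from itertools import product


def combination_space(q, k):
    """Iterate over every vector (b_1, ..., b_k) with entries in {1, ..., q}."""
    return product(range(1, q + 1), repeat=k)


def vector_index(b, q):
    """Encode a vector over {1, ..., q} as an integer in [0, q**len(b)).

    Each entry is shifted to 0..q-1 and read as one base-q digit,
    with b_1 as the most significant digit.
    """
    idx = 0
    for x in b:
        idx = idx * q + (x - 1)
    return idx
```

Such an index is convenient for counting how often each vector occurs in a database, since it can serve directly as an array position.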
As mentioned in Sect. 14.1.2, $\Omega$ is a database of protein primary structures,
which is composed of $M$ proteins. Here
$$\Omega = \{C_s,\ s = 1, 2, \ldots, M\}, \tag{14.7}$$