3 Alignments and Approximate String Matching 77
3.3 Approximate String Matching with Mismatches
In this section, we are interested in the search for all the occurrences of a
string x of length m in a string y of length n with at most k mismatches
(k ∈ N , k<m≤ n). The Hamming distance between two strings u and v of
same length is the number of mismatches between u and v and is defined by:
Ham(u, v)=card{i | u[i] = v[i],i=0, 1,...,|u|−1}.
Theproblemcanthenbeexpressedasthesearchforallthepositionsj =
0, 1,...,n− m on y that satisfy the inequality Ham(x, y[j..j+ m − 1]) ≤ k.
3.3.1 Search automaton
A natural solution to this problem consists in using an automaton that recog-
nizes the language V
∗
{w | Ham(x, w) ≤ k}. To do this, we can consider the
non-deterministic automaton defined as follows:
• each state is a pair (, i) where is the level of the state and i is its depth,
with 0 ≤ ≤ k, −1 ≤ i ≤ m − 1 and ≤ i +1;
• the initial state is (0, −1);
• the terminal states are of the form (, m − 1) with 0 ≤ ≤ k;
• the transitions are, for 0 ≤ ≤ k, 0 ≤ i<m− 1 and a ∈ V , either of the
form ((0, −1),a,(0, −1)),oroftheform((, i),x[i +1],
(, i + 1)),orofthe
form ((, i),a,( +1,i+ 1)) if a = x[i +1]and 0 ≤ ≤ k − 1.
The automaton possesses k +1levels, each level allowing to recognize the
prefixes of x with mismatches. The transitions of the form ((, i),a,(, i+1))
correspond to the equality of letters while those of the form ((, i),a,(+1,i+
1)) correspond to the inequality of letters. The loop on the initial state allows
to find all the occurrences of the searched factors. During the analysis of the
text with the automaton, if a terminal state (, m−1) is reached, this indicates
thepresenceofanoccurrenceofx with exactly mismatches.
It is clear that the automaton possesses (k +1)× (m +1−
k
2
) states and
that it can be build in time O(k × m). An example is shown in Fig. 3.22.
Unfortunately, the total number of states obtained by determinizing the au-
tomaton is
Θ(min{m
k+1
, (k + 1)!(k +2)
m−k+1
}).
We can check that a direct simulation of the automaton produces a search
algorithm whose execution time is O(m ×n) using the dynamic programming
as in the previous section. Actually by using a method adapted to the problem
we get, in the rest, an algorithm that performs the search in time O(k×n).This
produces a solution of same complexity as the one of algorithm K-diff-diag
that nevertheless solves a more general problem. But the solution that follows
is based on a simple management of lists without using a search algorithm for
common ancestor.