
156 C Closest Substring
A more General Problem
The distinguishing substring selection problem has as input
two sets of strings, B and G. It is required to find a sub-
string of unspecified length (denoted by L)suchthatitis,
informally, close to a substring of every string in B and far
away from every length L substring of strings in
G.How-
ever, we can go through all the possible L and we may as-
sume that every string in
G has the same length L since G
can be reconstructed to contain all substrings of length L
in each of the good strings.
The problem is formally defined as follows: Given a set
B = fs
1
; s
2
;:::;s
n
1
gof n
1
(bad) strings of length at least L,
and a set
G = fg
1
; g
2
;:::g
n
2
gof n
2
(good) strings of length
exactly L,aswellastwointegersd
b
and d
g
(d
b
d
g
), the
distinguishing substring selection problem (DSSP)isto
find a string s such that for each string s
i
2 B there ex-
ists a length-L substring t
i
of s
i
with d(s; t
i
) d
b
and for
any string g
i
2 G, d(s; g
i
) d
g
.Hered(; )representsthe
Hamming distance between two strings. If all strings in
B
are also of the same length L, the problem is called the dis-
tinguishing string problem (DSP).
The distinguishing string problem was first proposed
in [8] for generic drug target design. The following results
are from [3].
Theorem 3 There is a polynomial time approximation
scheme for the distinguishing substring selection problem.
That is, for any constant >0, the algorithm finds a string s
of length L such that for every s
i
2 B, there is a length-L
substring t
i
of s
i
with d(t
i
; s) (1 + )d
b
and for every sub-
string u
i
of length L of every g
i
2 G,d(u
i
; s) (1 )d
g
,if
a solution to the original pair (d
b
d
g
) exists. Since there
are a polynomial number of such pairs (d
b
; d
g
),wecanex-
haust all the possibilities in polynomial time to find a good
approximation required by the corresponding application
problems.
Open Problems
The PTAS’s designed here use linear programming and
randomized rounding technique to solve some cases for
the problem. Thus, the running time complexity of the al-
gorithms for both the closest string and closest substring is
very high. An interesting open problem is to design more
efficient PTAS’s for both problems.
Cross References
Closest Substring
Efficient Methods for Multiple Sequence Alignment
with Guaranteed Error Bounds
Engineering Algorithms for Computational Biology
Multiplex PCR for Gap Closing (Whole-genome
Assembly)
Recommended Reading
1. Ben-Dor, A., Lancia, G., Perone, J., Ravi, R.: Banishing bias from
consensus sequences. In: Proc. 8th Ann. Combinatorial Pattern
Matching Conf., pp. 247–261. (1997)
2. Deng, X., Li, G., Li, Z., Ma, B., Wang, L.: Genetic Design of
Drugs Without Side-Effects. SIAM. J. Comput. 32(4), 1073–1090
(2003)
3. Dopazo, J., Rodríguez, A., Sáiz, J.C., Sobrino, F.: Design of
primers for PCR amplification of highly variable genomes.
CABIOS 9, 123–125 (1993)
4. Frances, M., Litman, A.: On covering problems of codes. Theor.
Comput. Syst. 30, 113–119 (1997)
5. G ˛asieniec, L., Jansson, J., Lingas, A.: Efficient approximation al-
gorithms for the hamming center problem. In: Proc. 10th ACM-
SIAM Symp. on Discrete Algorithms., pp. 135–S906. (1999)
6. Hertz, G., Stormo, G.: Identification ofconsensus patterns in un-
aligned DNA and protein sequences: a large-deviation statisti-
cal basis for penalizing gaps. In: Proc. 3rd Int’l Conf. Bioinfor-
matics and Genome Research, pp. 201–216. (1995)
7. Lanctot, K., Li, M., Ma, B., Wang, S., Zhang, L.: Distinguishing
string selection problems. In: Proc. 10th ACM-SIAM Symp. on
Discrete Algorithms, pp. 633–642. (1999)
8. Lawrence, C., Reilly, A.: An expectation maximization (EM) al-
gorithm for the identification and characterization of common
sites in unaligned biopolymer sequences. Proteins 7, 41–51
(1990)
9. Li, M., Ma, B., Wang, L.: On the closest string and substring
problems. J. ACM 49(2), 157–171 (2002)
10. Li,M.,Ma,B.,Wang,L.:Findingsimilarregionsinmanyse-
quences. J. Comput. Syst. Sci. (1999)
11. Li, M., Ma, B., Wang, L.: Finding similar regions in many strings.
In: Proceedings of the Thirty-first Annual ACM Symposium on
Theory of Computing, pp. 473–482. Atlanta (1999)
12. Ma, B.: A polynomial time approximation scheme for the clos-
est substring problem. In: Proc. 11th Annual Symposium on
Combinatorial Pattern Matching, Montreal, pp. 99–107. (2000)
13. Stormo, G.: Consensus patterns in DNA. In: Doolittle, R.F. (ed.)
Molecular evolution: computer analysis of protein and nucleic
acid sequences. Methods in Enzymology, vol. 183, pp. 211–221
(1990)
14. Stormo, G., Hartzell III, G.W.: Identifying protein-binding sites
from unaligned DNA fragments. Proc. Natl. Acad. Sci. USA. 88,
5699–5703 (1991)
Closest Substring
2005; Marx
JENS GRAMM
WSI Institute of Theoretical Computer Science,
Tübingen University, Tübingen, Germany
Keywords and Synonyms
Common approximate substring