data:image/s3,"s3://crabby-images/47fbf/47fbfc81a5231842d8feb7425309def54a19b9b4" alt=""
Compressed Pattern Matching C 173
the amount of extra space is proportional to the input size
of P.
Theorem 6 (Amir et al. [4]) There exists an inplace
CPM algorithm for a two-dimensional run-length encoding
scheme which runs in O(jc(T)j + jPjlog ) time using ex-
tra O(c(P)) space, where is the minimum of jPj and the
alphabet size.
Many variants of the CPM problem exist. In what follows,
some of them are briefly sketched. Fully-compressed pat-
tern matching (FCPM) is the complicated version where
both T and P are given in a compressed format. A straight-
line program is a regular collage system with j
Sj =1.
Theorem 7 (Miyazaki et al. [13]) The FCPM problem for
straight-line programs is solved in O(jc(T)j
2
jc(P)j
2
) time
using O(jc(T)jjc(P)j) space.
Approximate compressed pattern matching (ACPM) refers
to the case where errors are allowed.
Theorem 8 (Kärkkäinen et al. [8]) Under the Levenshtein
distance model, the ACPM problem can be solved in O(k
jPjjc(T)j + occ) time for LZ78/LZW, and in O(jPj(k
2
j
Dj+ k jSj)+occ) time for regular collage systems, where k
is the given error threshold.
Theorem 9 (Makinen et al. [11]) Under a weighted edit
distance model, the ACPM problem for run-length encoding
can be solved in O(jPjjc(P)jjc(T)j) time.
Regular expression compressed pattern matching (RCPM)
refers to the case where P can be a regular expression.
Theorem 10 (Navarro [14]) The RCPM problem can
be solved in O(2
jPj
+ jPjjc(T)j + occ jPjlog jPj) time,
where occ is the number of occurrences of P in T.
Applications
CPM techniques enable us to search directly in com-
pressed text databases. One interesting application is
searching over compressed text databases on handheld de-
vices, such as PDAs, in which memory, storage, and CPU
power are limited.
Experimental Resul t s
One important goal of the CPM problem is to per-
form a CPM task faster than a decompression followed
by a simple search. Kida et al. [10] showed experimen-
tally that their algorithms achieve the goal. Navarro and
Tarhio [15] presented BM type algorithms for LZ78/LZW
compression schemes, and showed they are twice as fast
as a decompression followed by a search using the best
algorithms. (The code is available at: www.dcc.uchile.cl/
gnavarro/software.)
Another challenging goal is to perform a CPM task
faster than a simple search over original files in the uncom-
pressed format. The goal is achieved by Manber [12](with
his own compression scheme), and by Shibata et al. [17]
(with BPE). Their search time reduction ratios are nearly
the same as their compression ratios. Unfortunately the
compression ratios are not very high. Moura et al. [5]
achieved the goal by using a bytewise Huffman code on
words. The compression ratio is relatively high, but only
searching for whole words and phrases is allowed.
Cross References
Multidimensional compressed pattern matching is the
complex version of CPM where the text and the pattern are
multidimensional strings in a compressed format. Se-
quential exact string matching, sequential approximate
string matching, regular expression matching,respec-
tively, refer to the simplified versions of CPM, ACPM,
RCPM where the text and the pattern are given as uncom-
pressed strings.
Recommended Reading
1. Amir, A., Benson, G.: Efficient two-dimensional compressed
matching. In: Proc. Data Compression Conference’92 (DCC’92),
pp. 279 (1992)
2. Amir, A., Benson, G., Farach, M.: Let sleeping files lie: Pattern
matching in Z-compressed files. J. Comput. Syst. Sci. 52(2),
299–307 (1996)
3. Amir, A., Benson, G., Farach, M.: Optimal two-dimensional com-
pressed matching. J. Algorithms 24(2), 354–379 (1997)
4. Amir, A., Landau, G.M., Sokol, D.: Inplace run-length 2d com-
pressed search. Theor. Comput. Sci. 290(3), 1361–1383 (2003)
5. de Moura, E., Navarro, G., Ziviani, N., Baeza-Yates, R.: Fast and
flexible word searching on compressed text. ACM Trans. Inf.
Syst. 18(2), 113–139 (2000)
6. Farach, M., Thorup, M.: String-matching in Lempel–Ziv com-
pressed strings. Algorithmica 20(4), 388–404 (1998)
7. G ˛asieniec, L., Karpinski, M., Plandowski, W., Rytter, W.: Effi-
cient algorithms for Lempel–Ziv encoding. In: Proc. 5th Scan-
dinavian Workshop on Algorithm Theory (SWAT’96). LNCS,
vol. 1097, pp. 392–403 (1996)
8. Kärkkäinen, J., Navarro, G., Ukkonen, E.: Approximate string
matching on Ziv–Lempel compressed text. J. Discret. Algo-
rithms 1(3–4), 313–338 (2003)
9. Kida, T., Matsumoto, T., Shibata, Y., Takeda, M., Shinohara, A.,
Arikawa, S.: Collage systems: a unifying framework for com-
pressed pattern matching. Theor. Comput. Sci. 298(1), 253–
272 (2003)
10. Kida, T., Takeda, M., Shinohara, A., Miyazaki, M., Arikawa, S.:
Multiple pattern matching in LZW compressed text. J. Discret.
Algorithms 1(1), 133–158 (2000)