
942 T Tail Bounds for Occupancy Problems
alone. They extended the data sets used to include other
tables from the telecom domain as well as biological data.
Vo and Vo [14,15] showed further 10 to 35% improve-
ment over these combinatorial dependency methods on
the same data sets.
Data Sets
Some of the data sets used for experimentation are pub-
lic [3].
URL to Code
The pzip package, based on combinatorial dependency, is
available at http://www.research.att.com/~gsf/pzip/pzip.
html. The Vcodex package, related to invertible trans-
forms, is available at http://www.research.att.com/~gsf/
download/ref/vcodex/vcodex.html. Although for the time
being Vcodex does not include procedures to compress
tabular data, it is a useful toolkit for their development.
Cross References
Binary Decision Graph
Burrows–Wheeler Transform
Dictionary-Based Data Compression
Succinct Data Structures for Parentheses Matching
Tree Compression and Indexing
Recommended Reading
1. Blum,A.,Li,M.,Tromp,J.,Yannakakis,M.:Linearapproximation
of shortest superstrings. J. ACM 41, 630–47 (1994)
2. Buchsbaum, A.L., Caldwell, D.F., Church, K.W., Fowler, G.S.,
Muthukrishnan, S.: Engineering the compression of massive
tables: An experimental approach. In: Proc. 11th ACM-SIAM
Symp. on Discrete Algorithms, 2000, pp. 175–84
3. Buchsbaum, A.L., Fowler, G.S., Giancarlo, R.: Improving table
compression with combinatorial optimization. J. ACM 50, 825–
851 (2003)
4. Burrows, M., Wheeler, D.: A block sorting lossless data com-
pression algorithm. Technical Report 124, Digital Equipment
Corporation (1994)
5. Cilibrasi, R., Vitanyi, P.M.B.: Clustering by compression. IEEE
Trans. Inf. Theory 51, 1523–1545 (2005)
6. Cormack, G.: Data compression in a data base system. Com-
mun. ACM 28, 1336–1350 (1985)
7. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wi-
ley Interscience, New York, USA (1990)
8. Ferragina,P.,Giancarlo,R.,Manzini,G.,Sciortino,M.:Boosting
textual compression in optimal linear time. J. ACM 52, 688–713
(2005)
9. Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Struc-
turing Labeled Trees for Optimal Succinctness, and beyond. In:
Proc. 45th Annual IEEE Symposium on Foundations of Com-
puter Science, 2005, pp. 198–207
10. Giancarlo, R., Sciortino, M., Restivo, A.: From first principles to
the Burrows and Wheeler transform and beyond, via combina-
torial optimization. Theor. Comput. Sci. (2007)
11. Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P.M.B.: The similarity met-
ric. IEEE Trans. Inf. Theory 50, 3250–3264 (2004)
12. Liefke, H., Suciu, D.: XMILL: An efficient compressor for XML
data. In: Proceedings of the 2000 ACM SIGMOD Int. Conf.
on Management of Data, pp. 153–164. ACM, New York, USA
(2000)
13. Lifshits, Y., Mozes, S., Weimann, O., Ziv-Ukelson, M.: Speeding
up HMM decoding and training by exploiting sequence repeti-
tions. Algorithmica to appear doi:10.1007/s00453-007-9128-0
14. Manzini, G.: An analysis of the Burrows–Wheeler transform.
J. ACM 48, 407–430 (2001)
15. Vo, B.D., Vo, K.-P.: Compressing table data with column depen-
dency. Theor. Comput. Sci. 387, 273–283 (2007)
16. Vo, B.D., Vo, K.-P.: Using column dependency to compress ta-
bles. In: DCC: Data Compression Conference, pp. 92–101. IEEE
Computer Society TCC, Washington DC, USA (2004)
17. Vo., K.-P.: Compression as data transformation. In: DCC: Data
Compression Conference. IEEE Computer Society TCC, pp. 403.
Washington DCD, USA (2006)
18. Ziv, J., Lempel, A.: A universal algorithm for sequential data
compression. IEEE Trans. Inf. Theory 23, 337–343 (1977)
19. Ziv, J., Lempel, A.: Compression of individual sequences via
variable length coding. IEEE Trans. Inf. Theory 24, 530–536
(1978)
Tail Bounds for Occupancy Problems
1995; Kamath, Motwani, Palem, Spirakis
PAUL SPIRAKIS
Computer Engineering and Informatics, Research
and Academic Computer Technology Institute,
Patras University, Patras, Greece
Keywords and Synonyms
Balls and bins
Problem Definition
Consider a random allocation of m balls to n bins where
each ball is placed in a bin chosen uniformly and indepen-
dently. The properties of the resulting distribution of balls
among bins have been the subject of intensive study in
the probability and statistics literature [3,4]. In computer
science, this process arises naturally in randomized algo-
rithms and probabilistic analysis. Of particular interest is
the occupancy problem where the random variable under
consideration is the number of empty bins.
In this entry a series of bounds are presented (reminis-
cent of the Chernoff bound for binomial distributions) on
the tail of the distribution of the number of empty bins; the
tail bounds are successively tighter, but each new bound