182 Martin Dietzfelbinger
all about 18,000 pages and about 50 million characters. If Alice needs only
five minutes to read one page, and they work on it without any breaks, the
two of them and the phone tariff unit counter will be busy for more than
60 days.
In a computer, we may represent a character, including commas, spaces,
and line and page breaks, by a binary code, e.g., with eight bits (one byte) per
character. For the whole encyclopedia this will amount to 50 million bytes,
or about 50 Megabytes. Let us assume that Alice and Bob have individually
managed to enter the text of their respective copies into a computer. As soon
as the encyclopedia is stored electronically, it is actually no big deal to send
this amount of data from Australia to England by e-mail. For the sake of
argument, let us assume however that the connection is either very expensive
or very error-prone, so that the transmission of such a large amount of data
via e-mail is impossible or undesirable.
Is there a way for Alice and Bob to find out whether the two texts are
identical without comparing them character by character? They would prefer
to carry out any conversation needed by cellphone, and hence keep the amount
of information that has to be exchanged very small.
If you sometimes send large files via e-mail or if you ever found it necessary
to make room on your brimful hard disk, you know that there are methods
that do something called “data compression.” By such methods, one “squeezes
together” data such as texts or images, so that they take up less space and
transmission times are shorter. Alice and Bob could use such methods. But
even if they manage to achieve a compression to, say, a fifth of the original
length of their data, it would still take too long to carry out the communica-
tion. So, data compression does not solve the problem.
Here is a very simple observation: Alice should start by counting the char-
acters in her encyclopedia. Let us denote the result by n. Alice tells Bob what
n is, which means reading out eight decimal digits, which is a matter of sec-
onds. Bob also has counted the characters in his copy, with a result of n
.
If n and n
are different, Bob can announce that the texts are different. (We
assume that Alice and Bob haven’t miscounted.) From now on we may assume
the texts of Alice and Bob have exactly the same length.
Texts as Sequences of Numbers and Modular Arithmetic
Our long-term goal is to find a trick that makes it possible to solve the text
comparison problem with very short messages. For this, we want to translate
texts into numbers and then calculate with these numbers. A little basic work
is needed for this. We already noted that in a computer each character is
represented by a sequence of eight bits, that is, a byte. A standard way of
doing this is given by the ASCII code. In this code, the characters A, B, C, ...
look like this: 01000001, 01000010, 01000011, etc. These bit patterns can also