Case Studies with Exercises by Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau ■ 419
the file, and each time we perform a single write, we must update the on-disk
protection information. Assuming that we perform 10,000 random writes,
how much I/O traffic (both in total volume of bytes and as a percentage of
total traffic) does our scheme generate?
c. [10] <6.2, 6.4> Now assume that the data protection information is always
kept in a separate portion of the disk, away from the file it is guarding (that is,
assume for each file A, there is another file A_checksums that holds all the
checksums for A). Hence, one potential overhead we must incur arises upon
reads—that is, upon each read, we will use the checksum to detect data cor-
ruption.
Assume you read 10,000 blocks of 4 KB each sequentially from disk. Assum-
ing a 4 ms average seek cost and a 100 MB/sec transfer rate (like the SCSI
disk in Figure 6.3), how long will it take to read the file (and corresponding
checksums) from disk? What is the time penalty due to adding checksums?
d. [10] <6.2, 6.4> Again assuming that the data protection information is kept
separate as in part (c), now assume you have to read 10,000 random blocks of
4 KB each from a very large file (much bigger than 10,000 blocks, that is).
For each read, you must again use the checksum to ensure data integrity. How
long will it take to read the 10,000 blocks from disk, again assuming the same
disk characteristics? What is the time penalty due to adding checksums?
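The timing questions in parts (c) and (d) can be set up as a small cost model. The sketch below is one possible setup, not the reference solution. The block count, block size, seek time, and transfer rate come from the exercise. Everything else is an assumption made for illustration: a 16-byte checksum per block, 1 MB taken as 10^6 bytes, no rotational latency, one seek per contiguous sequential run, and, for random reads, an uncached 4 KB checksum block fetched with its own seek on every read.

```python
BLOCK_BYTES = 4 * 1024   # 4 KB per block (from the exercise)
NUM_BLOCKS = 10_000      # blocks read (from the exercise)
SEEK_S = 4e-3            # 4 ms average seek (from the exercise)
RATE_BPS = 100e6         # 100 MB/sec, ASSUMING 1 MB = 10**6 bytes
CHECKSUM_BYTES = 16      # ASSUMPTION: one 16-byte checksum per block


def transfer_s(nbytes):
    """Time in seconds to transfer nbytes at the sustained rate."""
    return nbytes / RATE_BPS


def sequential_read_s(with_checksums):
    """Part (c): one seek to the file, then one contiguous transfer."""
    total = SEEK_S + transfer_s(NUM_BLOCKS * BLOCK_BYTES)
    if with_checksums:
        # ASSUMPTION: the checksum file is contiguous, so it costs
        # one extra seek plus the transfer of all checksums.
        total += SEEK_S + transfer_s(NUM_BLOCKS * CHECKSUM_BYTES)
    return total


def random_read_s(with_checksums):
    """Part (d): every block read pays its own seek."""
    per_block = SEEK_S + transfer_s(BLOCK_BYTES)
    if with_checksums:
        # ASSUMPTION: each checksum lookup seeks to the checksum file
        # and reads a whole 4 KB block of it, with no caching.
        per_block += SEEK_S + transfer_s(BLOCK_BYTES)
    return NUM_BLOCKS * per_block


if __name__ == "__main__":
    for name, fn in (("sequential", sequential_read_s),
                     ("random", random_read_s)):
        base, guarded = fn(False), fn(True)
        print(f"{name}: {base:.4f} s without checksums, "
              f"{guarded:.4f} s with, penalty {guarded - base:.4f} s")
```

Under these assumptions the sequential penalty is only a few milliseconds on a read of roughly 0.41 s, while the random case roughly doubles, since every checksum lookup adds a second seek. A different checksum size or a checksum cache would change the numbers but not the structure of the calculation.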
6.27 [40] <6.2, 6.3, 6.4> Finally, we put theory into practice by developing a user-
level tool to guard against file corruption. Assume you are to write a simple set of
tools to detect and repair data integrity problems. The first tool is used to build
checksums and parity. It should be called build and used like this:
build <filename>
The build program should then store the needed checksum and redundancy
information for the file filename in a file in the same directory called
.filename.cp (so it is easy to find later).
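As a starting point, the build step might look like the following minimal sketch. The exercise text here does not fix a block size or an on-disk layout for the .cp file, so the 4096-byte block size, the JSON layout, and the helper names read_blocks, xor_parity, and cp_path are all assumptions made for illustration. The redundancy used is a single XOR parity block over all data blocks, which is enough to rebuild one lost block.

```python
import hashlib
import json
import os

BLOCK_SIZE = 4096  # ASSUMPTION: block size is not given in this passage


def read_blocks(path):
    """Split the file into BLOCK_SIZE chunks (the last may be shorter)."""
    with open(path, "rb") as f:
        data = f.read()
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def xor_parity(blocks):
    """XOR all blocks together; a short last block acts as zero-padded."""
    parity = bytearray(BLOCK_SIZE)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def cp_path(path):
    """Name of the hidden .cp file stored next to the guarded file."""
    directory, name = os.path.split(path)
    return os.path.join(directory, "." + name + ".cp")


def build(path):
    """Write per-block MD5 checksums and one parity block to the .cp file."""
    blocks = read_blocks(path)
    info = {
        # Hypothetical layout: hex digests in block order, parity as hex.
        "md5": [hashlib.md5(b).hexdigest() for b in blocks],
        "parity": xor_parity(blocks).hex(),
    }
    with open(cp_path(path), "w") as f:
        json.dump(info, f)
    return info
```

Calling build("notes.txt") would then create .notes.txt.cp in the same directory. A compact binary layout would work equally well; JSON is used here only to keep the sketch readable.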
A second program is then used to check and potentially repair damaged files. It
should be called repair and used like this:
repair <filename>
The repair program should consult the .cp file for the filename in question and
verify that all the stored checksums match the computed checksums for the data.
If the checksums don’t match for a single block, repair should use the redun-
dant information to reconstruct the correct data and fix the file. However, if two
or more blocks are bad, repair should simply report that the file has been cor-
rupted beyond repair. To test your system, we will provide a tool to corrupt files
called corrupt. It works as follows:
corrupt <filename> <blocknumber>
All corrupt does is fill the specified block number of the file with random noise.
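The detect-and-repair logic described above can be sketched on in-memory blocks as follows. This is an illustration under stated assumptions rather than a prescribed design: it assumes the redundancy is a single XOR parity block computed over all data blocks (with a short last block treated as zero-padded), and the names xor_blocks and repair_blocks are hypothetical.

```python
import hashlib


def xor_blocks(blocks, size):
    """XOR blocks byte by byte into a buffer of the given size."""
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def repair_blocks(blocks, digests, parity):
    """Return a repaired copy of blocks.

    digests holds the stored MD5 hex digest of each block and parity is
    the stored XOR of all blocks (ASSUMED redundancy scheme). Raises
    ValueError when the damage cannot be undone.
    """
    blocks = list(blocks)
    bad = [i for i, block in enumerate(blocks)
           if hashlib.md5(block).hexdigest() != digests[i]]
    if not bad:
        return blocks                      # nothing to fix
    if len(bad) > 1:
        raise ValueError("file corrupted beyond repair")
    i = bad[0]
    # XOR of the parity with every good block yields the missing block.
    others = blocks[:i] + blocks[i + 1:]
    rebuilt = xor_blocks(others + [parity], len(parity))[:len(blocks[i])]
    if hashlib.md5(rebuilt).hexdigest() != digests[i]:
        # The parity itself may have been damaged.
        raise ValueError("file corrupted beyond repair")
    blocks[i] = rebuilt
    return blocks
```

A full repair program would load the blocks from the file and the digests and parity from the .cp file, call a routine like this one, and write the repaired blocks back. Rechecking the rebuilt block against its stored checksum guards against the case where the redundancy information is what was damaged.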
For checksums you will be using MD5. MD5 takes an input string and gives you