62 Chapter 3: Sending and Receiving Messages
forms of information because humans can easily deal with it when printed or displayed;
numbers, for example, can be represented as strings of decimal digits.
To send text, the string of characters is translated into a sequence of bytes according
to a character set. The canonical example of a character encoding system is the venerable
American Standard Code for Information Interchange (ASCII), which defines a one-to-one
mapping between a set of the most commonly used printable characters in English and
binary values. For example, in ASCII the digit 0 is represented by the byte value 48, 1 by
49, and so on up to 9, which is represented by the byte value 57. ASCII is adequate for
applications that only need to exchange English text. As the economy becomes increasingly
globalized, however, applications need to deal with other languages, including many that
use characters for which ASCII has no encoding, and even some (e.g., Chinese) that use
more than 256 characters and thus require more than 1 byte per character to encode.
Encodings for the world’s languages are defined by companies and by standards bodies.
Unicode is the most widely recognized such character encoding; it is maintained by the
Unicode Consortium and kept synchronized with the ISO/IEC 10646 standard of the
International Organization for Standardization (ISO).
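The ASCII mapping described above can be checked directly. The following is a minimal sketch, not a definitive implementation; the class name AsciiDemo and the helper ToAscii are hypothetical names introduced only for illustration. It prints the byte values of the ten digits and shows what the standard ASCII encoder does with a character that ASCII cannot represent.

```csharp
using System;
using System.Text;

public static class AsciiDemo
{
    // Hypothetical helper: returns the ASCII byte values for the given text.
    public static byte[] ToAscii(string text)
    {
        return Encoding.ASCII.GetBytes(text);
    }

    public static void Main()
    {
        // The digits '0' through '9' map to the byte values 48 through 57.
        Console.WriteLine(string.Join(" ", ToAscii("0123456789")));

        // U+00E9 (e with acute accent) has no ASCII code, so the encoder
        // substitutes a question mark, byte value 63.
        Console.WriteLine(string.Join(" ", ToAscii("\u00E9")));
    }
}
```

The silent substitution in the second call is one reason ASCII is suitable only when the text is known to be plain English.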
Fortunately, the .NET framework provides good support for internationalization.
.NET provides classes that can be used to encode text into ASCII, Unicode, or several
variants of Unicode (UTF-7 and UTF-8). The encoding that .NET calls Unicode (UTF-16)
represents most characters with a single 16-bit (2-byte) code and thus supports a much
larger set of characters than ASCII. In fact,
the Unicode standard currently defines codes for over 49,000 characters and covers “the
principal written languages and symbol systems of the world” [23]. .NET supports a number
of additional encodings as well, and provides a clean separation between its internal
representation and the encoding used when characters are input or output. The default
encoding depends on the runtime and on regional operating system settings: on the
.NET Framework it is the system's ANSI code page, while on .NET Core and later it is
UTF-8, which supports the entire Unicode character set. (UTF-8, also known as UCS
Transformation Format 8-bit form, encodes ASCII characters in 8 bits to save
space, using 2 to 4 bytes per character only when necessary.) The default encoding is
referenced via System.Text.Encoding.Default.
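To make the space trade-off concrete, the sketch below compares how many bytes UTF-8 and the 16-bit Unicode encoding need for three sample characters. It is illustrative only; the class name EncodingSizes and the helper Size are hypothetical, and the sample characters are arbitrary choices.

```csharp
using System;
using System.Text;

public static class EncodingSizes
{
    // Hypothetical helper: number of bytes needed to encode text
    // in the given encoding.
    public static int Size(Encoding encoding, string text)
    {
        return encoding.GetByteCount(text);
    }

    public static void Main()
    {
        // Latin capital A, e with acute accent, and a Chinese character.
        string[] samples = { "A", "\u00E9", "\u4E2D" };

        foreach (string s in samples)
        {
            Console.WriteLine("U+{0:X4}: UTF-8 = {1} byte(s), UTF-16 = {2} byte(s)",
                (int)s[0], Size(Encoding.UTF8, s), Size(Encoding.Unicode, s));
        }

        // The name printed here depends on the runtime and system settings.
        Console.WriteLine("Default encoding: " + Encoding.Default.EncodingName);
    }
}
```

For these samples UTF-8 needs 1, 2, and 3 bytes respectively, while the 16-bit encoding needs 2 bytes for each.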
The System.Text encoding classes provide several mechanisms for converting
between different character sets. The ASCIIEncoding, UnicodeEncoding, UTF7Encoding,
and UTF8Encoding classes all provide GetBytes() and GetString() methods to convert
from String to byte array or vice versa in the specified encoding. The Encoding class also
exposes static properties (such as Encoding.ASCII and Encoding.Unicode) that return
ready-made instances offering the
same methods. The GetBytes() method returns the sequence of bytes that represents the
given string in the encoding of the class used. Similarly, the GetString() method of an
encoding class takes a byte array and returns a String instance containing the sequence of
characters represented by the byte sequence according to the invoked encoding class.
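The GetBytes() and GetString() pair can be sketched as a simple round trip. This is an illustration under stated assumptions rather than a prescribed pattern; the class name RoundTrip and the helper EncodeThenDecode are hypothetical.

```csharp
using System;
using System.Text;

public static class RoundTrip
{
    // Hypothetical helper: encodes text to bytes, then decodes the bytes
    // again with the same encoding.
    public static string EncodeThenDecode(Encoding encoding, string text)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        string text = "Hi";

        // An explicitly constructed encoding object and a static property
        // of Encoding expose the same GetBytes()/GetString() methods.
        Encoding ascii = new ASCIIEncoding();
        Encoding utf16 = Encoding.Unicode;

        Console.WriteLine(string.Join(" ", ascii.GetBytes(text)));  // 72 105
        Console.WriteLine(string.Join(" ", utf16.GetBytes(text)));  // 72 0 105 0

        // Decoding with the encoding that produced the bytes restores the text.
        Console.WriteLine(EncodeThenDecode(ascii, text));
        Console.WriteLine(EncodeThenDecode(utf16, text));
    }
}
```

The same two characters yield different byte sequences under each encoding, so the receiver must decode with the encoding the sender used.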
Suppose the value of item.itemNumber is 123456. Using ASCII, that part of the string
representation of item produced by ToString() would be encoded as
105  116  101  109   35   61   49   50   51   52   53   54
'i'  't'  'e'  'm'  '#'  '='  '1'  '2'  '3'  '4'  '5'  '6'
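The byte sequence above can be reproduced with a few lines of code. This sketch assumes only that the fragment has the form "item#=" followed by the decimal item number; the class name ItemNumberBytes and the method Encode are hypothetical and are not the item class's own ToString() implementation.

```csharp
using System;
using System.Text;

public static class ItemNumberBytes
{
    // Hypothetical helper: builds the "item#=<number>" fragment and
    // encodes it as ASCII bytes.
    public static byte[] Encode(long itemNumber)
    {
        string fragment = "item#=" + itemNumber;
        return Encoding.ASCII.GetBytes(fragment);
    }

    public static void Main()
    {
        // Prints: 105 116 101 109 35 61 49 50 51 52 53 54
        Console.WriteLine(string.Join(" ", Encode(123456)));
    }
}
```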