familiar with the scientific notation used to represent real numbers, i.e. numbers with
fractional parts. A number, say 1786.134, can be rewritten as $1.786134 \times 10^{3}$. Here,
the significand is the number 1.786134 that is multiplied by the base 10 raised to a
power 3 that is called the exponent or characteristic. Scientific notation is one method
of representing base-10 floating-point numbers. In this form of notation, only one
digit to the left of the decimal point in the significand is retained, such as
$5.64 \times 10^{-3}$ or $9.8883 \times 10^{67}$, and the magnitude of the exponent
is adjusted accordingly. The
advantage of using the floating-point method as a convention for representing
numbers is that it is concise and standardizable, and that it can represent very
large and very small numbers using a limited, fixed number of bits.
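
As a rough illustration of the normalization described above, the following MATLAB sketch splits a number into a base-10 significand and exponent. The variable names x, e, and m are chosen here purely for illustration, and the use of log10 assumes that x is nonzero and finite.

```matlab
% Minimal sketch: write x in scientific notation as m * 10^e with 1 <= |m| < 10.
% Assumes x is nonzero and finite; x, e and m are illustrative names.
x = 1786.134;
e = floor(log10(abs(x)));   % base-10 exponent
m = x / 10^e;               % significand with one digit before the decimal point
fprintf('%.3f = %.6f x 10^%d\n', x, m, e);
```

For x = 1786.134 this gives e = 3 and m of approximately 1.786134, matching the example in the text.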
Two commonly used standards for storing floating-point numbers are the 32-bit
and 64-bit representations, known as the single-precision format and
double-precision format, respectively. Since MATLAB stores all floating-point
numbers by default using double precision, and since all major programming
languages support the double-precision data type, we will concentrate our efforts
on understanding how computers store numeric data as double-precision floating-
point numbers.
64-bit digital representations of floating-point numbers use 52 bits to store the
significand, 1 bit to store the sign of the number, and another 11 bits for the
exponent. Note that computers store all floating-point numbers in base-2 format
and therefore not only are the significand and exponent stored as binary numbers,
but also the base to which the exponent is raised is 2. If x stands for 1 bit then a 64-bit
floating-point number in the machine’s memory looks like this:
x         xxxxxxxxxxx    xxxxxxx...xxxxxx
sign s    exponent k     significand d
1 bit     11 bits        52 bits
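
One way to inspect this layout is sketched below. It assumes that num2hex returns the 64-bit pattern of a double as 16 hexadecimal characters, most significant first; the variable names (x, h, b, bits, s, k, d) are illustrative only.

```matlab
% Minimal sketch: split the 64-bit pattern of a double into its three fields.
% x, h, b, bits, s, k and d are illustrative names.
x = -2;
h = num2hex(x);                   % 16 hexadecimal characters
b = dec2bin(hex2dec(h(:)), 4);    % one row of 4 bits per hex character
bits = reshape(b.', 1, []);       % 64-character string of 0s and 1s
s = bits(1);                      % sign bit
k = bits(2:12);                   % 11 exponent bits
d = bits(13:64);                  % 52 significand bits
fprintf('s = %s\nk = %s\nd = %s\n', s, k, d);
```

For x = -2 the sign bit is 1, the exponent field is 10000000000, and all 52 significand bits are 0.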
The single bit s that conveys the sign indicates a positive number when s = 0. The
range of the exponent is calculated as $[0, 2^{11} - 1] = [0, 2047]$. In order to
accommodate negative exponents and thereby extend the numeric range to very small
numbers, 1023 is deducted from the binary exponent to give the range $[-1023, 1024]$
for the exponent. The exponent k that is stored in the computer is said to have a bias
(which is 1023), since the stored value is the sum of the true exponent and the number
1023 ($2^{10} - 1$). Therefore if 1040 is the value of the biased exponent that is stored, the
true exponent is actually 1040 – 1023 = 17.
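
The bias arithmetic above can be checked with a short sketch. It reads the 11 exponent bits of a double via typecast and bitshift and then subtracts 1023; the test value $2^{17}$ and the names x, u, k_stored, and k_true are assumptions made for this illustration.

```matlab
% Minimal sketch: read the stored (biased) exponent of a double and remove the bias.
% x, u, k_stored and k_true are illustrative names.
x = 2^17;
u = typecast(x, 'uint64');                                  % raw 64-bit pattern
k_stored = double(bitand(bitshift(u, -52), uint64(2047)));  % keep the 11 exponent bits
k_true = k_stored - 1023;                                   % subtract the bias
fprintf('stored exponent = %d, true exponent = %d\n', k_stored, k_true);
```

Here the stored exponent is 1040 and the true exponent is 17, as in the worked example.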
Overflow and underflow errors
Using 64-bit floating-point number representation, we can determine approximately
the largest and smallest exponent that can be represented in base 10:
* largest e(base 10) = 308 (since $2^{1024} \sim 1.8 \times 10^{308}$),
* smallest e(base 10) = −308 (since $2^{-1023} \sim 1.1 \times 10^{-308}$).
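
These base-10 estimates can be reproduced with logarithms, as in the sketch below. Working with log10 sidesteps the fact that 2^1024 itself cannot be evaluated in double precision; the names e_max, e_min, m_max, and m_min are illustrative.

```matlab
% Minimal sketch: convert powers of 2 to powers of 10 using logarithms,
% since 2^1024 itself cannot be evaluated directly in double precision.
e_max = 1024 * log10(2);              % base-10 exponent of 2^1024
e_min = -1023 * log10(2);             % base-10 exponent of 2^(-1023)
m_max = 10^(e_max - floor(e_max));    % leading factor for 2^1024
m_min = 10^(e_min - floor(e_min));    % leading factor for 2^(-1023)
fprintf('2^1024  ~ %.1f x 10^%d\n', m_max, floor(e_max));
fprintf('2^-1023 ~ %.1f x 10^%d\n', m_min, floor(e_min));
```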
Thus, there is an upper limit on the largest magnitude that can be represented by a
digital machine. The largest number that is recognized by MATLAB can be
obtained by entering the following command in the MATLAB Command Window:
>> realmax
1.2 Representation of floating-point numbers