King M.R., Mody N.A. Numerical and Statistical Methods for Bioengineering: Applications in MATLAB

As per the rules stated above, the following six numbers are rounded to retain

only ﬁve digits:

0.345903 → 0.34590,

13.85748 → 13.857,

7983.9394 → 7983.9,

5.20495478 → 5.2050,

8.94855 → 8.9486,

9.48465 → 9.4846.

The rounding method is used by computers to store floating-point numbers. Every
machine number is precise to within $5 \times 10^{-(s+1)}$ units of the original real number,
where s is the number of significant figures.

Using MATLAB

A MATLAB function called round is used to round numbers to their nearest

integer. The number produced by round does not have a fractional part. This

function rounds down numbers with fractional parts less than 0.5 and rounds up

for fractional parts equal to or greater than 0.5. Try using round on the numbers

1.1, 1.5, 2.50, 2.49, and 2.51. Note that the MATLAB function round, when called
with a single argument, does not round numbers to s significant digits as discussed
in the definition above for rounding.
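The behavior described above can be checked with a short script. This is a minimal sketch: the variable names are illustrative, and the three-argument call round(x, s, 'significant') is assumed to be available (it was added in MATLAB R2014b).

```matlab
% Round to the nearest integer; a fractional part of exactly 0.5 rounds
% away from zero.
x = [1.1 1.5 2.50 2.49 2.51];
r = round(x);                          % r = [1 2 3 2 3]
disp(r)

% Round to s = 5 significant digits with the three-argument form.
y  = [0.345903 13.85748 7983.9394 5.20495478];
y5 = round(y, 5, 'significant');
fprintf('%.5g\n', y5)
```

The display format only controls how many digits are printed, so fprintf is used here to show the rounded values explicitly.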

1.2.1 How computers store numbers

Computers store all data and instructions in binary coded format or the base-2

number system. The machine language used to instruct a computer to execute

various commands and manipulate operands is written in binary format – a number

system that contains only two digits: 0 and 1. Every binary number is thus con-

structed from these two digits only, as opposed to the base-10 number system that

uses ten digits to represent numbers. The differences between these two number

systems can be further understood by studying Table 1.1.

Each digit in the binary system is called a bit (binary digit). Because a bit can take

on two values, either 0 or 1, each value can represent one of two physical states – on

or off, i.e. the presence or absence of an electrical pulse, or the ability of a transistor

to switch between the on and off states. Binary code is thus found to be a convenient

method of encoding instructions and data since it has obvious physical signiﬁcance

and can be easily understood by the operations performed by a computer.

The range of the magnitude of numbers, as well as the numeric precision that a

computer can work with, depends on the number of bits allotted for representing

numbers. Programming languages, such as Fortran and C, and mathematical soft-

ware packages such as MATLAB allow users to work in both single and double

precision.

1.2.2 Binary to decimal system

It is important to be familiar with the methods for converting numbers from one

base to another in order to understand the inherent limitations of computers in

working with real numbers in our base-10 system. Once you are well-versed in the

1.2 Representation of floating-point numbers

ways in which round-off errors can arise in different situations, you will be able to

devise suitable algorithms that are more likely to minimize round-off errors. First,

let’s consider the method of converting binary numbers to the decimal system. The

decimal number 111 can be expanded to read as follows:

\[ 111 = 1 \times 10^2 + 1 \times 10^1 + 1 \times 10^0 = 100 + 10 + 1 = 111. \]

Thus, the position of a digit in any number speciﬁes the magnitude of that particular

digit as indicated by the power of 10 that multiplies it in the expression above. The

ﬁrst digit of this base-10 integer from the right is a multiple of 1, the second digit is a

multiple of 10, and so on. On the other hand, if 111 is a binary number, then the same

number is now equal to

\[ 111 = 1 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 = 4 + 2 + 1 = 7 \text{ in base 10}. \]

The decimal equivalent of 111 is also provided in Table 1.1. In the binary system, the

position of a binary digit in a binary number indicates to which power the multiplier, 2, is

raised. Note that the largest decimal value of a binary number comprising n bits is equal to
$2^n - 1$. For example, the binary number 11111 has 5 bits and is the binary equivalent of

\[ 11111 = 1 \times 2^4 + 1 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 = 16 + 8 + 4 + 2 + 1 = 31 = 2^5 - 1. \]

Table 1.1. Equivalence of numbers in the decimal (base-10) and binary (base-2) systems

Decimal system   Binary system   Conversion of binary number
(base 10)        (base 2)        to decimal number
0                0               0 × 2^0
1                1               1 × 2^0
2                10              1 × 2^1 + 0 × 2^0
3                11              1 × 2^1 + 1 × 2^0
4                100             1 × 2^2 + 0 × 2^1 + 0 × 2^0
5                101             1 × 2^2 + 0 × 2^1 + 1 × 2^0
6                110             1 × 2^2 + 1 × 2^1 + 0 × 2^0
7                111             1 × 2^2 + 1 × 2^1 + 1 × 2^0
8                1000            1 × 2^3 + 0 × 2^2 + 0 × 2^1 + 0 × 2^0
9                1001            1 × 2^3 + 0 × 2^2 + 0 × 2^1 + 1 × 2^0
10               1010            1 × 2^3 + 0 × 2^2 + 1 × 2^1 + 0 × 2^0 = 10

The exponents of 2 in the last column are the binary position indicators.

Types and sources of numerical error

What is the range of integers that a computer can represent? A certain ﬁxed number

of bits, such as 16, 32, or 64 bits, are allotted to represent every integer. This ﬁxed

maximum number of bits used for storing an integer value is determined by the

computer hardware architecture. If 16 bits are used to store each integer value in

binary form, then the maximum integer value that the computer can represent

is $2^{16} - 1 = 65\,535$. To include representation of negative integer values, 32 768 is
subtracted internally from the integer value represented by the 16-bit number to
allow representation of integers in the range of [−32 768, 32 767].
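The 16-bit limits quoted above can be queried directly in MATLAB. The sketch below uses the built-in intmax and intmin functions; the variable names are illustrative. It also shows that MATLAB integer arithmetic saturates at the limit rather than wrapping around.

```matlab
maxU16 = intmax('uint16');     % 65535  = 2^16 - 1 (unsigned 16-bit)
maxI16 = intmax('int16');      % 32767  = 2^15 - 1 (signed 16-bit)
minI16 = intmin('int16');      % -32768 = -2^15
saturated = maxI16 + 1;        % result stays at 32767: the value saturates
```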

What if we have a fractional binary number such as 1011.011 and wish to

convert this binary value to the base-10 system? Just as a digit to the right of a

radix point (decimal point) in the base-10 system represents a multiple of 1/10

raised to a power depending on the position or place value of the decimal digit

with respect to the decimal point, similarly a binary digit placed to the right of a

radix point (binary point) represents a multiple of 1/2 raised to some power that

depends on the position of the binary digit with respect to the radix point. In

other words, just as the fractional part of a decimal number can be expressed as a

sum of the negative powers of 10 (or positive powers of 1/10), similarly a binary

number fraction is actually the sum of the negative powers of 2 (or positive

powers of 1/2). Thus, the decimal value of the binary number 1011.011 is calcu-

lated as follows:

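\[ (1 \times 2^3) + (0 \times 2^2) + (1 \times 2^1) + (1 \times 2^0) + \left(0 \times (1/2)^1\right) + \left(1 \times (1/2)^2\right) + \left(1 \times (1/2)^3\right) \]
\[ = 8 + 2 + 1 + 0.25 + 0.125 = 11.375. \]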

1.2.3 Decimal to binary system

Now let’s tackle the method of converting decimal numbers to the base-2 system.

Let’s start with an easy example that involves the conversion of a decimal integer, say

123, to a binary number. This is simply done by resolving the integer 123 as a series of

powers of 2, i.e. $123 = 2^6 + 2^5 + 2^4 + 2^3 + 2^1 + 2^0 = 64 + 32 + 16 + 8 + 2 + 1$.

The powers to which 2 is raised in the expression indicate the positions for the binary

digit 1. Thus, the binary number equivalent to the decimal value 123 is 1111011,

which requires 7 bits. This expansion process of a decimal number into the sum of

powers of 2 is tedious for large decimal numbers. A simpliﬁed and straightforward

procedure to convert a decimal number into its binary equivalent is shown below

(Mathews and Fink, 2004).

We can express a positive base-10 integer I as an expansion of powers of

2, i.e.

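\[ I = b_n \times 2^n + b_{n-1} \times 2^{n-1} + \cdots + b_2 \times 2^2 + b_1 \times 2^1 + b_0 \times 2^0, \]

where $b_0, b_1, \ldots, b_n$ are binary digits each of value 0 or 1. This expansion can be
rewritten as follows:

\[ I = 2\left(b_n \times 2^{n-1} + b_{n-1} \times 2^{n-2} + \cdots + b_2 \times 2^1 + b_1 \times 2^0\right) + b_0 \]

or

\[ I = 2 \times I_1 + b_0, \]

where

\[ I_1 = b_n \times 2^{n-1} + b_{n-1} \times 2^{n-2} + \cdots + b_2 \times 2^1 + b_1 \times 2^0. \]

By writing I in this fashion, we obtain $b_0$.

Similarly,

\[ I_1 = 2\left(b_n \times 2^{n-2} + b_{n-1} \times 2^{n-3} + \cdots + b_2 \times 2^0\right) + b_1, \]

i.e.

\[ I_1 = 2 \times I_2 + b_1, \]

from which we obtain $b_1$ and

\[ I_2 = b_n \times 2^{n-2} + b_{n-1} \times 2^{n-3} + \cdots + b_2 \times 2^0. \]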

Proceeding in this way we can easily obtain all the digits in the binary representation

for I.

Example 1.1 Convert the integer 5089 in base-10 into its binary equivalent. Based on the

preceding discussion, 5089 can be written as

5089 = 2 × 2544 + 1 → b_0 = 1
2544 = 2 × 1272 + 0 → b_1 = 0
1272 = 2 × 636 + 0 → b_2 = 0
636 = 2 × 318 + 0 → b_3 = 0
318 = 2 × 159 + 0 → b_4 = 0
159 = 2 × 79 + 1 → b_5 = 1
79 = 2 × 39 + 1 → b_6 = 1
39 = 2 × 19 + 1 → b_7 = 1
19 = 2 × 9 + 1 → b_8 = 1
9 = 2 × 4 + 1 → b_9 = 1
4 = 2 × 2 + 0 → b_10 = 0
2 = 2 × 1 + 0 → b_11 = 0
1 = 2 × 0 + 1 → b_12 = 1

Thus the binary equivalent of 5089 has 13 binary digits and is 1001111100001. This algorithm, used to

convert a decimal number into its binary equivalent, can be easily incorporated into a MATLAB program.
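One possible MATLAB implementation of the repeated-division scheme is sketched below. The variable names (q, bits) are illustrative and not taken from the text; the built-in dec2bin performs the same conversion and can be used as a cross-check.

```matlab
I = 5089;                          % positive base-10 integer to convert
q = I;                             % running quotient: I, I_1, I_2, ...
bits = '';                         % binary digits as a character vector
while q > 0
    b = mod(q, 2);                 % remainder gives the next digit b_0, b_1, ...
    bits = [char('0' + b) bits];   % prepend, since b_0 is the rightmost digit
    q = floor(q / 2);              % quotient becomes the next integer to split
end
disp(bits)                         % 1001111100001
```

Prepending each new digit keeps the most significant bit on the left, which matches the order in which the binary number is written.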

1.2.4 Binary representation of floating-point numbers

Floating-point numbers are numeric quantities that have a signiﬁcand indicating the

value of the number, which is multiplied by a base raised to some power. You are

familiar with the scientiﬁc notation used to represent real numbers, i.e. numbers with

fractional parts. A number, say 1786.134, can be rewritten as $1.786134 \times 10^{3}$. Here,
the significand is the number 1.786134 that is multiplied by the base 10 raised to a
power 3 that is called the exponent or characteristic. Scientific notation is one method
of representing base-10 floating-point numbers. In this form of notation, only one
digit to the left of the decimal point in the significand is retained, such as $5.64 \times 10^{-3}$

or 9.8883 × 10

, and the magnitude of the exponent is adjusted accordingly. The

advantage of using the ﬂoating-point method as a convention for representing

numbers is that it is concise, standardizable, and can be used to represent very

large and very small numbers using a limited ﬁxed number of bits.

Two commonly used standards for storing floating-point numbers are the 32-bit

and 64-bit representations, and are known as the single-precision format and

double-precision format, respectively. Since MATLAB stores all ﬂoating-point

numbers by default using double precision, and since all major programming

languages support the double-precision data type, we will concentrate our efforts

on understanding how computers store numeric data as double-precision ﬂoating-

point numbers.

64-bit digital representations of ﬂoating-point numbers use 52 bits to store the

signiﬁcand, 1 bit to store the sign of the number, and another 11 bits for the

exponent. Note that computers store all ﬂoating-point numbers in base-2 format

and therefore not only are the signiﬁcand and exponent stored as binary numbers,

but also the base to which the exponent is raised is 2. If x stands for 1 bit then a 64-bit

ﬂoating-point number in the machine’s memory looks like this:

x        xxxxxxxxxxx     xxxxxxx...xxxxxx
↑             ↑                  ↑
sign s   exponent k      significand d
1 bit    11 bits         52 bits

The single bit s that conveys the sign indicates a positive number when s = 0. The
range of the exponent is calculated as $[0, 2^{11} - 1] = [0, 2047]$. In order to
accommodate negative exponents and thereby extend the numeric range to very small
numbers, 1023 is deducted from the binary exponent to give the range [−1023, 1024]
for the exponent. The exponent k that is stored in the computer is said to have a bias
(which is 1023), since the stored value is the sum of the true exponent and the number
1023 ($2^{10} - 1$). Therefore if 1040 is the value of the biased exponent that is stored, the
true exponent is actually 1040 – 1023 = 17.
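The biased exponent can be inspected from within MATLAB. The sketch below is one way to do it, assuming the built-in num2hex, hex2dec, bitshift, and bitand functions; the example value 2^17 and the variable names are chosen only to reproduce the stored exponent 1040 mentioned above.

```matlab
x = 2^17;                      % example value whose true exponent is 17
h = num2hex(x);                % 16 hex characters = 64 bits: '4100000000000000'
top12 = hex2dec(h(1:3));       % first 12 bits: 1 sign bit + 11 exponent bits
s = bitshift(top12, -11);      % sign bit (0 for a positive number)
k = bitand(top12, 2047);       % stored (biased) exponent, in the range 0..2047
trueExp = k - 1023;            % remove the bias to recover the true exponent
```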

Overflow and underflow errors

Using 64-bit ﬂoating-point number representation, we can determine approximately

the largest and smallest exponent that can be represented in base 10:

largest e(base 10) = 308 (since $2^{1024} \sim 1.8 \times 10^{308}$),
smallest e(base 10) = −308 (since $2^{-1023} \sim 1.1 \times 10^{-308}$).

Thus, there is an upper limit on the largest magnitude that can be represented by a

digital machine. The largest number that is recognized by MATLAB can be

obtained by entering the following command in the MATLAB Command Window:

realmax

MATLAB outputs the following result:

ans =

1.7977e+308

Any number larger than this value is given a special value by MATLAB equal to

inﬁnity (Inf). Typically, however, when working with programming languages such

as Fortran and C, numbers larger than this value are not recognized, and generation

of such numbers within the machine results in an overﬂow error. Such errors generate

a ﬂoating-point exception causing the program to terminate immediately unless

error-handling measures have been employed within the program. If you type a

number larger than realmax, such as

realmax + 1e+308

MATLAB recognizes this number as inﬁnity and outputs

ans =

Inf

This is MATLAB's built-in method to handle occurrences of overflow. Similarly, the

smallest negative number supported by MATLAB is given by –realmax and

numbers smaller than this number are assigned the value –Inf.
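The overflow behavior described above can be confirmed with a few assignments. This is a small sketch with illustrative variable names.

```matlab
over    = realmax + 1e308;     % exceeds the representable range: Inf
negOver = -realmax - 1e308;    % beyond the most negative double: -Inf
same    = realmax + 1;         % 1 is far below the spacing of doubles near
                               % realmax, so the sum is still realmax
```

Note that adding a small number to realmax does not overflow: the increment must be comparable to the gap between adjacent floating-point numbers at that magnitude before the result changes.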

Box 1.3 Selecting subjects for a clinical trial

The fictitious biomedical company Biotektroniks is conducting a double-blind clinical trial to test a

vaccine for sneazlepox, a recently discovered disease. The company has 100 healthy volunteers: 50

women and 50 men. The volunteers will be divided into two groups; one group will receive the normal

vaccine, while the other group will be vaccinated with saline water, or placebos. Both groups will have

25 women and 25 men. In how many ways can one choose 25 women and 25 men from the group of 50

women and 50 men for the normal vaccine group?

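The solution to this problem is simply $N = C^{50}_{25} \times C^{50}_{25}$, where $C^{n}_{r} = \dfrac{n!}{r!\,(n-r)!}$. (See Chapter 3 for a
discussion on combinatorics.) Thus,

\[ N = C^{50}_{25} \times C^{50}_{25} = \frac{50!\,50!}{(25!)^4}. \]

On evaluating the factorials we obtain

\[ N = \frac{3.0414 \times 10^{64} \times 3.0414 \times 10^{64}}{(1.551 \times 10^{25})^4} = \frac{9.2501 \times 10^{128}}{5.784 \times 10^{100}}. \]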

When using double precision, these extraordinarily large numbers will still be recognized. However, in
single precision, the range of recognizable numbers extends from $-3.403 \times 10^{38}$ to $3.403 \times 10^{38}$ (use the
MATLAB function realmax('single') for obtaining these values). Thus, the factorial calculations
would result in an overflow if one were working with single-precision arithmetic. Note that the final answer
is $1.598 \times 10^{28}$, which is within the defined range for the single-precision data type.

How do we go about solving such problems without encountering overflow? If you know you will be

working with large numbers, it is important to check your product to make sure it does not exceed a

certain limit. In situations where the product exceeds the set bounds, divide the product periodically by

a large number to keep it within range, while keeping track of the number of times this division step is

performed. The final result can be recovered by a corresponding multiplication step at the end, if

needed. There are other ways to implement algorithms for the calculation of large products without

running into problems, and you will be introduced to them in this chapter.
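As one illustration of an overflow-safe calculation, the sketch below evaluates N in single precision by interleaving the divisions with the multiplications. This is not the periodic-rescaling scheme described in the box, but an alternative that keeps every partial product small; the variable names are illustrative.

```matlab
% Direct evaluation in single precision: 50! exceeds realmax('single').
fact50 = prod(single(1:50));               % overflows to Inf

% Interleave multiplications and divisions so partial products stay in range:
% C(50,25) = (26/1) * (27/2) * ... * (50/25)
C = prod(single(26:50) ./ single(1:25));   % about 1.264e14, well within range
N = C * C;                                 % about 1.598e28
```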

Similarly, you can ﬁnd out the smallest positive number greater than zero that is

recognizable by MATLAB by typing into the MATLAB Command Window

realmin

ans =

2.2251e-308

Numbers produced by calculations that are smaller than the smallest number

supported by the machine generate underﬂow errors. Most programming languages

are equipped to handle underﬂow errors without resulting in a program crash and

typically set the number to zero. Let’s observe how MATLAB handles such a

scenario. In the Command Window, if we type the number

1.0e-309

MATLAB outputs

ans =

1.0000e-309

MATLAB has special methods to handle numbers slightly smaller than realmin by

taking a few bits from the signiﬁcand and adding them to the exponent. This, of

course, compromises the precision of the numeric value stored by reducing the

number of signiﬁcant digits.
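The gradual loss of precision below realmin can be observed directly. In the sketch below (illustrative variable names), values smaller than realmin remain nonzero until they fall below 2^-1074, at which point the result becomes zero.

```matlab
sub      = realmin / 2;        % below realmin but still nonzero
smallest = realmin * eps;      % 2^-1074, the smallest positive stored value
flushed  = smallest / 2;       % too small to represent: the result is 0
fprintf('%g  %g  %g\n', sub, smallest, flushed)
```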

The lack of continuity between the smallest number representable by a computer

and 0 reveals an important source of error: ﬂoating-point numbers are not contin-

uous due to the ﬁnite limits of range and precision. Thus, there are gaps between two

ﬂoating-point numbers that are closest in value. The magnitude of this gap in the

ﬂoating-point number line increases with the magnitude of the numbers. The size of

the numeric gap between the number 1.0 and the next larger number distinct from

1.0 is called the machine epsilon and is calculated by the MATLAB function

eps

which MATLAB outputs as

ans =

2.2204e-016

Note that eps is the minimum value that must be added to 1.0 to result in another
number larger than 1.0 and is $\sim 2^{-52} = 2.22 \times 10^{-16}$, i.e. the incremental value of 1 bit
in the significand's rightmost (52nd) position. The limit of precision as given by eps
varies based on the magnitude of the number under consideration. For example,

eps(100.0)

ans =

1.4211e-014

which is obviously larger than the precision limit for 1.0 by two orders of magnitude
$O(10^2)$. If two numbers close in value have a difference that is smaller than their
smallest significant digit, then the two are indistinguishable by the computer.

Figure 1.3 pictorially describes the concepts of underﬂow, overﬂow, and disconti-

nuity of double-precision ﬂoating-point numbers that are represented by a com-

puter. As the order of magnitude of the numbers increases, the discontinuity between

two ﬂoating-point numbers that are adjacent to each other on the ﬂoating-point

number line also becomes larger.
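The growth of the gap with magnitude can be checked numerically. This is a brief sketch with illustrative variable names, using only the built-in eps function.

```matlab
e1   = eps;                        % spacing at 1.0, equal to 2^-52
e100 = eps(100.0);                 % spacing at 100.0, equal to 2^-46
bigger    = (1 + eps) > 1;         % true: 1 + eps is the next double after 1
unchanged = (1 + eps/4) == 1;      % true: the increment falls inside the gap
lost      = (1e16 + 1) == 1e16;    % true: adjacent doubles near 1e16 differ by 2
```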

Binary significand – limits of precision

The number of bits allotted to the signiﬁcand governs the limits of precision. A binary

signiﬁcand represents fractional numbers in the base-2 system. According to IEEE

Standard 754 for ﬂoating-point representation, the computer maintains an implicitly

assumed bit (termed a hidden bit) of value 1 that precedes the binary point and does not

need to be stored (Tanenbaum, 1999). The 52 binary digits following the binary point are

allowed any arbitrary values of 0 and 1. The binary number can thus be represented as

\[ 1.b_1 b_2 \ldots b_{p-1} \times 2^k, \qquad (1.1) \]

where b stands for a binary digit, p is the maximum number of binary digits allowed

in the signiﬁcand based on the limits of precision, and k is the binary exponent.

Therefore, the signiﬁcand of every stored number has a value of 1.0 ≤ fraction < 2.0.

The fraction has the exact value of 1.0 if all the 52 bits are 0, and has a value just

slightly less than 2.0 if all the bits have a value of 1. How then is the value 0

represented? When all the 11 exponent bits and the 52 bits for the signiﬁcand are

of value 0, the implied or assumed bit value preceding the binary point is no longer

considered as 1.0. A 52-bit binary number corresponds to at least 15 digits in the

decimal system and at least 16 decimal digits when the binary value is a fractional

number. Therefore, any double-precision ﬂoating-point number has a maximum of

16 signiﬁcant digits and this deﬁnes the precision limit for this data type.

In Section 1.2.2 we looked at the method of converting binary numbers to decimal

numbers and vice versa. These conversion techniques will come in handy here to help

you understand the limits imposed by finite precision. In the next few examples, we

disregard the implied or assumed bit of value 1 located to the left of the binary point.

Example 1.2 Convert the decimal numbers 0.6875 and 0.875 into their equivalent binary fractions
or binary significands.

The method to convert fractional decimal numbers to their binary equivalent is similar to the method for

converting integer decimal numbers to base-2 numbers. A fractional real number R can be expressed as

the sum of powers of 1/2 as shown:

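\[ R = b_1 \times (1/2)^1 + b_2 \times (1/2)^2 + b_3 \times (1/2)^3 + \cdots + b_n \times (1/2)^n + \cdots \]

such that

\[ R = 0.b_1 b_2 b_3 \ldots b_n \ldots \]

(base-10 fraction on the left, binary fraction on the right),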

[Figure 1.3. Simple schematic of the floating-point number line. From left to right: overflow errors (may generate a floating-point exception) below about −1.8 × 10^308; numbers representable in double precision from about −1.8 × 10^308 to −1.1 × 10^−308; underflow errors around 0; numbers representable in double precision from about 1.1 × 10^−308 to 1.8 × 10^308; overflow errors (may generate a floating-point exception) above about 1.8 × 10^308. The floating-point number line is not continuous.]

where $b_1, b_2, \ldots, b_n$ are binary digits each of value 0 or 1. One method to express a decimal fraction in

terms of powers of 1/2 is to subtract successively increasing integral powers of 1/2 from the base-10

number until the remainder value becomes zero.

(1) 0.6875 − (1/2)^1 = 0.1875, which is the remainder. The first digit of the significand, b_1, is 1. So
0.1875 − (1/2)^2 < 0 → b_2 = 0,
0.1875 − (1/2)^3 = 0.0625 → b_3 = 1,
0.0625 − (1/2)^4 = 0.0 → b_4 = 1 (this is the last digit of the significand that is not zero).

Thus, the equivalent binary significand is 0.1011.

(2) 0.875 − (1/2)^1 = 0.375 → b_1 = 1,
0.375 − (1/2)^2 = 0.125 → b_2 = 1,
0.125 − (1/2)^3 = 0.0 → b_3 = 1 (this is the last digit of the significand that is not zero).

Thus, the equivalent binary significand is 0.111.

The binary equivalent of a number such as 0.7 is 0.1011 0011 0011 0011 ..., which never
terminates: the group 0011 repeats indefinitely. Show this yourself. A finite number of binary digits cannot represent the decimal number

0.7 exactly but can only approximate the true value of 0.7. Here is one instance of round-off error. As

you can now see, round-off errors arise when a decimal number is substituted by an equivalent binary

floating-point representation.

Fractional numbers that may be exactly represented using a finite number of digits in the decimal

system may not be exactly represented using a finite number of digits in the binary system.
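The successive-subtraction method of Example 1.2 can be sketched in MATLAB as follows. The cutoff of 16 binary digits (nmax) and the variable names are assumptions made for illustration; note also that the value 0.7 is itself already stored as a binary approximation, although its leading digits agree with the expansion quoted above.

```matlab
values = [0.6875 0.875 0.7];       % decimal fractions to convert
nmax   = 16;                       % number of binary digits to generate
digitsOut = cell(size(values));
for i = 1:numel(values)
    r = values(i);                 % remainder still to be represented
    b = repmat('0', 1, nmax);      % start with all digits set to 0
    for n = 1:nmax
        if r >= 2^(-n)             % (1/2)^n fits into the remainder
            b(n) = '1';
            r = r - 2^(-n);        % subtract it and continue
        end
    end
    digitsOut{i} = ['0.' b];
    fprintf('%g -> %s\n', values(i), digitsOut{i});
end
```

For 0.6875 and 0.875 the remainder reaches zero after four and three digits, respectively, whereas for 0.7 the pattern 0011 keeps repeating up to the cutoff.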

Example 1.3 Convert the binary significand 0.10110 into its equivalent base-10 rational
number.

The decimal fraction equivalent to the binary significand 0.10110 in base 2 is

\[ 1 \times (1/2)^1 + 0 \times (1/2)^2 + 1 \times (1/2)^3 + 1 \times (1/2)^4 + 0 \times (1/2)^5 = 0.6875. \]

See Table 1.2 for more examples demonstrating the conversion scheme.

Table 1.2. Scheme for converting binary floating-point numbers to decimal numbers, where p = 4
and k = 0, −2

For these conversions, the binary floating-point number follows the format 0.b_1 b_2 ... b_{p-1} × 2^k

Binary significand (p = 4)   Conversion calculations                                          Decimal number
0.1000   (k = 0)             (1 × (1/2)^1 + 0 × (1/2)^2 + 0 × (1/2)^3 + 0 × (1/2)^4) × 2^0    0.5
         (k = −2)            (1 × (1/2)^1 + 0 × (1/2)^2 + 0 × (1/2)^3 + 0 × (1/2)^4) × 2^−2   0.125
0.1010   (k = 0)             (1 × (1/2)^1 + 0 × (1/2)^2 + 1 × (1/2)^3 + 0 × (1/2)^4) × 2^0    0.625
         (k = −2)            (1 × (1/2)^1 + 0 × (1/2)^2 + 1 × (1/2)^3 + 0 × (1/2)^4) × 2^−2   0.156 25
0.1111   (k = 0)             (1 × (1/2)^1 + 1 × (1/2)^2 + 1 × (1/2)^3 + 1 × (1/2)^4) × 2^0    0.9375
         (k = −2)            (1 × (1/2)^1 + 1 × (1/2)^2 + 1 × (1/2)^3 + 1 × (1/2)^4) × 2^−2   0.234 375

Now that we have discussed the methods for converting decimal integers into
binary integers and decimal fractions into their binary significand, we are in a
position to obtain the binary floating-point representation of a decimal number as
described by Equation (1.1). The following sequential steps describe the method to
obtain the binary floating-point representation for a decimal number a.

(1) Divide a by $2^n$, where n is the integer such that $2^n$ is the largest power of 2 that is less
than or equal to a, e.g. if a = 40, then n = 5 and $a/2^n = 40/32 = 1.25$. Therefore,
$a = (a/2^n) \times 2^n = 1.25 \times 2^5$.
(2) Next, convert the decimal fraction (to the right of the decimal point) of the quotient
into its binary significand, e.g. the binary significand of 0.25 is 0.01.
(3) Finally, convert the decimal exponent n into a binary integer, e.g. the binary
equivalent of 5 is 101. Thus the binary floating-point representation of 40 is
$1.01 \times 2^{101}$.
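The three steps above can be sketched in MATLAB as follows. The variable names are illustrative, the fractional digits are generated here by repeated doubling (an equivalent alternative to the successive subtraction of Example 1.2), and floor(log2(a)) is assumed to return the correct exponent for the positive value used.

```matlab
a = 40;                            % positive decimal number to convert

% Step 1: largest n with 2^n <= a, and the quotient a / 2^n in [1, 2).
n = floor(log2(a));                % 5
q = a / 2^n;                       % 1.25

% Step 2: binary digits of the fractional part of q.
r = q - 1;
fracBits = '';
while r > 0 && numel(fracBits) < 52
    r = 2 * r;
    if r >= 1
        fracBits = [fracBits '1'];
        r = r - 1;
    else
        fracBits = [fracBits '0'];
    end
end

% Step 3: binary form of the exponent.
expBits = dec2bin(n);              % '101'
result  = sprintf('1.%s x 2^%s', fracBits, expBits);
disp(result)                       % 1.01 x 2^101
```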

Using MATLAB

The numeric output generated by MATLAB depends on the display format

chosen, e.g. short or long. Hence the numeric display may include only a few digits

to the right of the decimal point. You can change the format of the numeric display

using the format command. Type help format for more information on the

choices available. Regardless of the display format, the internal mathematical

computations are always done in double precision, unless otherwise speciﬁed as,

e.g., single precision.

1.3 Methods used to measure error

Before we can fully assess the signiﬁcance of round-off errors produced by ﬂoating-

point arithmetic, we need to familiarize ourselves with standard methods used to

measure these errors. One method to measure the magnitude of error involves

determining its absolute value. If $m'$ is the approximation to m, the true quantity,
then the absolute error is given by

\[ E_a = |m' - m|. \qquad (1.2) \]

This error measurement uses the absolute difference between the two numbers to

determine the precision of the approximation. However, it does not give us a feel for

the accuracy of the approximation, since the absolute error does not compare the

absolute difference with the magnitude of m. (For a discussion on accuracy vs.

precision, see Box 1.2.) This is measured by the relative error,

\[ E_r = \frac{|m' - m|}{|m|}. \qquad (1.3) \]
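The two error measures can be evaluated in MATLAB for a simple case. In this sketch the true quantity is taken to be 0.2 as stored in double precision and the approximation is the same number rounded to single precision; these choices and the variable names (Ea, Er) are assumptions made only for illustration.

```matlab
m       = 0.2;                     % reference value (double precision)
mApprox = double(single(0.2));     % the same number after rounding to single
Ea = abs(mApprox - m);             % absolute error, Equation (1.2)
Er = Ea / abs(m);                  % relative error, Equation (1.3)
fprintf('Ea = %.3e, Er = %.3e\n', Ea, Er)
```

The relative error is the more informative of the two here, since it can be compared directly with the precision limit of the single data type.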

Example 1.4 Errors from repeated addition of a decimal fraction

The number 0.2 is equivalent to the binary significand $0.\overline{0011}$, where the overbar indicates the
infinite repetition of the group of digits located underneath it. This fraction cannot be stored exactly by
a computer when using floating-point representation of numbers. The relative error in the computer
representation of 0.2 is practically insignificant compared to the true value. However, errors involved in
binary floating-point approximations of decimal fractions are additive. If 0.2 is added to itself many times,
the resulting error may become large enough to become significant. Here, we consider the single data
type, which is a 32-bit representation format for floating-point numbers and has at most eight significant
digits (as opposed to the 16 significant digits available for double data types, which use 64-bit precision).
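A possible experiment along these lines is sketched below. The number of additions (250 000) and the variable names are assumptions chosen for illustration and are not taken from the text; the sketch accumulates 0.2 in both single and double precision and compares each sum with the exact result.

```matlab
n = 250000;                            % number of additions (illustrative)
sSingle = single(0);
sDouble = 0;
for i = 1:n
    sSingle = sSingle + single(0.2);   % accumulate in 32-bit precision
    sDouble = sDouble + 0.2;           % accumulate in 64-bit precision
end
exact = n / 5;                         % 50000, the exact value of 0.2 * n
relErrSingle = abs(double(sSingle) - exact) / exact;
relErrDouble = abs(sDouble - exact) / exact;
fprintf('single: %.3e   double: %.3e\n', relErrSingle, relErrDouble)
```

The single-precision sum drifts because, as the running total grows, the gap between adjacent single-precision numbers becomes a noticeable fraction of the increment 0.2, so each addition is rounded by a larger amount.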
