A mere report of relations observed in the past cannot be called knowledge; if knowledge is to
reveal objective relations of physical objects, it must include reliable predictions.
The purpose of any regression method is to extract, from a mass of data, information in
the form of a predictive mathematical equation. The sources of scientific and engineering
data include experimental observations obtained in a laboratory or out in the field,
instrument output such as temperature, concentration, pH, voltage, weight, and tensile
strength, and quality control measurements such as purity, yield, shine, and hardness.
Engineering/scientific data sets contain many paired values, i.e. a dependent variable
paired with one or more independent variables. The topic of linear least-squares
regression as a curve-fitting method is discussed extensively in Chapter 2. In
Sections 2.8–2.11, we derived the normal equations, which upon solution yield
estimates of the parameters of the linear model, and introduced the residual, SSR (sum of
the squared residuals), and the coefficient of determination $R^2$. Note that for any
mathematical model amenable to linear least-squares regression, linearity of the equation is
with respect to the undetermined parameters only. The equation may have any form of
dependency on the dependent and independent variables. The method of linear regression
involves setting up a mathematical model that is linear in the model parameters to
explain the trends observed in the data. If $y$ depends on the value of $x$, the independent
variable, then the linear regression equation is set up as follows:
$$y = \beta_1 f_1(x) + \beta_2 f_2(x) + \cdots + \beta_n f_n(x) + E,$$
where $f$ can be any function of $x$, and $\beta_1, \beta_2, \ldots, \beta_n$ are the model parameters. The
individual functions $f_i(x)$ $(1 \le i \le n)$ may depend nonlinearly on $x$.
The dependent variable $y$ is a random variable that exhibits some variance $\sigma^2$ about a mean value
$$\mu_y = \beta_1 f_1(x) + \beta_2 f_2(x) + \cdots + \beta_n f_n(x).$$
The term $E$ represents the random error or variability associated with $y$.
When the number of data pairs in the data set $(x, y)$
exceeds the number of unknown parameters in the model, the resulting linear system
of equations is overdetermined. As discussed in Chapter 2, a linear least-squares
curve does not pass through every $(x_i, y_i)$ data point.
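To make this concrete, the following is a minimal sketch (not from the text) of how such an overdetermined fit might be set up in Python/NumPy. The basis functions $f_1(x) = 1$, $f_2(x) = x$, $f_3(x) = x^2$ and the synthetic data are assumptions chosen purely for illustration; any functions of $x$ that keep the model linear in the parameters would serve.

```python
import numpy as np

# Illustrative (assumed) basis functions f_i(x); the model
# y = b1*f1(x) + b2*f2(x) + b3*f3(x) + E is linear in the parameters
# even though f3 depends nonlinearly on x.
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]

# Synthetic data, assumed for illustration: more data pairs (20) than
# parameters (3), so the linear system is overdetermined.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 20)
y = 1.5 + 0.8 * x - 0.3 * x**2 + rng.normal(scale=0.2, size=x.size)

# Design matrix A with one column per basis function f_i(x).
A = np.column_stack([f(x) for f in basis])

# Solve the overdetermined system A c ~= y in the least-squares sense;
# this is equivalent to solving the normal equations (A^T A) c = A^T y.
c, residual_ss, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print("estimated coefficients c1..c3:", c)
```

Solving with `np.linalg.lstsq` is numerically preferable to forming and inverting the normal equations explicitly, although both approaches yield the same least-squares estimates.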
The least-squares method is an estimation procedure that uses the limited data at
hand to quantify the parameters of the proposed mathematical model and consequently
yield the following predictive model:
$$\hat{y} = c_1 f_1(x) + c_2 f_2(x) + \cdots + c_n f_n(x),$$
where $\hat{y}$ are the model predictions of $y$. The hat symbol (^) indicates that the $y$ values
are generated using model parameters derived from a data set. Because the values of the
regression-derived model coefficients depend on the particular data sample, they are
subject to sampling error. The computed values of the regression equation coefficients
$c_1, c_2, \ldots, c_n$ serve as estimates of the true parameters $\beta_1, \beta_2, \ldots, \beta_n$ of the model.
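Continuing the assumed example above, the fitted coefficients define $\hat{y}$, from which the residuals, SSR, and $R^2$ introduced in Chapter 2 follow directly:

```python
# Model predictions y_hat = c1*f1(x) + c2*f2(x) + c3*f3(x).
y_hat = A @ c

# Residuals and the sum of squared residuals (SSR).
residuals = y - y_hat
SSR = np.sum(residuals**2)

# Coefficient of determination R^2: the fraction of the variance in y
# explained by the regression.
SST = np.sum((y - y.mean())**2)
R2 = 1.0 - SSR / SST
print(f"SSR = {SSR:.4f}, R^2 = {R2:.4f}")
```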
We use confidence intervals to quantify the uncertainty in using $c_1, c_2, \ldots, c_n$
as estimates for the actual model parameters $\beta_1, \beta_2, \ldots, \beta_n$. Sometimes one or more
of the $c_i$ parameters have a physical significance beyond the predictive model, in
which case the confidence interval has additional importance when reporting the
best-fit value; a sketch of one way to compute such intervals follows the assumptions
below. Before we can proceed further, we must draw your attention to the
basic assumptions made by the linear regression model.
(1) The measured values of the independent variable $x$ are known with perfect precision
and do not contain any error.
(2) Every $y_i$ value is normally distributed about its mean $\mu_{y_i}$ with an unknown variance
$\sigma^2$. The variance of $y$ is independent of $x$, and is thus the same for all $y_i$.
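Under assumptions (1) and (2), a standard construction for coefficient confidence intervals uses the $t$-distribution with $m - n$ degrees of freedom, where $m$ is the number of data pairs and $n$ the number of parameters. The sketch below continues the assumed example; the specific formulas presented later in the text may differ in notation.

```python
from scipy import stats

m, n = A.shape        # number of data pairs, number of parameters
s2 = SSR / (m - n)    # unbiased estimate of the error variance sigma^2

# Standard errors of the coefficients from the diagonal of s2 * (A^T A)^(-1);
# explicit inversion is acceptable here because A^T A is small.
cov_c = s2 * np.linalg.inv(A.T @ A)
se_c = np.sqrt(np.diag(cov_c))

# 95% confidence intervals: c_i +/- t_{0.975, m-n} * se(c_i).
t_crit = stats.t.ppf(0.975, df=m - n)
for i, (ci, sei) in enumerate(zip(c, se_c), start=1):
    print(f"c{i}: {ci:.4f} +/- {t_crit * sei:.4f}")
```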