King M.R., Mody N.A. Numerical and Statistical Methods for Bioengineering: Applications in MATLAB


2.9 Curve fitting using linear least-squares approximation

For any set of data that can be represented by a straight line, the dependent variable $y$ must vary with the $n$ independent variables $x_1, x_2, \ldots, x_n$ in a linear fashion:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n. \tag{2.31}$$

The dependency on the $n$ variables is defined by the values of the $n + 1$ constant coefficients $\beta_0, \beta_1, \ldots, \beta_n$. Most often, situations arise in which the number $m$ of data points $(x_1, x_2, \ldots, x_n, y)$ is greater than the number of undetermined coefficients, $n + 1$. The system of linear equations consists of $m$ equations in $n + 1$ variables (the unknown coefficients are the variables to be determined). Since $m > n + 1$, such a system is overdetermined, and in general no exact solution can be obtained. Let us consider the simplest linear relationship $y = \beta_0 + \beta_1 x$, where the association of $y$ is with only one independent variable $x$. The number of unknowns in this equation is two. If the number $m$ of data points $(x_i, y_i)$, $1 \le i \le m$, that describe the linear trend is greater than two, then one single line cannot pass through all these points unless all the data points actually lie on a single line. Our goal is to determine the values of $\beta_0$ and $\beta_1$ of a linear equation that produce the best approximation of the linear trend of the data $(x_i, y_i)$, $1 \le i \le m$.

The quality of the fit of the straight-line approximation is measured by calculating the residuals. A residual is the difference between the actual data value $y_i$ at $x_i$ and the corresponding approximated value predicted by the fitted line, $\hat{y}_i = \beta_0 + \beta_1 x_i$, where the hat symbol represents an approximation that is derived from regression. Thus, the residual at $x_i$ is calculated as

$$r_i = y_i - (\beta_0 + \beta_1 x_i). \tag{2.32}$$

Figure 2.8 Transition from rotating to gliding motion. Plot of the initial angles of tilt about the y-axis that result in a transition of platelet flow regime as a function of initial platelet centroid height from the surface. [Plot not reproduced. Axes: initial y-axis tilt $\alpha_{\mathrm{crit}}$ (degrees) versus initial height $H$ of platelet centroid (μm).]

It is assumed in the following development of the regression equations that the $x$ values (independent variable) are exactly known without error. The best fit can be determined by minimizing the calculated residuals according to a pre-defined criterion such as

criterion 1: minimum of $\sum_{i=1}^{m} r_i$;

criterion 2: minimum of $\sum_{i=1}^{m} |r_i|$;

criterion 3: minimum of $\sum_{i=1}^{m} r_i^2$.

The flaw with criterion 1 is obvious. Residual values can be either positive or negative. Even for a badly fitted line, large positive residuals can cancel large negative residuals, producing a sum nearly equal to zero and thereby masking a bad approximation. Figure 2.9(a) shows a line that is clearly a bad fit, yet gives $\sum_{i=1}^{m} r_i = 0$, thus satisfying the criterion.

The problem with criterion 2 becomes obvious when we try to take the derivatives

$$\frac{d \sum_{i=1}^{m} |r_i|}{d\beta_0} \quad \text{and} \quad \frac{d \sum_{i=1}^{m} |r_i|}{d\beta_1}$$

and equate them to zero in order to determine the value of the coefficients at the point of minimum of the sum of the absolute residuals. A unique solution does not exist, i.e. multiple lines can satisfy this criterion. This is clearly evident from Figure 2.9(b), which shows three lines, each equally satisfying criterion 2.
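The weaknesses of criteria 1 and 2 can be checked numerically. The following is a minimal sketch using two small hypothetical data sets (not taken from the text); the variable names and the helper `sumabs` are illustrative assumptions.

```matlab
% Criterion 1: a poor fit can still give a zero residual sum.
x = [1 2 3 4 5];                 % hypothetical data lying exactly on y = 2x
y = [2 4 6 8 10];
r_bad = y - mean(y);             % residuals of the horizontal line y = mean(y)
crit1 = sum(r_bad);              % positive and negative residuals cancel
crit3 = sum(r_bad.^2);           % the squared residuals expose the misfit

% Criterion 2: different lines can share the same sum of absolute residuals.
xs = [0 0 1 1];                  % hypothetical data: two points at each x
ys = [0 1 0 1];
sumabs = @(b0, b1) sum(abs(ys - (b0 + b1*xs)));
s_a = sumabs(0.5, 0);            % horizontal line y = 0.5
s_b = sumabs(0.2, 0.5);          % tilted line y = 0.2 + 0.5x
fprintf('crit1 = %g, crit3 = %g, s_a = %g, s_b = %g\n', crit1, crit3, s_a, s_b);
```

Here the horizontal line through the mean gives a residual sum of zero even though it ignores the trend, while its sum of squared residuals is 40. In the second data set, the two distinct lines give the same sum of absolute residuals, so criterion 2 cannot choose between them.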

Figure 2.9 Illustration of the difficulties faced in obtaining a best-fit for a line when using (a) minimization criterion 1 and (b) minimization criterion 2. [Plots not reproduced. Panel (a): fitted straight line. Panel (b): three fitted straight lines.]

Systems of linear equations

Criterion 3 is called the least-squares fit. The goal involves minimizing the sum of the squared residuals, or SSR. The least-squares criterion provides a unique best-fit solution. In fact, in Euclidean space, where the norm of a vector is defined by Equation (2.9), determination of the least-squares of $\mathbf{r}$ is synonymous with the minimization of $\|\mathbf{r}\|$, and therefore specifies the best possible fit.

2.9.1 The normal equations

Now we delve into the mathematical derivation of the linear regression equations based on the least-squares principle. The residual is defined by Equation (2.32). Criterion 3 is rewritten as

$$\mathrm{SSR} = \sum_{i=1}^{m} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2. \tag{2.33}$$

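Minimization of the SSR is achieved by differentiating Equation (2.33) with respect to each unknown coefficient. The solution of the following system of equations yields coefficient values that provide the unique least-squares fit to the data:

$$\frac{d\,\mathrm{SSR}}{d\beta_0} = \sum_{i=1}^{m} -2 \left( y_i - (\beta_0 + \beta_1 x_i) \right) = 0,$$

$$\frac{d\,\mathrm{SSR}}{d\beta_1} = \sum_{i=1}^{m} -2 x_i \left( y_i - (\beta_0 + \beta_1 x_i) \right) = 0.$$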

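After simplification of the above two equations, we get

$$m\beta_0 + \beta_1 \sum_{i=1}^{m} x_i = \sum_{i=1}^{m} y_i, \tag{2.34}$$

$$\beta_0 \sum_{i=1}^{m} x_i + \beta_1 \sum_{i=1}^{m} x_i^2 = \sum_{i=1}^{m} x_i y_i. \tag{2.35}$$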
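The simplification step can be written out explicitly. As a worked restatement using the same symbols, dividing each derivative condition by $-2$ and distributing the summation over the terms gives:

```latex
\sum_{i=1}^{m} y_i - m\beta_0 - \beta_1 \sum_{i=1}^{m} x_i = 0,
\qquad
\sum_{i=1}^{m} x_i y_i - \beta_0 \sum_{i=1}^{m} x_i - \beta_1 \sum_{i=1}^{m} x_i^2 = 0 .
```

Moving the terms containing $\beta_0$ and $\beta_1$ to the other side yields Equations (2.34) and (2.35). The factor $m$ arises because $\beta_0$ is summed $m$ times.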

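Equations (2.34) and (2.35) are referred to as the normal equations. This name derives from the geometric interpretation of these equations, which is discussed shortly. This set of two equations can be expressed in the form of a single matrix equation:

$$\begin{bmatrix} m & \sum_{i=1}^{m} x_i \\ \sum_{i=1}^{m} x_i & \sum_{i=1}^{m} x_i^2 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{m} y_i \\ \sum_{i=1}^{m} x_i y_i \end{bmatrix}. \tag{2.36}$$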

Solving Equations (2.34) and (2.35) simultaneously, we obtain

$$\beta_1 = \frac{m \sum_{i=1}^{m} x_i y_i - \sum_{i=1}^{m} x_i \sum_{i=1}^{m} y_i}{m \sum_{i=1}^{m} x_i^2 - \left( \sum_{i=1}^{m} x_i \right)^2} \tag{2.37}$$

and

$$\beta_0 = \bar{y} - \beta_1 \bar{x}, \tag{2.38}$$

where the means of $x$ and $y$ are defined as

$$\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i \quad \text{and} \quad \bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i.$$

Footnote: A linear least-squares fit is the best-fit only if the phenomenon which the data describes adheres to certain behaviors, such as independence of data values, normality of y distribution about its mean, and constant variance of y over the range of x studied. These assumptions are discussed in detail in Chapter 3.

In fact, $\beta_1$ can be expressed in somewhat simpler form as follows:

$$\beta_1 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m} (x_i - \bar{x})^2}. \tag{2.39}$$

You are asked to verify the equivalence. (Hint: Expand Equation (2.39) and use the definition of the mean to simplify the terms.)
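Before working through the algebra, the equivalence can be checked numerically. The sketch below is only a sanity check, not a proof; it assumes a small hypothetical data set that lies exactly on $y = 1 + 2x$, and the variable names are illustrative.

```matlab
% Hypothetical data lying exactly on y = 1 + 2x (not from the text)
x = [0 1 2 3];
y = [1 3 5 7];
m = numel(x);
xbar = mean(x);
ybar = mean(y);

% Slope from Equation (2.37)
beta1_a = (m*sum(x.*y) - sum(x)*sum(y)) / (m*sum(x.^2) - sum(x)^2);
% Slope from Equation (2.39)
beta1_b = sum((x - xbar).*(y - ybar)) / sum((x - xbar).^2);
% Intercept from Equation (2.38)
beta0 = ybar - beta1_a*xbar;
fprintf('beta1 (2.37) = %g, beta1 (2.39) = %g, beta0 = %g\n', beta1_a, beta1_b, beta0);
```

Both slope expressions return 2 and the intercept is 1, matching the line used to generate the data.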

To illustrate the use of Equation (2.36) in determining the coefﬁcients of a

regression line, let’s consider a simple example.

Example 2.7

Determine the linear trend line for the following data set:

x = 1.0, 2.0, 3.0, 4.0, 5.0, 6.0;

y = 1.6, 4.2, 6.6, 8.7, 11.5, 13.7.

While a MATLAB program can be written to perform the necessary calculations, we will perform them by

hand for demonstration purposes.

First, we calculate the summations of $x_i$, $x_i^2$, $y_i$ and $x_i y_i$, which can be conveniently represented in tabular form (see Table 2.4).

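Using Equation (2.36) we obtain the matrix equation

$$\begin{bmatrix} 6 & 21.0 \\ 21.0 & 91.0 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} 46.3 \\ 204.3 \end{bmatrix}.$$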

MATLAB can be used to solve the above linear problem:

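A = [6 21.0; 21.0 91.0];   % coefficient matrix of Equation (2.36)
b = [46.3; 204.3];         % right-hand side of Equation (2.36)
c = A\b                    % solve for [beta0; beta1]
c =
   -0.7333
    2.4143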

Thus, $\beta_0 = -0.7333$ and $\beta_1 = 2.4143$.

Table 2.4.

    x       y       x^2     xy
    1.0     1.6     1.0     1.6
    2.0     4.2     4.0     8.4
    3.0     6.6     9.0     19.8
    4.0     8.7     16.0    34.8
    5.0     11.5    25.0    57.5
    6.0     13.7    36.0    82.2
    Σx = 21.0   Σy = 46.3   Σx^2 = 91.0   Σxy = 204.3

Next we wish to assess the quality of the fit by calculating the residuals. The values $\hat{y}$ as predicted by the regression line at the x values specified in the data set are determined using the polyval function. The polyval function accepts as its arguments the coefficients of a polynomial (specified in descending order of powers of x) and the x values at which the polynomial is to be evaluated:

x = [1.0 2.0 3.0 4.0 5.0 6.0];

y = [1.6 4.2 6.6 8.7 11.5 13.7];

r = y - polyval([2.4143 -0.7333], x)   % residuals: data minus fitted values
r =
   -0.0810 0.1047 0.0904 -0.2239 0.1618 -0.0525

These residuals will be used to obtain a single number that will provide a measure of the quality of the fit.
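As a cross-check on the hand calculation, MATLAB's built-in polyfit function can fit the same straight line directly. This is a minimal sketch using the data of Example 2.7; it is an alternative route to the coefficients, not the method used in the example, and the variable names `p` and `SSR` are illustrative.

```matlab
x = [1.0 2.0 3.0 4.0 5.0 6.0];
y = [1.6 4.2 6.6 8.7 11.5 13.7];
p = polyfit(x, y, 1);        % first-order fit: p(1) = slope, p(2) = intercept
r = y - polyval(p, x);       % residuals using the unrounded coefficients
SSR = sum(r.^2);             % sum of squared residuals
fprintf('slope = %.4f, intercept = %.4f, SSR = %.4f\n', p(1), p(2), SSR);
```

Note that polyfit returns the coefficients in descending order of powers, the same ordering that polyval expects, so the slope comes first and the intercept second.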

Equation (2.36) is compact and easy to evaluate. However, a drawback in the method used for obtaining the individual equations (2.34) and (2.35) is that the process must be repeated for every different function that relates y to its independent variables. To emphasize this disadvantage, we now derive the normal equations for a polynomial fitted to data that relates x and y.

Often a polynomial function is used to describe the relationship between y and x. The nature of the fitted polynomial curve, i.e. the number of bends or kinks in the curve, is a function of the order of the polynomial. A data set of n + 1 points with distinct x values can be exactly represented by an nth-order polynomial (the nth-order polynomial will pass through all n + 1 data points). If many data points are known to exceedingly good accuracy, then it may be useful to use higher-order polynomials. However, in the majority of situations, the data set itself is only approximate, and forcing the fitted curve to pass through every point is not justified. As the order of the polynomial increases, the fitted curve may not necessarily connect the points with a smooth curve, but may display erratic oscillatory behavior between the data points. Unless the biological phenomenon or engineering application that the data describe suggests such complicated dependencies, higher-order polynomials (fourth-order and higher) are best avoided. With higher-order polynomials, the related normal equations are prone to ill-conditioning and round-off errors, which corrupt the accuracy of the solution.
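The ill-conditioning mentioned above can be observed directly by computing the condition number of the normal-equation matrix for increasing polynomial order. The sketch below assumes ten hypothetical, equally spaced sample locations on [0, 1]; the exact numbers depend on that choice, but the growth with order is the point.

```matlab
x = linspace(0, 1, 10)';           % hypothetical sample locations (an assumption)
maxOrder = 6;
k = zeros(1, maxOrder);
for n = 1:maxOrder
    A = ones(numel(x), n + 1);     % columns: 1, x, x.^2, ..., x.^n
    for j = 1:n
        A(:, j + 1) = x.^j;
    end
    k(n) = cond(A' * A);           % condition number of the normal-equation matrix
end
disp(k)                            % grows rapidly with the polynomial order
```

A large condition number means that small round-off errors in forming or solving the normal equations can produce large errors in the computed coefficients.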

We derive the normal equations for fitting a second-order polynomial to an (x, y) data set. The quadratic dependence of y on x is expressed as

$$y = \beta_0 + \beta_1 x + \beta_2 x^2.$$

The undetermined model parameters that we seek are $\beta_0$, $\beta_1$, and $\beta_2$. The objective function to be minimized is

$$\mathrm{SSR} = \sum_{i=1}^{m} \left( y_i - (\beta_0 + \beta_1 x_i + \beta_2 x_i^2) \right)^2. \tag{2.40}$$

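The partial derivatives of Equation (2.40) are taken with respect to each of the three unknown model parameters and are set equal to zero to determine the equations that simultaneously represent the minimum of the SSR:

$$\frac{d\,\mathrm{SSR}}{d\beta_0} = \sum_{i=1}^{m} -2 \left( y_i - (\beta_0 + \beta_1 x_i + \beta_2 x_i^2) \right) = 0,$$

$$\frac{d\,\mathrm{SSR}}{d\beta_1} = \sum_{i=1}^{m} -2 x_i \left( y_i - (\beta_0 + \beta_1 x_i + \beta_2 x_i^2) \right) = 0,$$

$$\frac{d\,\mathrm{SSR}}{d\beta_2} = \sum_{i=1}^{m} -2 x_i^2 \left( y_i - (\beta_0 + \beta_1 x_i + \beta_2 x_i^2) \right) = 0.$$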


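The final form of the above three equations is as follows:

$$m\beta_0 + \beta_1 \sum_{i=1}^{m} x_i + \beta_2 \sum_{i=1}^{m} x_i^2 = \sum_{i=1}^{m} y_i, \tag{2.41}$$

$$\beta_0 \sum_{i=1}^{m} x_i + \beta_1 \sum_{i=1}^{m} x_i^2 + \beta_2 \sum_{i=1}^{m} x_i^3 = \sum_{i=1}^{m} x_i y_i, \tag{2.42}$$

$$\beta_0 \sum_{i=1}^{m} x_i^2 + \beta_1 \sum_{i=1}^{m} x_i^3 + \beta_2 \sum_{i=1}^{m} x_i^4 = \sum_{i=1}^{m} x_i^2 y_i, \tag{2.43}$$

which can be compactly represented as

$$\begin{bmatrix} m & \sum x_i & \sum x_i^2 \\ \sum x_i & \sum x_i^2 & \sum x_i^3 \\ \sum x_i^2 & \sum x_i^3 & \sum x_i^4 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \\ \sum x_i^2 y_i \end{bmatrix}, \tag{2.44}$$

where every sum runs over $i = 1, \ldots, m$.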
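The quadratic normal equations can be assembled directly from the sums and solved with the backslash operator. The sketch below assumes hypothetical data generated exactly from $y = 1 + 2x + 3x^2$, so the recovered coefficients are known in advance; the names `M`, `b` and `c` are illustrative.

```matlab
% Hypothetical data generated from y = 1 + 2x + 3x^2 (not from the text)
x = [0 1 2 3 4]';
y = 1 + 2*x + 3*x.^2;
m = numel(x);

% Coefficient matrix and right-hand side of Equation (2.44)
M = [m          sum(x)     sum(x.^2);
     sum(x)     sum(x.^2)  sum(x.^3);
     sum(x.^2)  sum(x.^3)  sum(x.^4)];
b = [sum(y); sum(x.*y); sum(x.^2.*y)];

c = M\b;      % c = [beta0; beta1; beta2]
disp(c')
```

Because the data contain no noise, the solution reproduces the generating coefficients 1, 2 and 3 to within round-off.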

Equation (2.44) constitutes the normal equations for a quadratic function in x, and has different terms compared to the normal equations (Equation (2.36)) for a linear function in x. In fact, the normal equations can be derived for any overdetermined linear system, as long as the form of dependence of y on x involves a linear combination of functions in x. In other words, any functional relationship of the form

$$y = \beta_1 f_1(x) + \beta_2 f_2(x) + \cdots + \beta_n f_n(x)$$

can be fitted to data using a linear least-squares approach. The individual functions $f_1, f_2, \ldots, f_n$ can involve polynomial functions, trigonometric functions, logarithms, and so on. For example, using any of the following fitting functions, a given data set is amenable to linear least-squares regression:

$$y = \frac{\beta_1}{x} + \beta_2 \tan x;$$

$$y = \sqrt{x}\,(\beta_1 \sin x + \beta_2 \cos x);$$

$$y = \beta_1 \ln x + \beta_2 x^3 + \beta_3 x^{1/3}.$$

Rederiving the regression or normal equations for every new model is tedious and cumbersome. A generalized method allows us to perform least-squares linear regression much more efficiently. Assume that the functional dependence of y on x can be expressed as

$$y = \beta_0 + \beta_1 f_1(x) + \cdots + \beta_n f_n(x). \tag{2.45}$$

If we have a data set that consists of m (x, y) pairs, then using Equation (2.45) we can write m equations that are linear in n + 1 coefficients. We can write this set of m equations in compact matrix form. If we set

$$\mathbf{A} = \begin{bmatrix} 1 & f_1(x_1) & \cdots & f_n(x_1) \\ 1 & f_1(x_2) & \cdots & f_n(x_2) \\ \vdots & \vdots & & \vdots \\ 1 & f_1(x_m) & \cdots & f_n(x_m) \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix},$$

and $\mathbf{c}$ as the vector of the coefficients $\beta_0, \beta_1, \ldots, \beta_n$ whose values we seek, then we obtain the equation $\mathbf{A}\mathbf{c} = \mathbf{y}$. Here the number of equations (one equation for each data point) is greater than the number of unknown coefficients, i.e. $m > n + 1$. The system of equations is overdetermined if the rank of $\mathbf{A}$ is less than the rank of the augmented matrix $[\mathbf{A}\ \mathbf{y}]$. In this case an exact solution does not exist since $\mathbf{y}$ is located outside the column space of $\mathbf{A}$. As a result, no vector of coefficients $\mathbf{c}$ exists that allows $\mathbf{A}\mathbf{c}$ to equal $\mathbf{y}$ exactly. Thus, $\mathbf{y} - \mathbf{A}\mathbf{c} \neq \mathbf{0}$ for any $\mathbf{c}$.
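The matrix A for a model of the form of Equation (2.45) can be built column by column from a list of basis functions. The sketch below is one possible way to do this, assuming two hypothetical basis functions ($\ln x$ and $x^3$) and synthetic noise-free data generated from known coefficients; none of these choices come from the text.

```matlab
f = {@log, @(x) x.^3};        % hypothetical basis functions f_1, f_2
x = (1:6)';                   % hypothetical x values (column vector)
c_true = [0.5; 2; -0.1];      % assumed coefficients used to generate y

% One row per data point; first column of ones multiplies beta_0
A = ones(numel(x), numel(f) + 1);
for j = 1:numel(f)
    A(:, j + 1) = f{j}(x);    % column j+1 holds f_j evaluated at every x_i
end

y = A*c_true;                 % synthetic, noise-free data
c = (A'*A)\(A'*y);            % least-squares coefficients [beta_0; beta_1; beta_2]
disp(c')
```

With noise-free data the system is consistent, so the computed coefficients match the assumed ones to within round-off. With measured data the same two lines for building A and solving for c apply, but $\mathbf{A}\mathbf{c}$ will no longer equal $\mathbf{y}$ exactly.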

In order to determine a best fit, we must minimize the norm of the residual $\|\mathbf{r}\|$, where $\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{A}\mathbf{c}$. From the definition of the norm (Euclidean norm) given by Equation (2.9), $\|\mathbf{r}\| = \sqrt{\mathbf{r} \cdot \mathbf{r}} = \sqrt{\mathbf{r}^T \mathbf{r}}$. We wish to find the value of $\mathbf{c}$ that minimizes $\mathbf{r}^T \mathbf{r} = \sum_{i=1}^{m} r_i^2$, or the smallest value of the sum of squares of the individual residuals corresponding to each data point. The linear least-squares regression equations (normal equations) can be derived in two different ways.

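Geometric method

The product $\mathbf{A}\mathbf{c}$ produces a vector that is a linear combination of the column vectors of $\mathbf{A}$, and thus lies in the column space of $\mathbf{A}$. To minimize $\|\mathbf{r}\|$, we need to determine the value of $\mathbf{A}\mathbf{c}$ that is closest to $\mathbf{y}$. It can be shown (in Euclidean space) that the orthogonal projection of $\mathbf{y}$ in the column space of $\mathbf{A}$ lies closer to $\mathbf{y}$ than any other vector in the column space of $\mathbf{A}$. If we denote the orthogonal projection of $\mathbf{y}$ in the column space of $\mathbf{A}$ as $\hat{\mathbf{y}}$, and if

$$\mathbf{y} = \hat{\mathbf{y}} + \mathbf{y}_\perp \quad \text{and} \quad \hat{\mathbf{y}} \perp \mathbf{y}_\perp,$$

then the residual

$$\mathbf{r} = \mathbf{y}_\perp = \mathbf{y} - \hat{\mathbf{y}}.$$

Figure 2.10 depicts $\hat{\mathbf{y}}$ in the column space of $\mathbf{A}$. Any other vector equal to $\mathbf{y} - \mathbf{y}'$, where $\mathbf{y}'$ is a non-orthogonal projection of $\mathbf{y}$ in the column space of $\mathbf{A}$, as shown in Figure 2.10, will necessarily be longer than $\mathbf{y}_\perp$. We seek to minimize the norm of

$$\mathbf{r} = \mathbf{y}_\perp = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{A}\mathbf{c}.$$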

Figure 2.10 Geometrical interpretation of the least-squares solution, illustrating the orthogonal projection of y in the column space of A. [Drawing not reproduced. Labels: column space of A; y′.]

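Now, $\mathbf{y}_\perp$ is normal or orthogonal to the column space of $\mathbf{A}$ (see Figure 2.10). The column space of $\mathbf{A}$ = row space of $\mathbf{A}^T$. Therefore, $\mathbf{y}_\perp$ is orthogonal to the row space of $\mathbf{A}^T$. Mathematically, we write

$$\mathbf{A}^T \mathbf{y}_\perp = \mathbf{0} \quad \text{or} \quad \mathbf{A}^T (\mathbf{y} - \mathbf{A}\mathbf{c}) = \mathbf{0},$$

or

$$\mathbf{A}^T \mathbf{A} \mathbf{c} = \mathbf{A}^T \mathbf{y}. \tag{2.46}$$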

Equation (2.46) is called the normal system, and the resulting set of equations is called the normal equations. Why are these linear equations referred to as the "normal equations"? The matrix equation given by Equation (2.46) describes the condition that $\mathbf{y} - \mathbf{A}\mathbf{c}$ must be normal to the column space of $\mathbf{A}$ in order to obtain the set of values $\mathbf{c}$ that yields the best least-squares approximation $\hat{\mathbf{y}}$ to $\mathbf{y}$. The normal equations have a unique solution if $|\mathbf{A}^T \mathbf{A}| \neq 0$ and infinitely many solutions if $|\mathbf{A}^T \mathbf{A}| = 0$.
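The orthogonality condition behind the normal equations can be verified numerically. The sketch below uses the data of Example 2.7 and checks that the residual of the least-squares solution is orthogonal to every column of A; the variable names `orth` and `d` are illustrative.

```matlab
x = [1.0 2.0 3.0 4.0 5.0 6.0]';
y = [1.6 4.2 6.6 8.7 11.5 13.7]';
A = [ones(size(x)) x];       % columns: 1 and x
c = (A'*A)\(A'*y);           % solve the normal equations
r = y - A*c;                 % residual vector
orth = A'*r;                 % zero (to round-off) if r is normal to the column space
d = det(A'*A);               % nonzero determinant: the solution is unique
fprintf('norm(A''*r) = %.2e, det(A''*A) = %g\n', norm(orth), d);
```

Each entry of `A'*r` is the dot product of one column of A with the residual, so a result of zero to within round-off confirms that the residual is normal to the column space.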

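Equation (2.46) is exactly the same as Equation (2.36) for the case of regression of a straight line to data. If $y = a_0 + a_1 x$ and

$$\mathbf{A} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix},$$

show that $\mathbf{A}^T \mathbf{A}$ and $\mathbf{A}^T \mathbf{y}$ produce the corresponding matrices given in Equation (2.36).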
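One way to carry out this check is to multiply the matrices out entry by entry. The following worked step is a sketch of that calculation, with all sums running over $i = 1, \ldots, m$:

```latex
\mathbf{A}^T \mathbf{A}
= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_m \end{bmatrix}
  \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix}
= \begin{bmatrix} m & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix},
\qquad
\mathbf{A}^T \mathbf{y}
= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_m \end{bmatrix}
  \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}
= \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}.
```

These are the coefficient matrix and right-hand side of Equation (2.36), with $a_0$ and $a_1$ in place of $\beta_0$ and $\beta_1$.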

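Algebraic method

We seek the minimum of

$$\|\mathbf{r}\|^2 = \mathbf{r}^T \mathbf{r} = (\mathbf{y} - \hat{\mathbf{y}})^T (\mathbf{y} - \hat{\mathbf{y}}) = (\mathbf{y} - \mathbf{A}\mathbf{c})^T (\mathbf{y} - \mathbf{A}\mathbf{c}),$$

$$\|\mathbf{r}\|^2 = \mathbf{y}^T \mathbf{y} - \mathbf{y}^T \mathbf{A} \mathbf{c} - \mathbf{c}^T \mathbf{A}^T \mathbf{y} + \mathbf{c}^T \mathbf{A}^T \mathbf{A} \mathbf{c}.$$

Each term in the above equation is a scalar number. Note that the transpose operation follows the rule $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T \mathbf{A}^T$.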

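Determining the minimum is equivalent to finding the gradient of $\mathbf{r}^T \mathbf{r}$ with respect to $\mathbf{c}$ and equating it to zero:

$$\nabla_{\mathbf{c}} (\mathbf{r}^T \mathbf{r}) = \mathbf{0} - \mathbf{A}^T \mathbf{y} - \mathbf{A}^T \mathbf{y} + \mathbf{A}^T \mathbf{A} \mathbf{c} + \mathbf{A}^T \mathbf{A} \mathbf{c} = \mathbf{0},$$

where

$$\nabla_{\mathbf{c}} = \left( \frac{\partial}{\partial \beta_0}, \frac{\partial}{\partial \beta_1}, \ldots, \frac{\partial}{\partial \beta_n} \right)^T$$

and each term in the above equation is a column vector. The final form of the gradient of the squared residual norm is given by

$$\mathbf{A}^T \mathbf{A} \mathbf{c} = \mathbf{A}^T \mathbf{y}.$$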

We obtain the same system of normal equations as before.


Two main steps are involved when applying the normal equations (Equation (2.46)):

(1) construction of $\mathbf{A}$, and

(2) solution of $\mathbf{c} = (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{y}$, in MATLAB as c = (A'*A)\(A'*y).
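The two steps can be sketched as follows, using the data of Example 2.7. For comparison, the sketch also applies the backslash operator to A itself: for a rectangular, overdetermined system MATLAB's backslash returns a least-squares solution without forming $\mathbf{A}^T\mathbf{A}$. That alternative is not the method derived in this section and is shown only as a cross-check; the names `c_normal` and `c_direct` are illustrative.

```matlab
x = [1.0 2.0 3.0 4.0 5.0 6.0]';
y = [1.6 4.2 6.6 8.7 11.5 13.7]';

A = [ones(size(x)) x];        % step (1): construct A
c_normal = (A'*A)\(A'*y);     % step (2): solve the normal equations
c_direct = A\y;               % cross-check: least-squares solve applied to A itself

fprintf('normal equations: %.4f %.4f\n', c_normal);
fprintf('direct backslash: %.4f %.4f\n', c_direct);
```

Both routes give the same intercept and slope for this well-conditioned problem. Note the apostrophes in `A'`: they denote the transpose and must be plain ASCII apostrophes, not typographic quotation marks.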

2.9.2 Coefficient of determination and quality of fit

The mean value $\bar{y}$ can be used to represent a set of data points $y_i$, and is a useful statistic if the scatter can be explained by the stochastic nature of a process or errors in measurement. If the data show a trend with respect to a change in a property or condition (i.e. the independent variable(s)), then the mean value will not capture the nature of this trend. A model fitted to the data is expected to capture the observed trend either wholly or partially. The outcome of the fit will depend on many factors, such as the type of model chosen and the magnitude of error in the data. A "best-fit" model is usually a superior way to approximate data than simply stating the mean of all $y_i$ points, since the model contains at least one adjustable coefficient and a term that represents the nature of dependency of y on the independent variable(s).

The difference $(y_i - \bar{y})$ (also called deviation) measures the extent of deviation of each data point from the mean. The sum of the squared differences $\sum (y_i - \bar{y})^2$ is a useful quantitative measure of the spread of the data about the mean. When a fitted function is used to approximate the data, the sum of the squared residuals is $\|\mathbf{r}\|^2 = \sum (y_i - \hat{y}_i)^2$. The coefficient of determination $R^2$ is popularly used to determine the quality of the fit; $R^2$ conveys the improvement attained in using the model to describe the data, compared to using a horizontal line that passes through the mean. Calculation of $R^2$ involves comparing the deviations of the y data points from the model prediction, i.e. $\sum (y_i - \hat{y}_i)^2$, with the deviations of the y data points from the mean, i.e. $\sum (y_i - \bar{y})^2$. Note that $R^2$ is a number between zero and one and is calculated as shown in Equation (2.47):

$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} = 1 - \frac{\|\mathbf{r}\|^2}{\sum (y_i - \bar{y})^2}. \tag{2.47}$$

The summation subscripts have been omitted to improve readability. If $\sum (y_i - \hat{y}_i)^2$ is much less than $\sum (y_i - \bar{y})^2$, then the model is a better approximation to the data compared to the mean of the data. In that case, $R^2$ will be close to unity. If the model poorly approximates the data, the deviations of the data points from the model will be comparable to the data variation about the mean and $R^2$ will be close to zero.
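The two limiting cases described above can be illustrated with a short calculation. The sketch below defines Equation (2.47) as an anonymous function and evaluates it for a hypothetical data set with three hypothetical sets of model predictions; the function name `R2` and all values are illustrative assumptions.

```matlab
% Equation (2.47) as an anonymous function of the data and the model predictions
R2 = @(y, yhat) 1 - sum((y - yhat).^2) / sum((y - mean(y)).^2);

y = [2 4 6 8 10];                            % hypothetical data
r2_mean    = R2(y, mean(y)*ones(size(y)));   % model = horizontal line through the mean
r2_perfect = R2(y, y);                       % model reproduces every data point
r2_partial = R2(y, [2.5 3.5 6 8.5 9.5]);     % hypothetical imperfect predictions
fprintf('R2: mean model %.3f, perfect model %.3f, imperfect model %.3f\n', ...
        r2_mean, r2_perfect, r2_partial);
```

The horizontal line through the mean gives $R^2 = 0$ (no improvement over the mean), the perfect model gives $R^2 = 1$, and the imperfect predictions give 0.975 because their squared residuals sum to 1 against a total squared deviation of 40.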

Example 2.7 (continued)

We readdress the problem of fitting the data set to a straight line, but this time Equation (2.46) is used

instead of Equation (2.36). For this system of equations

$$\mathbf{A} = \begin{bmatrix} 1 & 1.0 \\ 1 & 2.0 \\ 1 & 3.0 \\ 1 & 4.0 \\ 1 & 5.0 \\ 1 & 6.0 \end{bmatrix} \quad \text{and} \quad \mathbf{y} = \begin{bmatrix} 1.6 \\ 4.2 \\ 6.6 \\ 8.7 \\ 11.5 \\ 13.7 \end{bmatrix}.$$

Now, evaluating in MATLAB

A = [1 1; 1 2; 1 3; 1 4; 1 5; 1 6]; y = [1.6; 4.2; 6.6; 8.7; 11.5; 13.7];
c = (A'*A)\(A'*y)   % solve the normal equations, Equation (2.46)
c =
   -0.7333
    2.4143

The elements of c are exactly equal to the slope and intercept values determined in Example 2.7. Next, we

calculate the coefficient of determination using MATLAB to determine how well the line fits the data:

ybar = mean(y)
ybar =
    7.7167
x = [1 2 3 4 5 6];
yhat = A*c;   % values predicted by the fitted line
yhat'
ans =
    1.6810 4.0952 6.5095 8.9238 11.3381 13.7524
SSR = sum((y - yhat(:)).^2) % sum of squared residuals
SSR =
    0.1048
R2 = 1 - SSR/sum((y - ybar).^2) % coefficient of determination
R2 =
    0.9990

The straight-line fit is indeed a good approximation.

Using MATLAB

(1) Common statistical functions such as mean and std (standard deviation) are

included in all versions of MATLAB.

(2) The element-wise operator .^ raises each element of a vector to the given power, i.e. the exponentiation operation is performed element-by-element; here .^2 squares each term in the vector.

(3) Calling a vector or matrix y(:) or A(:) retrieves all of the elements as a single

column. For matrices, this usage stacks the columns vertically.
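The three points above can be tried on small arrays. The values below are arbitrary illustrations, not data from the text.

```matlab
v = [1 2 3];
v_sq = v.^2;                     % element-by-element square: [1 4 9]

M = [1 2; 3 4];
col = M(:);                      % columns stacked vertically: [1; 3; 2; 4]

d = [2 4 4 4 5 5 7 9];
avg = mean(d);                   % arithmetic mean
s = std(d);                      % sample standard deviation (normalized by N-1)
disp(v_sq); disp(col'); fprintf('mean = %g, std = %.4f\n', avg, s);
```

Note that `M(:)` reads down the first column before moving to the second, which is why the result is 1, 3, 2, 4 rather than 1, 2, 3, 4.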

Box 2.4B Platelet flow rheology

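We wish to fit the function

$$\alpha_{\mathrm{crit}} = \beta_0 + \beta_1 H + \beta_2 H^2,$$

where $\alpha_{\mathrm{crit}}$ is the critical angle of tilt of the platelet at which the transition in flow regime occurs. Using the data, we define the matrices

$\mathbf{A}$ = [matrix entries lost in extraction: one row per data point, with columns for the constant, $H$ and $H^2$ terms] and

$$\mathbf{y} = \begin{bmatrix} \alpha_{\mathrm{crit},1} \\ \alpha_{\mathrm{crit},2} \\ \alpha_{\mathrm{crit},3} \\ \alpha_{\mathrm{crit},4} \\ \alpha_{\mathrm{crit},5} \end{bmatrix}.$$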

We then apply the normal equations to solve for the coefficients of the quadratic function. In MATLAB,

we type
