A mere report of relations observed in the past cannot be called knowledge; if knowledge is to
reveal objective relations of physical objects, it must include reliable predictions.
The purpose of any regression method is to extract, from a mass of data, information in
the form of a predictive mathematical equation. The sources of scientific and engineering
data include experimental observations obtained in a laboratory or out in the field,
instrument output such as temperature, concentration, pH, voltage, weight, and tensile
strength, and quality control measurements such as purity, yield, shine, and hardness.
Engineering/scientific data sets contain many paired values, i.e. a dependent variable
paired with one or more independent variables. The topic of linear least-squares
regression as a curve-fitting method is discussed extensively in Chapter 2. In
Sections 2.8–2.11, we derived the normal equations, which upon solution yield
estimates of the parameters of the linear model, and introduced the residual, SSR (sum of
the squared residuals), and the coefficient of determination $R^2$. Note that for any
mathematical model amenable to linear least-squares regression, linearity of the equation is
with respect to the undetermined parameters only. The equation may have any form of
dependency on the dependent and independent variables. The method of linear regression
involves setting up a mathematical model that is linear in the model parameters to
explain the trends observed in the data. If $y$ depends on the value of $x$, the independent
variable, then the linear regression equation is set up as follows:
$$y = \beta_1 f_1(x) + \beta_2 f_2(x) + \cdots + \beta_n f_n(x) + E,$$
where $f$ can be any function of $x$, and $\beta_1, \beta_2, \ldots, \beta_n$ are the model parameters. The
individual functions $f_i(x)$ $(1 \le i \le n)$ may depend nonlinearly on $x$.
The dependent variable $y$ is a random variable that exhibits some variance $\sigma^2$ about a mean value
$$\mu_y = \beta_1 f_1(x) + \beta_2 f_2(x) + \cdots + \beta_n f_n(x).$$
The term $E$ represents the random error or variability associated with $y$.
When the number of data pairs in the data set $(x, y)$
exceeds the number of unknown parameters in the model, the resulting linear system
of equations is overdetermined. As discussed in Chapter 2, a linear least-squares
curve does not pass through every $(x_i, y_i)$ data point.
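To make this concrete, the following is a minimal sketch (not from the text) of how such an overdetermined fit might be set up in Python/NumPy. The basis functions $f_1(x) = 1$, $f_2(x) = x$, $f_3(x) = x^2$ and the synthetic data are assumptions chosen purely for illustration; any functions of $x$ that keep the model linear in the parameters would serve.

```python
import numpy as np

# Illustrative (assumed) basis functions f_i(x); the model
# y = b1*f1(x) + b2*f2(x) + b3*f3(x) + E is linear in the parameters
# even though f3 depends nonlinearly on x.
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]

# Synthetic data, assumed for illustration: more data pairs (20) than
# parameters (3), so the linear system is overdetermined.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 20)
y = 1.5 + 0.8 * x - 0.3 * x**2 + rng.normal(scale=0.2, size=x.size)

# Design matrix A with one column per basis function f_i(x).
A = np.column_stack([f(x) for f in basis])

# Solve the overdetermined system A c ~= y in the least-squares sense;
# this is equivalent to solving the normal equations (A^T A) c = A^T y.
c, residual_ss, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print("estimated coefficients c1..c3:", c)
```

Solving with `np.linalg.lstsq` is numerically preferable to forming and inverting the normal equations explicitly, although both approaches yield the same least-squares estimates.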
The least-squares method is an estimation procedure that uses the limited data at
hand to quantify the parameters of the proposed mathematical model and consequently
yield the following predictive model:
$$\hat{y} = c_1 f_1(x) + c_2 f_2(x) + \cdots + c_n f_n(x),$$
where $\hat{y}$ are the model predictions of $y$. The hat symbol (^) indicates that the $y$ values
are generated using model parameters derived from a data set. Because the values of the
regression-derived model coefficients depend on the particular data sample, they are
subject to sampling error. The computed values of the regression equation coefficients
$c_1, c_2, \ldots, c_n$ serve as estimates of the true parameters $\beta_1, \beta_2, \ldots, \beta_n$ of the model.
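Continuing the assumed example above, the fitted coefficients define $\hat{y}$, from which the residuals, SSR, and $R^2$ introduced in Chapter 2 follow directly:

```python
# Model predictions y_hat = c1*f1(x) + c2*f2(x) + c3*f3(x).
y_hat = A @ c

# Residuals and the sum of squared residuals (SSR).
residuals = y - y_hat
SSR = np.sum(residuals**2)

# Coefficient of determination R^2: the fraction of the variance in y
# explained by the regression.
SST = np.sum((y - y.mean())**2)
R2 = 1.0 - SSR / SST
print(f"SSR = {SSR:.4f}, R^2 = {R2:.4f}")
```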
We use confidence intervals to quantify the uncertainty in using $c_1, c_2, \ldots, c_n$
as estimates for the actual model parameters $\beta_1, \beta_2, \ldots, \beta_n$. Sometimes one or more
of the $c_i$ parameters have a physical significance beyond the predictive model, in
which case the confidence interval has additional importance when reporting the
best-fit value; a sketch of one way to compute such intervals follows the assumptions
below. Before we can proceed further, we must draw your attention to the
basic assumptions made by the linear regression model.
(1) The measured values of the independent variable $x$ are known with perfect precision
and do not contain any error.
(2) Every $y_i$ value is normally distributed about its mean $\mu_{y_i}$ with an unknown variance
$\sigma^2$. The variance of $y$ is independent of $x$, and is thus the same for all $y_i$.
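Under assumptions (1) and (2), a standard construction for coefficient confidence intervals uses the $t$-distribution with $m - n$ degrees of freedom, where $m$ is the number of data pairs and $n$ the number of parameters. The sketch below continues the assumed example; the specific formulas presented later in the text may differ in notation.

```python
from scipy import stats

m, n = A.shape        # number of data pairs, number of parameters
s2 = SSR / (m - n)    # unbiased estimate of the error variance sigma^2

# Standard errors of the coefficients from the diagonal of s2 * (A^T A)^(-1);
# explicit inversion is acceptable here because A^T A is small.
cov_c = s2 * np.linalg.inv(A.T @ A)
se_c = np.sqrt(np.diag(cov_c))

# 95% confidence intervals: c_i +/- t_{0.975, m-n} * se(c_i).
t_crit = stats.t.ppf(0.975, df=m - n)
for i, (ci, sei) in enumerate(zip(c, se_c), start=1):
    print(f"c{i}: {ci:.4f} +/- {t_crit * sei:.4f}")
```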