The major objectives of applying statistics in science, and in chemical analysis in particular, are to determine the
best value from a series of analytical results obtained with a particular sample and to give some
indication of the reliability of the analysis.
As scientists, we are interested in finding results that apply to an
entire population, whether of people, materials or chemical substances. For example,
biologists might be interested in processes that occur in all cells, chemists
want to determine the concentration of an analyte X in an entire batch of a food product such as milk, and economists want
to build models that apply to all pensions.
There is no doubt that it would be extremely expensive and
impractical to analyze the entire population to draw conclusions about certain
variables.
Therefore, in most cases we collect data from a small subset of the
population (known as a sample) and use these data to infer things about the
population as a whole. The statistical measurements made on this small sample
serve as the “model” for the entire population.
The bigger the sample, the
more likely it is to reflect the whole population. If we take several random
samples from the population, each of these samples will give us slightly
different results. However, on average, results from large samples should be
fairly similar.
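The effect of sample size can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population is a hypothetical set of 10,000 simulated values (not real measurements), and the function name range_of_sample_means is made up for this example. It draws many random samples of two different sizes and compares how far apart the sample means fall.

```python
import random

random.seed(1)  # fixed seed so the sketch gives the same result on every run

# Hypothetical population of 10,000 simulated values (assumed, not real data).
population = [random.gauss(0.11, 0.001) for _ in range(10_000)]


def range_of_sample_means(sample_size, repeats=500):
    """Draw `repeats` random samples and return max(mean) - min(mean)."""
    means = []
    for _ in range(repeats):
        sample = random.sample(population, sample_size)
        means.append(sum(sample) / len(sample))
    return max(means) - min(means)


small = range_of_sample_means(5)     # many small samples
large = range_of_sample_means(100)   # many large samples

print(f"range of sample means, n = 5:   {small:.5f}")
print(f"range of sample means, n = 100: {large:.5f}")
```

The means of the larger samples should fall within a much narrower range, which is what "fairly similar" means in practice.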
Simple Statistical Models – The mean, sum of squares, variance and
standard deviations
One of the simplest models used in statistics is the mean. It can be calculated for any data
set and represents a summary of data. The mean is a hypothetical value; it does
not have to be a value that is actually observed in the data set.
For example, if we took 5 university professors and measured the
number of graduate students they supervised we might find the following: 3, 3,
3, 4, 1. If we take the mean number of graduate students per professor we have:
Mean = (3 + 3 + 3 + 4 + 1) / 5 = 14/5 = 2.8
However, we do know that it is impossible to have 2.8 students so
the mean value is a hypothetical value. As such, the mean is a model created to
summarize our data.
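The same calculation can be written as a short Python sketch. The variable names are chosen here for illustration only; the data are the five values from the example above.

```python
students = [3, 3, 3, 4, 1]  # graduate students supervised by each professor

mean = sum(students) / len(students)

print(mean)               # 2.8
print(mean in students)   # False: the mean is not one of the observed values
```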
Let us consider another example. Suppose we would like to
determine the concentration of aflatoxin M1, a toxic substance, in a batch of milk. Seven
samples are collected from the batch and are analyzed by HPLC/Fluorescence. The
results are given below in Table I.1:
Table I.1: Concentration of Aflatoxin M1 (μg/kg) in 7 samples
collected from a batch of milk
Sample Number | Aflatoxin M1 Concentration (μg/kg)
1 | 0.1097
2 | 0.1095
3 | 0.1099
4 | 0.1089
5 | 0.1091
6 | 0.1096
7 | 0.1095
The mean concentration value of Aflatoxin M1 in the above samples is
calculated as follows:
Mean = (0.1097 + 0.1095 + 0.1099 + 0.1089 + 0.1091 + 0.1096 + 0.1095) / 7 = 0.7662 / 7 = 0.10946,
taken as 0.1094 μg/kg in the calculations that follow
As before, the mean is a model created to summarize our data.
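As a quick check, the mean of the seven results in Table I.1 can be computed in Python. This is a minimal sketch; the list name is an assumption made for this example. Printed to five decimal places the mean is 0.10946, which the text quotes to four decimal places as 0.1094.

```python
# Aflatoxin M1 concentrations (μg/kg) from Table I.1
concentrations = [0.1097, 0.1095, 0.1099, 0.1089, 0.1091, 0.1096, 0.1095]

mean = sum(concentrations) / len(concentrations)

print(f"{mean:.5f}")  # 0.10946
```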
How can we determine if the mean is an accurate model?
We can determine whether this is an accurate model by looking at the
difference between the data we observed and the fitted model (i.e. the mean).
Graph #1 shows the assay results obtained for the concentration of aflatoxin M1 in
each sample, together with the mean that we calculated earlier. The line
representing the mean (red line) can be thought of as our model, and the dots are
the actual observed data. The distance (deviance)
of each dot from the mean is the error in our model. We can calculate the
magnitude of these deviances by simply subtracting the mean value x̅ from each of the observed values xᵢ.
Graph #1: Concentration of aflatoxin M1 vs. Sample Number
We can try to determine how accurate our model is by adding up all the
deviances, which gives an estimate of the total error:
Total error = sum of deviances = Σ (xᵢ - x̅) = (0.1097 - 0.1094) + (0.1095 - 0.1094) + (0.1099 - 0.1094)
+ (0.1089 - 0.1094) + (0.1091 - 0.1094) + (0.1096 - 0.1094) + (0.1095 - 0.1094)
= 0.0004 ≈ 0
The small remainder of 0.0004 appears only because the mean was rounded to 0.1094; deviances taken from the unrounded mean always sum to exactly zero.
Table I.2: Calculation of the sum of deviances Σ (xᵢ - x̅) (Column 4) and the sum of squares (Column 6) of the concentration of aflatoxin
Taken at face value, the above result tells us that there
is no total error between our model and the observed data, and so the mean is a
perfect representation of our data. However, this is not true: some observations
lie above the mean and others below it, so the positive and negative errors
simply cancel each other out.
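The cancellation can be seen numerically. The following sketch (variable names are assumptions made for this example) sums the deviances twice: once from the rounded mean 0.1094 used in the text, and once from the unrounded mean.

```python
# Aflatoxin M1 concentrations (μg/kg) from Table I.1
concentrations = [0.1097, 0.1095, 0.1099, 0.1089, 0.1091, 0.1096, 0.1095]

rounded_mean = 0.1094                                   # value used in the text
exact_mean = sum(concentrations) / len(concentrations)  # unrounded mean

sum_dev_rounded = sum(x - rounded_mean for x in concentrations)
sum_dev_exact = sum(x - exact_mean for x in concentrations)

print(f"{sum_dev_rounded:.4f}")    # 0.0004
print(abs(sum_dev_exact) < 1e-12)  # True: zero apart from floating-point noise
```

Whatever the data, deviances from the mean sum to zero, so this total says nothing about how well the mean fits.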
Therefore, a better way to show the total error
is to square each error and then add them all up. This gives the so-called sum of squared errors (SS):
Sum of squared errors (SS) = Σ (xᵢ - x̅)(xᵢ - x̅) = Σ (xᵢ - x̅)²
In our example:
Sum of squared errors (SS) = Σ (xᵢ - x̅)² =
(0.1097 - 0.1094)² + (0.1095 - 0.1094)² + (0.1099 - 0.1094)² + (0.1089 - 0.1094)²
+ (0.1091 - 0.1094)² + (0.1096 - 0.1094)² + (0.1095 - 0.1094)² = 7.4 × 10⁻⁷
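The sum of squared errors can be reproduced with a few lines of Python. This is a sketch using the rounded mean 0.1094 from the text; the variable names are assumptions made for this example.

```python
# Aflatoxin M1 concentrations (μg/kg) from Table I.1
concentrations = [0.1097, 0.1095, 0.1099, 0.1089, 0.1091, 0.1096, 0.1095]

mean = 0.1094  # rounded mean used in the text

# Squaring makes every error positive, so errors can no longer cancel out.
squared_errors = [(x - mean) ** 2 for x in concentrations]
ss = sum(squared_errors)

print(f"{ss:.1e}")  # 7.4e-07
```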
The sum of
squared errors (SS) is a good measure of the accuracy of our model. However, it
clearly depends on the amount of data that has been collected:
the more data points, the higher the SS. To overcome this problem we could divide SS
by the number of observations N (in
this example the number of samples, N = 7), which gives the average squared error for these particular samples. However, we are not
interested in the error within the 7 samples themselves but in the error for the whole batch of milk, which we are estimating from them. Because the mean was itself calculated from the same data, only N-1 of the deviances are free to vary, so we divide SS by N-1
instead of N and we get the variance.
The variance is defined as follows:
variance (s²) = SS / (N - 1) = Σ(xᵢ - x̅)² / (N - 1)
In our example the
variance is equal to (see Table I.2):
variance (s²) = SS / (N - 1) = Σ(xᵢ - x̅)² / (N - 1)
= 7.4 × 10⁻⁷ / 6 = 1.2 × 10⁻⁷
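The difference between dividing by N and by N-1 can be checked in Python. The sketch below is an illustration, not the author's own calculation: it uses the unrounded mean, so SS comes out slightly smaller than 7.4 × 10⁻⁷, but the variance still rounds to 1.2 × 10⁻⁷. The standard-library function statistics.variance uses the N-1 divisor; the other variable names are assumptions made for this example.

```python
import math
import statistics

# Aflatoxin M1 concentrations (μg/kg) from Table I.1
concentrations = [0.1097, 0.1095, 0.1099, 0.1089, 0.1091, 0.1096, 0.1095]

n = len(concentrations)
mean = statistics.mean(concentrations)             # unrounded mean
ss = sum((x - mean) ** 2 for x in concentrations)  # sum of squared errors

sample_variance = ss / (n - 1)   # divisor N - 1: estimate for the whole batch
divided_by_n = ss / n            # divisor N: describes these 7 samples only

print(f"{sample_variance:.1e}")  # 1.2e-07
print(f"{divided_by_n:.1e}")     # 1.0e-07

# The library function agrees with the manual N - 1 calculation.
print(math.isclose(sample_variance, statistics.variance(concentrations),
                   rel_tol=1e-6))  # True
```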
The variance is
therefore the average squared error between the mean concentration of aflatoxin M1
calculated from the 7 samples and the concentration measured in each
individual sample.
The only problem
with the variance as a measure is that it is expressed in squared units
(here, (μg/kg)²). For this reason we usually take the square root of the variance;
this measure is known as the standard
deviation (s).
s = √[Σ(xᵢ - x̅)² / (N - 1)]
In our example the standard deviation is equal to:
s = √[Σ(xᵢ - x̅)² / (N - 1)] = √(7.4 × 10⁻⁷ / 6)
= 3.5 × 10⁻⁴ μg/kg
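The whole chain from mean to standard deviation can be collected into one small function. This is a minimal sketch; the function name sample_standard_deviation is hypothetical and chosen for this example, and the result can be compared with the standard-library function statistics.stdev, which also uses the N-1 divisor.

```python
import math
import statistics


def sample_standard_deviation(values):
    """Standard deviation with the N - 1 divisor, built up step by step."""
    n = len(values)
    mean = sum(values) / n                     # the model
    ss = sum((x - mean) ** 2 for x in values)  # sum of squared errors
    variance = ss / (n - 1)                    # average squared error
    return math.sqrt(variance)                 # back to the original units


# Aflatoxin M1 concentrations (μg/kg) from Table I.1
concentrations = [0.1097, 0.1095, 0.1099, 0.1089, 0.1091, 0.1096, 0.1095]

s = sample_standard_deviation(concentrations)
print(f"{s:.1e}")  # 3.5e-04
```

Because the standard deviation is in the same units as the data (μg/kg), it can be compared directly with the mean of about 0.1094 μg/kg.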