Statistics| Statistical Treatment| Data Analysis in Life Sciences | Chemistry Net




The main objectives of applying statistics in science, and in chemical analysis in particular, are to determine the best value from a series of analytical results obtained for a particular sample and to give some indication of the reliability of the analysis.
As scientists, we are interested in results that apply to an entire population of people, materials or chemical substances. For example, biologists might be interested in processes that occur in all cells, chemists may want to determine the concentration of an analyte X in an entire batch of a food product such as milk, and economists may want to build models that apply to all pensions.
Analyzing the entire population to draw conclusions about certain variables would usually be extremely expensive and impractical.
Therefore, in most cases we collect data from a small subset of the population (known as a sample) and use these data to infer things about the population as a whole. The statistics calculated from this small sample serve as the “model” for the entire population.
The bigger the sample, the more likely it is to reflect the whole population. If we take several random samples from the population, each of them will give slightly different results. However, on average, results from large samples should be fairly similar to one another.
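The effect of sample size can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population is hypothetical (100,000 values drawn from a normal distribution with arbitrary parameters), and the helper name `spread_of_sample_means` is made up for this example. It draws many random samples of two different sizes and compares how much the sample means scatter.

```python
import random
from statistics import pstdev

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population (an assumption, not data from the text):
# 100,000 values scattered around 0.11 with a spread of 0.001.
population = [random.gauss(0.11, 0.001) for _ in range(100_000)]


def spread_of_sample_means(sample_size, repeats=200):
    """Draw `repeats` random samples and return the scatter of their means."""
    means = []
    for _ in range(repeats):
        sample = random.sample(population, sample_size)
        means.append(sum(sample) / sample_size)
    return pstdev(means)


spread_small = spread_of_sample_means(5)    # small samples
spread_large = spread_of_sample_means(500)  # large samples

# Means of large samples agree with each other much more closely.
print(spread_large < spread_small)  # True
```

The means of the large samples cluster more tightly than those of the small samples, which is what "fairly similar results" means in practice.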


Simple Statistical Models – The mean, sum of squares, variance and standard deviations

One of the simplest models used in statistics is the mean. It can be calculated for any data set and represents a summary of data. The mean is a hypothetical value; it does not have to be a value that is actually observed in the data set. 
For example, if we took 5 university professors and counted the number of graduate students each of them supervised, we might find the following: 3, 3, 3, 4, 1. The mean number of graduate students per professor is:

Mean = (3 + 3 + 3 + 4 + 1) / 5 = 14 / 5 = 2.8

However, it is impossible to supervise 2.8 students, so the mean is a hypothetical value. As such, the mean is a model created to summarize our data.
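The same calculation can be sketched in a few lines of Python. The variable names are chosen for this illustration only; the data are the five values quoted above.

```python
# Number of graduate students supervised by each of the 5 professors.
students = [3, 3, 3, 4, 1]

# Add up all the values first, then divide by the number of observations.
mean_students = sum(students) / len(students)

print(mean_students)              # 2.8
print(mean_students in students)  # False: the mean is not an observed value
```

The second line of output confirms the point made above: the mean summarizes the data without being one of the data points.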

Let us consider another example. Suppose we would like to determine the concentration of aflatoxin M1, a toxic substance, in a batch of milk. Seven samples are collected from the batch and analyzed by HPLC with fluorescence detection. The results are given in Table I.1:



Table I.1: Concentration of Aflatoxin M1 (μg/kg) in 7 samples collected from a batch of milk

Sample Number    Aflatoxin M1 Concentration (μg/kg)
1                0.1097
2                0.1095
3                0.1099
4                0.1089
5                0.1091
6                0.1096
7                0.1095





The mean concentration value of Aflatoxin M1 in the above samples is calculated as follows:

Mean = (0.1097 + 0.1095 + 0.1099 + 0.1089 + 0.1091 + 0.1096 + 0.1095) / 7 = 0.7662 / 7 = 0.10946

The calculations that follow use this mean shortened to four decimal places, 0.1094.

As before, the mean is a model created to summarize our data.
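As a quick check, the mean of the seven results in Table I.1 can be computed with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative only.

```python
from statistics import mean

# Aflatoxin M1 concentrations (μg/kg) of the 7 milk samples in Table I.1.
concentrations = [0.1097, 0.1095, 0.1099, 0.1089, 0.1091, 0.1096, 0.1095]

mean_conc = mean(concentrations)

# Unrounded value; the text works with the first four decimals (0.1094).
print(f"{mean_conc:.5f}")  # 0.10946
```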


How can we determine if the mean is an accurate model?

We can determine whether this is an accurate model by looking at the difference between the data we observed and the model fitted (i.e., the mean). Graph #1 shows the assay results obtained for the concentration of aflatoxin M1 in each sample, together with the mean value calculated earlier. The line representing the mean (red line) can be thought of as our model and the dots are the actual observed data. The distance (deviance) of each dot from the mean is the error in our model. We can calculate the magnitude of these deviances by simply subtracting the mean value $\bar{x}$ from each of the observed values $x_i$.

Graph #1: Concentration of aflatoxin M1 in milk samples vs. sample number



We can judge how accurate our model is by adding up all the deviances, which gives an estimate of the total error:



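$$
\begin{aligned}
\text{Total error} &= \text{sum of deviances} = \sum (x_i - \bar{x}) \\
&= (0.1097 - 0.1094) + (0.1095 - 0.1094) + (0.1099 - 0.1094) + (0.1089 - 0.1094) \\
&\quad + (0.1091 - 0.1094) + (0.1096 - 0.1094) + (0.1095 - 0.1094) \\
&= 0.0004 \approx 0
\end{aligned}
$$

The small remainder of 0.0004 arises only because the mean was shortened to 0.1094; with the unrounded mean the deviances sum to exactly zero.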

Table I.2: Calculation of the sum of deviances $\sum (x_i - \bar{x})$ (Column 4) and the sum of squares (Column 6) of the concentration of aflatoxin M1

Taken at face value, the above result says that there is no total error between our model and the observed data, so the mean would be a perfect representation of our data. This is misleading: the positive and negative deviances simply cancel each other out. In fact, because $\sum (x_i - \bar{x}) = \sum x_i - N\bar{x} = 0$, where N is the number of observations, the sum of deviances about the mean is zero for any data set.
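The cancellation of positive and negative deviances can be checked numerically. The sketch below is an illustration only (the variable names are not from the original); it computes the deviances twice, once about the four-decimal mean 0.1094 used in the text and once about the unrounded mean.

```python
# Aflatoxin M1 concentrations (μg/kg) from Table I.1.
concentrations = [0.1097, 0.1095, 0.1099, 0.1089, 0.1091, 0.1096, 0.1095]

mean_text = 0.1094                                       # value used in the text
mean_exact = sum(concentrations) / len(concentrations)   # unrounded mean

# Deviance of each observation from the model (the mean).
dev_text = [x - mean_text for x in concentrations]
dev_exact = [x - mean_exact for x in concentrations]

print(f"{sum(dev_text):.4f}")        # 0.0004, as in the text
print(abs(sum(dev_exact)) < 1e-12)   # True: zero up to floating-point noise
```

Both positive and negative deviances are present in either list, which is why their plain sum says nothing about how well the mean fits.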
Therefore, a better way to express the total error is to square each error and then add the squares together. This gives the so-called sum of squared errors (SS):

$$\text{Sum of squared errors (SS)} = \sum (x_i - \bar{x})(x_i - \bar{x}) = \sum (x_i - \bar{x})^2$$

In our example:

$$
\begin{aligned}
\text{SS} &= \sum (x_i - \bar{x})^2 \\
&= (0.1097 - 0.1094)^2 + (0.1095 - 0.1094)^2 + (0.1099 - 0.1094)^2 + (0.1089 - 0.1094)^2 \\
&\quad + (0.1091 - 0.1094)^2 + (0.1096 - 0.1094)^2 + (0.1095 - 0.1094)^2 \\
&= 7.4 \times 10^{-7}
\end{aligned}
$$
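The sum of squared errors can be reproduced with a short script. This is a sketch under the assumption that the mean is taken as 0.1094, as in the hand calculation; the helper `sum_of_squares` is a name made up for this example. For comparison, the same quantity is also computed about the unrounded mean.

```python
# Aflatoxin M1 concentrations (μg/kg) from Table I.1.
concentrations = [0.1097, 0.1095, 0.1099, 0.1089, 0.1091, 0.1096, 0.1095]


def sum_of_squares(values, center):
    """Sum of squared deviations of `values` from `center`."""
    return sum((x - center) ** 2 for x in values)


mean_text = 0.1094                                       # value used in the text
mean_exact = sum(concentrations) / len(concentrations)   # unrounded mean

ss_text = sum_of_squares(concentrations, mean_text)
ss_exact = sum_of_squares(concentrations, mean_exact)

print(f"{ss_text:.1e}")   # 7.4e-07, the value quoted in the text
print(f"{ss_exact:.1e}")  # 7.2e-07, using the unrounded mean
```

Unlike the plain sum of deviances, every term here is non-negative, so errors can no longer cancel.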

The sum of squared errors (SS) is a good measure of the accuracy of our model. However, it clearly depends on the amount of data collected: the more data points, the higher the SS. To overcome this problem we could divide SS by the number of observations N (in this example the number of samples, N = 7) to obtain an average error. However, we are not interested in the error within these particular samples of milk but in the error for the whole batch. Because the deviances are measured from the sample mean rather than from the unknown mean of the batch, we divide SS by N − 1 instead of N, and the result is the variance.
The variance is defined as follows:

$$\text{variance } (s^2) = \frac{\text{SS}}{N - 1} = \frac{\sum (x_i - \bar{x})^2}{N - 1}$$


In our example the variance is equal to (see Table I.2):

$$\text{variance } (s^2) = \frac{\text{SS}}{N - 1} = \frac{\sum (x_i - \bar{x})^2}{N - 1} = \frac{7.4 \times 10^{-7}}{6} = 1.2 \times 10^{-7}$$

The variance is therefore the average squared error between the mean concentration of aflatoxin M1 calculated from the 7 samples and the concentration measured in each individual sample.
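The difference between dividing by N and by N − 1 can be seen in a short sketch. The variable names are illustrative; `statistics.variance` from the Python standard library divides by N − 1, so it serves as a cross-check of the manual calculation.

```python
import math
from statistics import variance

# Aflatoxin M1 concentrations (μg/kg) from Table I.1.
concentrations = [0.1097, 0.1095, 0.1099, 0.1089, 0.1091, 0.1096, 0.1095]

n = len(concentrations)
mean_exact = sum(concentrations) / n
ss = sum((x - mean_exact) ** 2 for x in concentrations)

var_sample = ss / (n - 1)  # divide by N - 1: estimate for the whole batch
var_n = ss / n             # divide by N: describes only these 7 samples

print(f"{var_sample:.1e}")  # 1.2e-07
print(f"{var_n:.1e}")       # 1.0e-07

# The standard-library function uses the same N - 1 definition.
print(math.isclose(var_sample, variance(concentrations), rel_tol=1e-9))  # True
```

Dividing by N − 1 always gives a slightly larger value than dividing by N, and the gap shrinks as the number of samples grows.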
The only problem with the variance as a measure is that it is expressed in squared units, here (μg/kg)². For this reason we usually take the square root of the variance, a measure known as the standard deviation (s).

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}}$$

In our example the standard deviation is equal to:

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}} = \sqrt{\frac{7.4 \times 10^{-7}}{6}} = 3.5 \times 10^{-4}\ \mu\text{g/kg}$$
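The standard deviation can be obtained directly with `statistics.stdev`, which uses the N − 1 definition. The sketch below (illustrative names only) compares it with the hand calculation based on SS = 7.4 × 10⁻⁷.

```python
import math
from statistics import stdev

# Aflatoxin M1 concentrations (μg/kg) from Table I.1.
concentrations = [0.1097, 0.1095, 0.1099, 0.1089, 0.1091, 0.1096, 0.1095]

# Sample standard deviation computed from the unrounded mean.
s = stdev(concentrations)

# Hand calculation from the text: square root of SS / (N - 1).
s_text = math.sqrt(7.4e-7 / (len(concentrations) - 1))

print(f"{s:.1e}")       # 3.5e-04
print(f"{s_text:.1e}")  # 3.5e-04
```

To two significant figures both routes agree, and the result is in the same units as the measurements (μg/kg).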

