Variance calculation

I'm struggling with an awk script that I want to use to calculate some
basic statistics like weighted means and variances.  The script should
be able to calculate means and variances for each unique group within an
array and also calculate the same stats for the entire data set.  My
problem has been looping through the data properly.  The attached script
properly calculates the weighted means for each group and the total data
set, but I am screwing up with the variance calculation.

The variance calculation can only be made after the means for each group
and the total population have been computed.  Then the variance should
be: the sum, over all samples, of the weight times the squared
difference between the value and the mean.  This sum is then divided by
the sum of the weights.

Here is my script:

{
if ($2 > 0) {

   item   = $1        # Column no. of item
   weight = $2        # Column no. of weight
   value  = $3        # Column no. of value
   itm[item]++        # Count of items by type
   count++            # Count of all items

   wt[item] = $2      # Weight of items by type
   iwt[item] += $2    # Sum of weight by item
   val[item] = $3     # Value of item by type
   sumwt[item] += $2  # Sum of weights by item
   totwt += $2        # Total weight of all items

#  Sum of value*weight
#  ====================
   valwt[item] += $2*$3

#  Weighted mean by item
#  =====================
   imean[item] = valwt[item] / iwt[item]

#  Global weighted mean
#  =====================

   gwtval += $2 * $3
   gmean = gwtval / totwt

   }
}

{
if ($1 > 0) {
   for ( i = 1; i <= NR; i++)

#  Variance by item
#  =====================

   tmpvar[item] += (val[item] - imean[item])^2 * wt[item]
   ivar[item] = tmpvar[item] / iwt[item]

#  Global variance
#  =====================

   tmp_var +=  ($3 - gmean)^2 * $2
   gvar = tmp_var / totwt

      }
}

END  {

  print "Item                Number     Mean   Variance"
  print "  "

  for ( each in itm )

  printf "%-15s %8.0f %10.3f %10.4f \n", each,itm[each],imean[each],\
  ivar[each] | "sort"
  print "  "
  printf "%-15s %8.0f %10.3f %10.4f \n", "Total", count, gmean,gvar
}

A dummy data set could look like this:

Item       Wt.    Value

Apples      50    2.00
Apples     100    1.00
Lemons      50    6.00
Lemons     100    3.00
Oranges     50    4.00
Oranges    100    2.00

Again, my problem is that I don't know how to properly loop through the
data set twice.  I've tried using do and for loops, but have failed
miserably.
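[For reference, the "two loops" can be simulated in one reading by buffering the rows in arrays and doing the second loop in END. A minimal sketch, with the dummy data inlined and the same item/weight/value column layout assumed:]

```shell
# Buffer every row in the main input loop; compute group means first,
# then a second loop over the stored rows accumulates the weighted
# squared deviations.
stats=$(printf '%s\n' \
  'Apples 50 2.00'  'Apples 100 1.00' \
  'Lemons 50 6.00'  'Lemons 100 3.00' \
  'Oranges 50 4.00' 'Oranges 100 2.00' |
awk '
$2 > 0 {
    n++
    item[n] = $1; wt[n] = $2; val[n] = $3
    sumw[$1]  += $2          # per-group sum(weight)
    sumwv[$1] += $2 * $3     # per-group sum(weight*value)
}
END {
    for (g in sumw) mean[g] = sumwv[g] / sumw[g]
    for (i = 1; i <= n; i++) {          # second "pass" over stored rows
        g = item[i]
        ss[g] += wt[i] * (val[i] - mean[g])^2
    }
    for (g in sumw)
        printf "%-8s %10.4f %10.4f\n", g, mean[g], ss[g] / sumw[g]
}' | sort)
echo "$stats"
# Apples: mean 1.3333 var 0.2222; Lemons: 4.0000 2.0000; Oranges: 2.6667 0.8889
```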

Any responses can be sent to this forum or to my email address:

Mike Lechner

Sent via Deja.com http://www.deja.com/

Sat, 08 Mar 2003 08:13:41 GMT
Variance calculation


>I'm struggling with an awk script that I want to use to calculate some
>basic statistics like weighted means and variances.  The script should
>be able to calculate means and variances for each unique group within
>an array and also calculate the same stats for the entire data set.  My
>problem has been looping through the data properly.  The attached
>script properly calculates the weighted means for each group and the
>total data set, but I am screwing up with the variance calculation.

<snip>

awk reads its data files once, in sequential order. You could read
through a file twice by putting its filename on the command line twice
and adding logic to distinguish between the first and second pass. Or
you could perform explicit file input in the BEGIN block, thus avoiding
technically unnecessary comparisons. But the best approach may be
collecting data into arrays in the implicit input loop, then performing
calculations in the END action. A hybrid would be performing one
calculation, as well as storing in arrays, during the implicit input
loop, then performing the other calculation(s) in the END action.

awk '{ mean += d[++n] = $1 }
END {
    mean /= n
    for (i = 1; i <= n; ++i) var += (d[i] - mean) ^ 2
    var /= (n - 1)
    print mean, var
}' input

[Note: sample rather than population variances above.]

Sent via Deja.com http://www.deja.com/

Sat, 08 Mar 2003 09:25:45 GMT
Variance calculation
Hi Mike...
you need to re-work the calculation to perform it in a single pass
(Harlan rulz! :)  Here is a chunk of code I use for a simple
regression computation (which includes the values you are after),
on groups within a file.
The input file MUST be sorted by the grouping (key) field.
regression data is dumped at the end of each group, and at the end
of the input file.
oh.. and this one is driven using -v parameter passing, so check the
BEGIN section for the parameter checking and allocation code.
(check for news client line wraps before use)

Enjoy,
Jennifer

--  begin sample code --
# Copyright University of Western Australia
# Permission for use of this code is granted subject to GNU copyleft provisions
#
# routine to calculate a simple 2 variable regression
# output is the Alpha and Beta coefficients for each group
# input is a file of comma separated values, 3 of which are used
# group_col - the column containing a group identifier
# x_col - first variable
# y_col - second variable
# missing_value_flag - also can define a "missing value" figure which
#   causes a case to be skipped if either variable contains the
#   missing value marker
# verbose - if specified causes the internal parameters to be printed
BEGIN {
    FS=","; OFS=","
    # defaults
    mvf=-1
    gfield=0
    xfield=2
    yfield=3
    # check for specified missing value marker
    if (missing_value_flag != "") { mvf = missing_value_flag }
    # check for user assigned columns
    if (group_col != "") {
        if (group_col+0 == 0) {
            print "group_col invalid, must be digit >0 or leave blank for whole file" > "/dev/stderr"
            exit
        }
        gfield = group_col+0
    }
    if (x_col != "") {
        if (x_col+0 == 0) {
            print "x_col invalid, must be digit >0" > "/dev/stderr"
            exit
        }
        xfield = x_col+0
    }
    if (y_col != "") {
        if (y_col+0 == 0) {
            print "y_col invalid, must be digit >0" > "/dev/stderr"
            exit
        }
        yfield = y_col+0
    }
    # check if they asked for the internal parameter dump
    if (verbose != "") { vbose=1 }

    print "doing simple regression on columns " xfield " and " yfield " grouped by " gfield > "/dev/stderr"
    if (mvf != "") { print "Ignoring cases where either value field is \"" mvf "\"" > "/dev/stderr" }
    if (vbose == 1) { print "internal parameters x,x2,xy,y,y2 will be printed" > "/dev/stderr" }
} # end of setup, begin processing

# initialise the group key so the first record does not trigger a
# spurious empty group break
NR == 1 && gfield != 0 { oldg = $(gfield) }

gfield != 0 && $(gfield) != oldg {
    beta="NA"
    alpha="NA"
    if (n > 0) {
        invn = 1/n
        denom = x2 - invn*(x*x)
        if (denom != 0) {
            beta = (xy - (invn*x*y)) / denom
            alpha = y/n - beta*(x/n)
        }
    }
    if (vbose != 1) {
        print oldg "," n "," beta "," alpha
    } else {
        print oldg "," n "," beta "," alpha "," x "," x2 "," xy "," y "," y2
    }
    oldg = $(gfield)
    n=0; x=0; x2=0; y=0; y2=0; xy=0
}

$(xfield) != mvf && $(yfield) != mvf {
    n++
    x  += $(xfield)
    y  += $(yfield)
    x2 += $(xfield)*$(xfield)
    y2 += $(yfield)*$(yfield)
    xy += $(yfield)*$(xfield)
#   print oldg "," n "," x "," x2 "," xy "," y "," y2 ":" $0
}

END {
    beta="NA"
    alpha="NA"
    if (n > 0) {
        invn = 1/n
        denom = x2 - invn*(x*x)
        if (denom != 0) {
            beta = (xy - (invn*x*y)) / denom
            alpha = y/n - beta*(x/n)
        }
    }
    if (vbose != 1) {
        print oldg "," n "," beta "," alpha
    } else {
        print oldg "," n "," beta "," alpha "," x "," x2 "," xy "," y "," y2
    }
}
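[As a quick sanity check of the accumulator formulas the script uses (beta = (sum(xy) - sum(x)sum(y)/n) / (sum(x2) - sum(x)^2/n), alpha = mean(y) - beta*mean(x)), here is a self-contained run on invented points that lie exactly on y = 2x + 1, which should recover beta = 2 and alpha = 1:]

```shell
# Single-group regression using the same sums as the script above.
reg=$(printf 'g,1,3\ng,2,5\ng,3,7\n' |
awk -F, '
{
    n++
    x  += $2; y += $3
    x2 += $2 * $2
    xy += $2 * $3
}
END {
    beta  = (xy - x * y / n) / (x2 - x * x / n)
    alpha = y / n - beta * (x / n)
    printf "beta=%g alpha=%g\n", beta, alpha
}')
echo "$reg"
# prints: beta=2 alpha=1
```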

Sat, 08 Mar 2003 09:58:12 GMT
Variance calculation

writes

>I'm struggling with an awk script that I want to use to calculate some
>basic statistics like weighted means and variances.  The script should
>be able to calculate means and variances for each unique group within an
>array and also calculate the same stats for the entire data set.  My
>problem has been looping through the data properly.

<snip>

>Again, my problem is that I don't know how to properly loop through the
>data set twice.  I've tried using do and for loops, but have failed
>miserably.

Hi Mike,

It is possible to rearrange the equations for weighted mean and weighted
variance so that you can calculate the variance without knowing the mean
in advance. To do this you need to calculate some summations as you go
along but you only have to read the data once. This program shows you
how it's done.

#!gawk -f
#statistics

NF==3 {
  c[$1]++             # count
  w[$1]+=$2           # sum(weight)
  wx[$1]+=$2*$3       # sum(weight*value)
  wxx[$1]+=$2*$3*$3   # sum(weight*value^2)
}

END {
  print "Fruit Statistics\n"
  printf "%10s %5s %9s %9s\n", "item","count","mean","variance"
  fmt="%10s %5d %9.4f %9.4f\n"
  for (item in c) {
    m=wx[item]/w[item]               # weighted mean
    v=(wxx[item]-wx[item]*m)/w[item] # weighted variance
    printf fmt, item,c[item],m,v
    tc+=c[item]       # total count
    tw+=w[item]       # total sum(weight)
    twx+=wx[item]     # total sum(weight*value)
    twxx+=wxx[item]   # total sum(weight*value^2)
  }
  tm=twx/tw           # total weighted mean
  tv=(twxx-twx*tm)/tw # total weighted variance
  print ""
  printf fmt, "Total",tc,tm,tv
}

I used your data and got this result:-

Fruit Statistics

      item count      mean  variance
   Oranges     2    2.6667    0.8889
    Lemons     2    4.0000    2.0000
    Apples     2    1.3333    0.2222

     Total     6    2.6667    2.2222

The definitions I have used are:-

mean     = sum( weight * value ) / sum( weight )

variance = sum( weight * (value - mean) ^ 2 ) / sum( weight )

If all the weights were 1 these would reduce to the usual definitions of
population mean and variance.

I rearranged the equation for variance as:-

variance = (sum(weight*value^2) - mean*sum(weight*value)) / sum(weight)

Note that, in this form of the equation, the mean does not appear inside
a summation so you don't need to know the mean while doing the
summations.

This rearrangement is not difficult to verify.
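[For the record, writing m = sum(w*v)/sum(w), the verification is a single expansion (all sums over the samples in a group):]

```latex
\sum w (v - m)^2
  = \sum w v^2 - 2m \sum w v + m^2 \sum w
  = \sum w v^2 - 2m \sum w v + m \sum w v
  = \sum w v^2 - m \sum w v
```

since m^2 sum(w) = m * (sum(w v)/sum(w)) * sum(w) = m sum(w v). Dividing through by sum(w) gives the rearranged formula above.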

Hope this helps
--
Alan Linton

Sat, 08 Mar 2003 03:00:00 GMT
Variance calculation

<snip>

>It is possible to rearrange the equations for weighted mean and
>weighted variance so that you can calculate the variance without
>knowing the mean in advance. To do this you need to calculate some
>summations as you go along but you only have to read the data once.

<snip>

>   wxx[$1]+=$2*$3*$3   # sum(weight*value^2)

<snip>

>   for (item in c) {
>     m=wx[item]/w[item] # weighted mean
>     v=(wxx[item]-wx[item]*m)/w[item] # weighted variance
>     printf fmt, item,c[item],m,v
>     tc+=c[item]       # total count
>     tw+=w[item]       # total sum(weight)
>     twx+=wx[item]     # total sum(weight*value)
>     twxx+=wxx[item]   # total sum(weight*value^2)
>   }
>   tm=twx/tw           # total weighted mean
>   tv=(twxx-twx*tm)/tw # total weighted variance

<snip>

>variance = (sum(weight*value^2) - mean*sum(weight*value)) / sum(weight)

>Note that, in this form of the equation, the mean does not appear
>inside a summation so you don't need to know the mean while doing the
>summations.

<snip>

While _mathematically_  E[(x - E[x])^2]  ==  E[x^2] - E[x]^2, it's
possible to maintain greater precision with the former than with the
latter. In the example given, maybe no big deal. With hundreds or
thousands of observations, it definitely does matter.
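[The cancellation is easy to demonstrate: shift a small-variance sample by a large constant and the one-pass E[x^2] - E[x]^2 formula subtracts two nearly equal huge numbers. The data values below are invented for the demonstration; the true population variance of {1e9+1, 1e9+2, 1e9+3} is 2/3, but with IEEE doubles the squares lose their low-order digits and the naive figure collapses:]

```shell
# Naive one-pass variance vs. two-pass, on data offset by 1e9.
prec=$(printf '%s\n' 1000000001 1000000002 1000000003 |
awk '
{ d[++n] = $1; s += $1; s2 += $1 * $1 }
END {
    m = s / n
    naive = s2 / n - m * m              # E[x^2] - E[x]^2
    for (i = 1; i <= n; i++) tp += (d[i] - m)^2
    twopass = tp / n
    printf "naive=%.6f twopass=%.6f\n", naive, twopass
}')
echo "$prec"
# prints: naive=0.000000 twopass=0.666667
```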

Sent via Deja.com http://www.deja.com/

Sat, 08 Mar 2003 03:00:00 GMT
Variance calculation

writes


[snip]
>While _mathematically_  E[(x - E[x])^2]  ==  E[x^2] - E[x]^2, it's
>possible to maintain greater precision with the former than with the
>latter. In the example given, maybe no big deal. With hundreds or
>thousands of observations, it definitely does matter.

>Sent via Deja.com http://www.deja.com/

I knew someone would say this so why didn't I do it right first time?

The accumulation of error can be reduced without losing the advantage of
scanning the data only once if you know some approximation to the mean.

Suppose the true mean is m and an approximate mean is m0.

Calculate:-
w   = sum(weight)
wx  = sum(weight*(value-m0))
wxx = sum(weight*(value-m0)^2)

Then:-
m = m0 + wx / w
v = wxx / w - (m - m0)^2

The program I posted yesterday effectively used m0=0. Any value of m0 is
mathematically valid but m0=m gives minimum accumulation of truncation
error. Unfortunately you don't know m in advance.

Even an approximate value of m0 works quite well so you could use the
first value read in as m0 and get a significant reduction in accumulated
error. The relative error in the variance will be greatest when the
variance is small. Fortunately this is just where using the first value
as m0 is most effective because the first value must be close to the
mean for the variance to be small.
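[A small unweighted sketch of this (all weights 1, values invented): with m0 taken as the first value read, even data offset by 1e9 comes out right in a single pass.]

```shell
# One-pass variance with provisional mean m0 = first value.
shifted=$(printf '%s\n' 1000000001 1000000002 1000000003 |
awk '
NR == 1 { m0 = $1 }
{ n++; d = $1 - m0; s += d; s2 += d * d }
END {
    m = m0 + s / n            # m = m0 + wx / w  (all weights 1)
    v = s2 / n - (m - m0)^2   # v = wxx / w - (m - m0)^2
    printf "mean=%.1f var=%.6f\n", m, v
}')
echo "$shifted"
# prints: mean=1000000002.0 var=0.666667
```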

This method of summations has an advantage that you can add more values,
update the summations and recalculate the mean and variance very
cheaply. You can also remove values from the data set and recalculate
mean and variance. You can even add or remove large blocks of data just
by adding or subtracting the corresponding summations and recalculating
mean and variance. This suggests the following strategy to reduce
truncation error for very large data sets.

read a sample of the data set accumulating w, wx and wxx
w   = sum(weight)
wx  = sum(weight*(value-m0))
wxx = sum(weight*(value-m0)^2)
calculate m and v for the data so far
m   = m0 + wx / w
v   = wxx / w - (m-m0)^2
adjust wx and wxx as if you had used m rather than m0 all along
wx  = 0
wxx = w * v
m0  = m
continue accumulating w, wx and wxx for the rest of the data
recalculate m and v at the end
m   = m0 + wx / w
v   = wxx / w - (m-m0)^2

The motivation here is to get a good approximation to the mean from a
sample of the data and then to use it to keep truncation error small.

I'm making this up as I go along so I hope it's right. I am too tired to
test it at present.
--
Alan Linton

Sun, 09 Mar 2003 03:00:00 GMT
Variance calculation


>[snip]
>>While _mathematically_  E[(x - E[x])^2]  ==  E[x^2] - E[x]^2, it's
>>possible to maintain greater precision with the former than with the
>>latter. In the example given, maybe no big deal. With hundreds or
>>thousands of observations, it definitely does matter.

>I knew someone would say this so why didn't I do it right first time?

>The accumulation of error can be reduced without losing the advantage
>of scanning the data only once if you know some approximation to the

<snip>

>Even an approximate value of m0 works quite well so you could use the
>first value read in as m0 and get a significant reduction in
>accumulated error. The relative error in the variance will be greatest
>when the variance is small. Fortunately this is just where using the
>first value as m0 is most effective because the first value must be
>close to the mean for the variance to be small.

<snip>

True if the dataset were a single undifferentiated population. However,
the OP's problem involves many different classes, so an m0 that's
appropriate for one class won't necessarily be appropriate for other
classes. It may be possible to structure the sums to use different m0's
for each sample, but at some point the added complexity makes the
simple two-pass approach (one i/o pass, one iteration through stored
observations) attractive.

Sent via Deja.com http://www.deja.com/

Mon, 10 Mar 2003 03:00:00 GMT
Variance calculation

writes
[snip]


>>Even an approximate value of m0 works quite well so you could use the
>>first value read in as m0 and get a significant reduction in
>>accumulated error. The relative error in the variance will be greatest
>>when the variance is small. Fortunately this is just where using the
>>first value as m0 is most effective because the first value must be
>>close to the mean for the variance to be small.

><snip>

>True if the dataset were a single undifferentiated population. However,
>the OP's problem involves many different classes, so an m0 that's
>appropriate for one class won't necessarily be appropriate for other
>classes. It may be possible to structure the sums to use different m0's
>for each sample, but at some point the added complexity makes the
>simple two-pass approach (one i/o pass, one iteration through stored
>observations) attractive.

>Sent via Deja.com http://www.deja.com/

There is not that much added complexity. Here's the program.

#!gawk -f
#statistics : reduced truncation error version

NF==3 {
  c[$1]++             # count
  if (c[$1]==1) m0[$1]=$3
  $3-=m0[$1]
  w[$1]+=$2           # sum(weight)
  wx[$1]+=$2*$3       # sum(weight*(value-m0))
  wxx[$1]+=$2*$3*$3   # sum(weight*(value-m0)^2)
}

END {
  print "Statistics\n"
  printf "%10s %5s %9s %9s\n", "item","count","mean","variance"
  fmt="%10s %5d %9.4f %9.4f\n"
  for (item in c) {
    m=wx[item]/w[item]+m0[item]        # weighted mean
    v=wxx[item]/w[item]-(m-m0[item])^2 # weighted variance
    printf fmt, item,c[item],m,v
    tc+=c[item]            # total count
    tw+=w[item]            # total sum(weight)
    twx+=w[item]*m         # total sum(weight*value)
    twxx+=w[item]*(v+m*m)  # total sum(weight*value^2)
  }
  tm=twx/tw                # total weighted mean
  tv=(twxx-twx*tm)/tw      # total weighted variance
  print ""
  printf fmt, "Total",tc,tm,tv
}

--
Alan Linton

Mon, 10 Mar 2003 03:00:00 GMT
