writes

Quote:

>I'm struggling with an awk script that I want to use to calculate some

>basic statistics like weighted means and variances. The script should

>be able to calculate means and variances for each unique group within an

>array and also calculate the same stats for the entire data set. My

>problem has been looping through the data properly. The attached script

>properly calculates the weighted means for each group and the total data

>set, but I am screwing up with the variance calculation.

>The variance calculation can only be made after the means for each group

>and the total population have been made. Then the variance calculation

>should be: the sum of the squares for each value minus the mean * the

>weight of the sample. This sum is then divided by the sum of the

>weights.

>Here is my script:

>{

>if ($2 > 0) {

> item = $1 # Column no. of item

> weight = $2 # Column no. of weight

> value = $3 # Column no. of value

> itm[item]++ # Count of items by type

> count++ # Count of all items

> wt[item] = $2 # Weight of items by type

> iwt[item] += $2 # Sum of weight by item

> val[item] = $3 # Value of item by type

> sumwt[item] += $2 # Sum of weights by item

> totwt += $2 # Total weight of all items

># Sum of value*weight

># ====================

> valwt[item] += $2*$3

># Weighted mean by item

># =====================

> imean[item] = valwt[item] / iwt[item]

># Global weighted mean

># =====================

> gwtval += $2 * $3

> gmean = gwtval / totwt

> }

> }

>{

>if ($1 > 0) {

> for ( i = 1; i <= NR; i++)

># Variance by item

># =====================

> tmpvar[item] += (val[item] - imean[item])^2 * wt[item]

> ivar[item] = tmpvar[item] / iwt[item]

># Global variance

># =====================

> tmp_var += ($3 - gmean)^2 * $2

> gvar = tmp_var / totwt

> }

>}

>END {

> print "Item Number Mean Variance"

> print " "

> for ( each in itm )

> printf "%-15s %8.0f %10.3f %10.4f \n", each,itm[each],imean[each],\

> ivar[each] | "sort"

> print " "

> printf "%-15s %8.0f %10.3f %10.4f \n", "Total", count, gmean,gvar

>}

>A dummy data set could look like this:

>Item Wt. Value

>Apples 50 2.00

>Apples 100 1.00

>Lemons 50 6.00

>Lemons 100 3.00

>Oranges 50 4.00

>Oranges 100 2.00

>Again, my problem is that I don't know how to properly loop through the

>data set twice. I've tried using do and for loops, but have failed

>miserably.

>I would appreciate any help you can give me.

>Any responses can be sent to this forum or at my email address:

>Again, thanks in advance.

>Mike Lechner

Hi Mike,

It is possible to rearrange the equations for weighted mean and weighted

variance so that you can calculate the variance without knowing the mean

in advance. To do this you need to calculate some summations as you go

along but you only have to read the data once. This program shows you

how it's done.

#!gawk -f
#statistics
NF==3 {
    c[$1]++              # count
    w[$1]+=$2            # sum(weight)
    wx[$1]+=$2*$3        # sum(weight*value)
    wxx[$1]+=$2*$3*$3    # sum(weight*value^2)
}
END {
    print "Fruit Statistics\n"
    printf "%10s %5s %9s %9s\n", "item","count","mean","variance"
    fmt="%10s %5d %9.4f %9.4f\n"
    for (item in c) {
        m=wx[item]/w[item]                  # weighted mean
        v=(wxx[item]-wx[item]*m)/w[item]    # weighted variance
        printf fmt, item,c[item],m,v
        tc+=c[item]        # total count
        tw+=w[item]        # total sum(weight)
        twx+=wx[item]      # total sum(weight*value)
        twxx+=wxx[item]    # total sum(weight*value^2)
    }
    tm=twx/tw              # total weighted mean
    tv=(twxx-twx*tm)/tw    # total weighted variance
    print ""
    printf fmt, "Total",tc,tm,tv
}

I used your data and got this result:-

Fruit Statistics

      item count      mean  variance
   Oranges     2    2.6667    0.8889
    Lemons     2    4.0000    2.0000
    Apples     2    1.3333    0.2222

     Total     6    2.6667    2.2222

The definitions I have used are:-

mean = sum( weight * value ) / sum( weight )

variance = sum( weight * (value - mean) ^ 2 ) / sum( weight )

If all the weights were 1 these would reduce to the usual definitions of

population mean and variance.
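
Since the original question was how to go through the data twice, here is a minimal sketch of a direct two-pass version of the definitions above, which can also serve as a cross-check. It assumes the data lines have exactly three fields (item, weight, value) and that the "Item Wt. Value" header line is not in the input. The function name two_pass_stats and the array names are made up for this illustration. The idea is to store every record in arrays while reading, then loop over the stored records in END, by which time the per-item means are known.

```sh
#!/bin/sh
# Two-pass weighted mean and variance per item (illustrative sketch).
# Reads "item weight value" lines from stdin or from the named files.
two_pass_stats() {
    awk '
    NF == 3 {
        n++                     # number of records stored so far
        item[n] = $1            # keep each record for the second pass
        wt[n]   = $2
        val[n]  = $3
        sw[$1]  += $2           # sum(weight) per item
        swx[$1] += $2 * $3      # sum(weight*value) per item
    }
    END {
        # Second pass: the means are now known, so use the definition directly.
        for (i = 1; i <= n; i++) {
            k = item[i]
            d = val[i] - swx[k] / sw[k]     # value - weighted mean
            ss[k] += wt[i] * d * d          # sum(weight*(value-mean)^2)
        }
        for (k in sw)
            printf "%s %.4f %.4f\n", k, swx[k] / sw[k], ss[k] / sw[k]
    }' "$@" | sort
}

# Usage with the dummy data from the question (header line left out):
two_pass_stats <<'EOF'
Apples 50 2.00
Apples 100 1.00
Lemons 50 6.00
Lemons 100 3.00
Oranges 50 4.00
Oranges 100 2.00
EOF
```

On the dummy data this should give the same means and variances per item as the one-pass program. The cost is that every record is held in memory until END, which the one-pass form avoids.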

I rearranged the equation for variance as:-

variance = (sum(weight*value^2) - mean*sum(weight*value)) / sum(weight)

Note that, in this form of the equation, the mean does not appear inside

a summation so you don't need to know the mean while doing the

summations.

This rearrangement is not difficult to verify.
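
For anyone who wants to check it, the steps can be written out as below. The symbols are introduced here only for illustration: w_i and x_i stand for the weight and value of record i, and m for the weighted mean, so that m * sum(weight) = sum(weight*value).

```latex
\begin{align*}
\sum_i w_i (x_i - m)^2
  &= \sum_i w_i x_i^2 \;-\; 2m \sum_i w_i x_i \;+\; m^2 \sum_i w_i \\
  &= \sum_i w_i x_i^2 \;-\; 2m \sum_i w_i x_i \;+\; m \sum_i w_i x_i
     && \text{since } m \sum_i w_i = \sum_i w_i x_i \\
  &= \sum_i w_i x_i^2 \;-\; m \sum_i w_i x_i
\end{align*}
```

Dividing both sides by sum(weight) gives the rearranged form of the variance shown above.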

Hope this helps

--

Alan Linton