awk primer - please comment 

Hi,

I asked for help with my first awk script in comp.unix.shell and many
people came up with working sed and awk (and even perl) scripts (see
task below[1]).

This is what I "distilled" from them...

[bounces.awk]
#! /usr/bin/gawk -f

BEGIN { FS  = ";"
        OFS = ";"
        BOUNCES_FILE = "Bounces.txt"

while (getline < BOUNCES_FILE)
       bounces["\"" $0 "\""] }

{ if (NR == 1)
{        print $0, "\"BOUNCE\"" }
  else { print $0, "\"" ($3 in bounces) "\"" }}

Could someone please comment on style/readability/technique and
performance?! (I think there are some "dos and don'ts" in awk.)

Thorsten

[1]
There are two text files:

[Email_Addresses.csv]
"SURNAME";"GIVENNAME";"EMAILADDRESS"
"Surname1";"Givenname1";"Email_Address1"
"Surname2";"Givenname2";"Email_Address2"
"Surname3";"Givenname3";"Email_Address3"
"Surname4";"Givenname4";"Email_Address4"
[...]

[Bounces.txt]
Email_Address1
Email_Address4
[...]

This should be the output:

"SURNAME";"GIVENNAME";"EMAILADDRESS";"BOUNCE"
"Surname1";"Givenname1";"Email_Address1";"1"
"Surname2";"Givenname2";"Email_Address2";"0"
"Surname3";"Givenname3";"Email_Address3";"0"
"Surname4";"Givenname4";"Email_Address4";"1"
[...]

...meaning: starting from row 2, for every value in column 3
("EMAILADDRESS") there should be a ';"1"' appended if the email
address is in "Bounces.txt"; otherwise it should be ';"0"'.



Wed, 09 Feb 2005 07:21:33 GMT  

I'm writing from a news server with a slow feed, so I haven't seen any other
posts yet.  Most things here are matters of personal taste, like indentation and
single-statement braces.

1. Indent nicely:

    BEGIN {
        FS = ";"
        OFS = ";"
        BOUNCES_FILE = "Bounces.txt"
        while (getline < BOUNCES_FILE)
            bounces["\"" $0 "\""]
    }
    {
        if (NR == 1) {
            print $0, "\"BOUNCE\""
        } else {
            print $0, "\"" ($3 in bounces) "\""
        }
    }

2. Remove unnecessary braces.

    {
        if (NR == 1)
            print $0, "\"BOUNCE\""
        else
            print $0, "\"" ($3 in bounces) "\""
    }

3. In some people's eyes, FS and OFS are so tightly coupled that if they are
the same, cascaded assignments are just fine.

        FS = OFS = ";"

4. getline, it was once pointed out to me, is tristate; returning -1 on
error, 0 at eof, and 1 if successful.

        while (getline < BOUNCES_FILE > 0)
            bounces["\"" $0 "\""]

5. Eliminate redundancy in expressions:

        if (NR == 1)
            temp = "BOUNCE"
        else
            temp = $3 in bounces
        print $0, "\"" temp "\""

6. Simple if()s that only assign can be reduced to "?:".

        temp = NR == 1 ? "BOUNCE" : $3 in bounces

7. Don't use temp variables if they are set once above and used once below.

        print $0, "\"" (NR == 1 ? "BOUNCE" : $3 in bounces) "\""

So, you end up with:

    BEGIN {
        FS = OFS = ";"
        BOUNCES_FILE = "Bounces.txt"
        while (getline < BOUNCES_FILE > 0)
            bounces["\"" $0 "\""]
    }
    { print $0, "\"" (NR == 1 ? "BOUNCE" : $3 in bounces) "\"" }

8. From there, you could suck the first-line case up into the BEGIN rule.

    BEGIN {
        FS = OFS = ";"
        BOUNCES_FILE = "Bounces.txt"
        while (getline < BOUNCES_FILE > 0)
            bounces["\"" $0 "\""]
        if (getline > 0)
            print $0, "\"BOUNCE\""
    }
    { print $0, "\"" $3 in bounces "\"" }

"if (getline > 0)" gets the first line of your input file into $0, while
still in the BEGIN block.  Now, you don't have to do an equality test for
each data line.
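
To see the whole thing run, here is a minimal sketch on the sample data from the task description. Assumptions on my part: it is wrapped in a small sh script, it calls plain `awk` rather than gawk, it works in a throwaway temp directory, and it uses the parenthesized `($3 in bounces)` and `(getline < BOUNCES_FILE) > 0` forms to keep the grouping explicit.

```shell
# Sketch only: file contents are copied from the task description;
# the temp directory is an assumption made for this demo.
dir=$(mktemp -d)

cat > "$dir/Email_Addresses.csv" <<'EOF'
"SURNAME";"GIVENNAME";"EMAILADDRESS"
"Surname1";"Givenname1";"Email_Address1"
"Surname2";"Givenname2";"Email_Address2"
"Surname3";"Givenname3";"Email_Address3"
"Surname4";"Givenname4";"Email_Address4"
EOF

printf '%s\n' Email_Address1 Email_Address4 > "$dir/Bounces.txt"

result=$(cd "$dir" && awk '
BEGIN {
    FS = OFS = ";"
    BOUNCES_FILE = "Bounces.txt"
    # Load the bounced addresses, quoted so they compare equal to $3.
    while ((getline < BOUNCES_FILE) > 0)
        bounces["\"" $0 "\""]
    close(BOUNCES_FILE)
    # Consume the header line of the main input while still in BEGIN.
    if (getline > 0)
        print $0, "\"BOUNCE\""
}
# Every remaining record is a data line: append "1" or "0".
{ print $0, "\"" ($3 in bounces) "\"" }
' Email_Addresses.csv)

printf '%s\n' "$result"
rm -rf "$dir"
```

The printed lines should match the "This should be the output" listing in the original post.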

If the bounces file is bigger than virtual RAM, you have a design issue;
other than that, I'd stop here.

    - Dan



Wed, 09 Feb 2005 09:40:36 GMT  
Syntax error below...

This line:
Quote:
>     { print $0, "\"" $3 in bounces "\"" }

should read
     { print $0, "\"" ($3 in bounces) "\"" }
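
Why the parentheses matter, as a small sketch: in awk, string concatenation binds more tightly than `in`, so without them the opening quote is glued onto $3 before the membership test is made. The array key and the strings below are invented for the demonstration, and plain `awk` is assumed.

```shell
# Sketch: the key "xk" and the strings "x" and "k" are made up.
out=$(awk 'BEGIN {
    bounces["xk"]                   # the only key in the array
    # Unparenthesized: "x" and "k" are concatenated first,
    # so this asks whether "xk" is a key.
    print ("x" "k" in bounces)
    # Parenthesized: only "k" is looked up (not a key, so 0),
    # and the result is then appended to "x".
    print "x" ("k" in bounces)
}')
printf '%s\n' "$out"
```

The first line printed is 1 and the second is x0, which is the difference between the two groupings.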

    - Dan



Wed, 09 Feb 2005 13:01:17 GMT  
Thanks - just what the awk doctor ordered...

* Dan Haygood

Quote:


>> [first awk script]
>> Could someone please comment on style/readability/technique and
>> performance?! (I think there are some "dos and don'ts" in awk.)

> [...] Most things here are personal issues, like indentation and
> single-statement braces.

Formatting is *very* important, but I'm still testing "styles". It
should be concise and offer maximum readability to yourself (the one
that writes the code) and to the others who are trying to understand
your code.

Quote:
> 2. Remove unnecessary braces.

Yes, but you often forget to place them if your single then/else
action grows.
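
A minimal sketch of that trap, with made-up input (three one-letter records) and a made-up counter named `seen`; plain `awk` is assumed:

```shell
# Unbraced: only the print belongs to the if; the indentation misleads.
unbraced=$(printf 'a\nb\nc\n' | awk '
{
    if (NR == 1)
        print "header: " $0
        seen++              # runs for every record, not just the first
}
END { print "seen=" seen }')

# Braced: both statements are controlled by the condition.
braced=$(printf 'a\nb\nc\n' | awk '
{
    if (NR == 1) {
        print "header: " $0
        seen++
    }
}
END { print "seen=" seen }')

printf '%s\n' "$unbraced" "$braced"
```

The unbraced version ends with seen=3, the braced one with seen=1.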

Quote:
> 3. In some people's eyes, FS and OFS are so tightly coupled that if they are
> the same, cascaded assignments are just fine.

>        FS = OFS = ";"

You're right, of course.

Quote:
> 4. getline, it was once pointed out to me, is tristate; returning -1 on
> error, 0 at eof, and 1 if successful.

So my version would be an endless loop if "Bounces.txt" did not exist.
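
The three return values can be checked with a small sketch. The file names, the `good` and `missing` variables, and the temp directory are assumptions for the demo; plain `awk` is assumed as well.

```shell
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/present.txt"

out=$(awk -v good="$dir/present.txt" -v missing="$dir/no_such_file.txt" 'BEGIN {
    print "missing=" (getline line < missing)   # cannot open the file: -1
    print "read="    (getline line < good)      # a record was read: 1
    getline line < good                          # consume the second record
    print "eof="     (getline line < good)      # nothing left: 0
    # A bare "while (getline line < missing)" would spin forever,
    # because -1 counts as true.
}')

printf '%s\n' "$out"
rm -rf "$dir"
```

Comparing the result with `> 0` stops the loop for both the 0 and the -1 case.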

Quote:
> 5. Eliminate redundancy in expressions:

>        if (NR == 1)
>            temp = "BOUNCE"
>        else
>            temp = $3 in bounces
>        print $0, "\"" temp "\""

In fact, I noticed that and tried something like...
printf "%s", $0 OFS "\""
printf "%s", NR == 1 ? "BOUNCE" : $3 in bounces
print "\""

...but...

Quote:
>        print $0, "\"" (NR == 1 ? "BOUNCE" : $3 in bounces) "\""

...is definitely how it should've been done.

Quote:
> 8. From there, you could suck the first-line case up into the BEGIN rule.
> [...]
> "if (getline > 0)" gets the first line of your input file into $0, while
> still in the BEGIN block.  Now, you don't have to do an equality test for
> each data line.

Yes, the testing in the loop was ugly.

Quote:
> If the bounces file is bigger than virtual RAM, you have a design issue;

First I tried...
system("grep '" substr($3, 2, length($3) - 2) "' Bounces.txt > /dev/null")

...which gives grep's exit status (as a substitute for "$3 in
bounces"), but that was too slow for big files.

I even tried "match()" as replacement for grep. But...

BOUNCES = system("cat Bounces.txt")
match(BOUNCES, $3)
  or
match(system("cat Bounces.txt"), $3)

...didn't work.
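
A probable reason, as far as I can tell: system() runs the command and returns its exit status, while the command's output goes straight to standard output and never ends up in the awk variable. A hedged sketch of capturing the output through a pipe instead; the `cmd`, `all` and `file` names and the temp directory are made up for the demo, and index() is used rather than match() so the address is searched as a plain string, not as a regular expression.

```shell
dir=$(mktemp -d)
printf '%s\n' Email_Address1 Email_Address4 > "$dir/Bounces.txt"

out=$(awk -v file="$dir/Bounces.txt" 'BEGIN {
    # system() hands back the exit status of the command, not its output.
    status = system("true")
    print "status=" status

    # To capture output, read it from a pipe, one line per getline.
    cmd = "cat \"" file "\""
    while ((cmd | getline line) > 0)
        all = all line "\n"
    close(cmd)

    # Plain substring search over the collected text.
    print "found=" (index(all, "Email_Address4\n") > 0)
    print "found=" (index(all, "Email_Address9\n") > 0)
}')

printf '%s\n' "$out"
rm -rf "$dir"
```

This is only to show where the output goes; the array lookup with `in` remains the faster approach for large files.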

Thanks again, Thorsten



Wed, 09 Feb 2005 17:50:16 GMT  


I generally agree with what Dan said. I'll point out that some people
like unnecessary braces because you don't have to understand the
scope rules as well. I personally dislike clutter.

% { if (NR == 1)
% {        print $0, "\"BOUNCE\"" }
%   else { print $0, "\"" ($3 in bounces) "\"" }}

Granted that it's better if you take care of the NR == 1 case in a BEGIN
block, in cases where you want to apply a test to each input record,
the normal awk style is to do the test in a pattern:

  NR == 1 { print $0, "\"BOUNCE\""; next }
  { print $0, "\"" ($3 in bounces) "\"" }
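
A minimal sketch of the pattern style run end to end. As assumptions for the demo, the input is a shortened copy of the sample data fed through a pipe, and the bounces array is filled with a single hard-coded key instead of being loaded from Bounces.txt; plain `awk` is assumed.

```shell
out=$(printf '%s\n' \
    '"SURNAME";"GIVENNAME";"EMAILADDRESS"' \
    '"Surname1";"Givenname1";"Email_Address1"' \
    '"Surname2";"Givenname2";"Email_Address2"' |
awk '
BEGIN {
    FS = OFS = ";"
    bounces["\"Email_Address1\""]       # stand-in for loading Bounces.txt
}
# Header rule: next skips the catch-all rule for this record.
NR == 1 { print $0, "\"BOUNCE\""; next }
        { print $0, "\"" ($3 in bounces) "\"" }
')

printf '%s\n' "$out"
```

Without the `next`, the header line would also fall through to the second rule and be printed twice.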

I personally find it more readable when closing braces go on a line by
themselves, except in the case where the entire block goes on one line.

For instance, if I were to spread my NR == 1 action over two lines,
I would write it like this:

  NR == 1 { print $0, "\"BOUNCE\""
            next
  }

rather than this

  NR == 1 { print $0, "\"BOUNCE\""
            next }
--

Patrick TJ McPhee
East York  Canada



Thu, 10 Feb 2005 02:45:28 GMT  



Hi, again -

Clutter, whitespace, and the braced single statement...

Quote:
> I personally dislike clutter.

I think there are two issues with clutter:  First, it makes code more
difficult to read.  If the code is more difficult to read, the idea it is
trying to express is harder to understand.

The second falls out of the first: If you can clearly see that two sections
of code express the same idea, you now have the opportunity to analyse your
code.  You can:
    - see that the ideas are expressed equivalently.  This allows someone
else looking at your code some common ground, where only the details are
expressed differently.  It also allows you some insight as to why, exactly,
there are two sections of code, instead of just one.
    - check that similar code should be similar:  For instance, if you get
to this point in your code, with the for loops stacked on top of each other:
        if (alpha)
            for (i = 1; i <= n; i++) doAlpha(a[i])
        else
            for (i = 1; i < n; i++) doNumeric(r[i])
you can clearly see that you may have forgotten the last element of
r[]...and verify, perhaps, that this is really how it is SUPPOSED to be.
And, it reminds you that you might want to document this for someone else
coming along after you.
    - remove truly duplicated code that varies only in condition or
environment to a single routine to make life easier for those coming after
you.

There is a heavy emphasis on "someone else" and "coming after you" because
(in my experience, at least) that person will be you...after you've
forgotten all the details, and most of the reasons, why you did something.

Oh yeah - whitespace is often "clutter", too.  While sometimes providing a
break in the train of thought, like indentation at the start of a paragraph,
remember it also serves to spatially separate the ideas you are trying to
express.

Imagine a keynote speaker saying one or two sentences, then breaking for
five minutes for everyone to refill their horse ovaries plates.  Another
sentence, another snack break.  You'd have to take notes to understand the
whole speech in sequence.  And, so will your maintenance programmers, if you
structure your source code the same way.

It's also much easier to understand a routine when you can see the whole
thing in the editor at once, without scrolling...my rule of thumb was 20
lines, but with the advent of GUIs, that has increased a bit.  But if you
double-space your code, as some (particularly C programmers, for some
reason) are wont to do, you end up with only 10 statements on the screen at
a time.  And, yes, I often count braces on their own line as
"double-spacing".  Except sometimes--when maintaining code, it's nice if
end braces line up with the top of their block, and this requires braces (in
C-family languages) to be on their own line.

One last note:  Some people use single-statement braces when they are
developing, because they never know when they might want more than just one
statement in a branch or loop body.  If you get in the habit of ALWAYS
bracing, then adding the second statement to a body requires a lot less
typing.  I like to remove these extra braces when I'm "done", because I
think it's easier to read.  Others like to add them if they are missing, and
then leave them in for the next fellow, because THEY think it's easier to
read.

So I guess, to each, their own--until the organization's Code Dictator
installs a pretty-printer, and makes everyone unhappy.

    - Dan



Thu, 10 Feb 2005 04:15:07 GMT  
 