awk problem sorting and deduping file 
 awk problem sorting and deduping file

I am quite new to unix.

I have a sort statement below which pipes its output to nawk:

 sort -T /sortfs -r -k 6,7 -k 4.5,4.8 -k 4.3,4.4 -k 4.1,4.2 -k 4.10,4.15 \
      -t'|' $WK1 |
 nawk -F'|' '$6 != chain_num || $7 != branch_num {
      chain_num = $6 ; branch_num = $7 ; print $0 }' |
 sed 's/  *|/|/g' |
 sed 's/01010000:000000/09099999:000000/g' > $WK4

I know that the statement is unwieldy, which is why I am asking for
advice/help. The nawk step dedupes records on a particular key (fields 6
and 7): it prints a record only when that key differs from the previous one.
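
To illustrate what the nawk stage does, here is a small sketch on made-up data. The three records, the function name dedupe_demo and the use of plain awk instead of nawk are assumptions for the example only. The input is already ordered so that records with the same key sit next to each other, which is what the sort stage provides.

```shell
# Hypothetical sample: fields 6 and 7 form the key, field 4 stands in for the date.
dedupe_demo() {
    printf '%s\n' \
        'a|b|c|2003|e|C1|B1|new' \
        'a|b|c|2002|e|C1|B1|old' \
        'a|b|c|2003|e|C1|B2|only' |
    awk -F'|' '$6 != chain_num || $7 != branch_num {
        # Key changed: remember it and keep this record.
        chain_num = $6; branch_num = $7; print $0
    }'
}
dedupe_demo
```

Only the first record of each run of equal keys survives. One thing to watch: a first record whose fields 6 and 7 are both empty would be dropped, because the uninitialised variables compare equal to the empty string.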

I have had a recommendation to use the following statement instead:

nawk 'BEGIN {
    command = "your sort command"
    while( ( command | getline ) > 0 ) {
       ...
       # further processing
       ...
    }
  }
  { # body
    exit(0)
  }'

If I use this way of doing it how do I check for errors individually for
sort, nawk, sed?
How do I do the sed commands within nawk?
What is the difference between the begin and body of the nawk command?

The file I am dealing with is about 1G. Record length is 1729 bytes, 187
fields and about 578,000 records.

Any help would be much appreciated.

Aaron



Sun, 09 Nov 2003 01:08:16 GMT  
 awk problem sorting and deduping file

Quote:

> I am quite new to unix.

> I have a sort statement below which pipes it's output to nawk ......

> sort -T /sortfs -r -k 6,7 -k 4.5,4.8 -k 4.3,4.4 -k 4.1,4.2 -k 4.10,4.15
> -t'|' $WK1 | nawk -F'|' '$6 != chain_num || $7 != branch_num { chain_num =
> $6 ; branch_num = $7 ; print $0 }' | sed 's/  *|/|/g' | sed
> 's/01010000:000000/09099999:000000/g' > $WK4

> I know that the statement is unwieldy that is why I am asking for
> advice/help. The nawk is used to dedupe records based on a particular key.

I'm afraid I don't know what "dedupe" means. If it means "make unique"
or "avoid duplicates", have a look at the "-u" option of "sort".
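
A rough sketch of what "-u" does with a key, on made-up records (the data and the name sort_u_demo are assumptions for the example):

```shell
# Hypothetical sample: the first two records share the key in fields 6 and 7.
sort_u_demo() {
    printf '%s\n' \
        'a|b|c|2002|e|C1|B1' \
        'a|b|c|2003|e|C1|B1' \
        'a|b|c|2003|e|C1|B2' |
    sort -u -t'|' -k 6,7    # keep one line per distinct key
}
sort_u_demo
```

Two caveats: "-u" compares on all the keys given, so with the date keys from field 4 added it would no longer collapse records that differ only in the date, and which line of a set of equal keys is kept is not something to rely on.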

The two "sed" calls in your pipe are unnecessary; you can replace them
with the sub() or gsub() functions of nawk, or at least put them both
in one call to sed:

  sed 's/  *|/|/g;s/01010000:000000/09099999:000000/g'

Or in nawk, before your "print":

  gsub(/  *[|]/,"|"); gsub(/01010000:000000/,"09099999:000000")
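
As a minimal sketch of those two gsub() calls on one made-up line (the data and the name trim_demo are assumptions). Note that in an awk regular expression a bare "|" means alternation, so the pipe is written as a bracket expression here:

```shell
trim_demo() {
    printf '%s\n' 'x   |y |01010000:000000|z' |
    awk '{
        gsub(/  *[|]/, "|")                          # strip blanks before each pipe
        gsub(/01010000:000000/, "09099999:000000")   # swap the placeholder timestamp
        print
    }'
}
trim_demo
# prints: x|y|09099999:000000|z
```

In the real pipeline these calls would go inside the existing action, after chain_num and branch_num have been set and before the print.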

Quote:
> I have had a recommendation to use the following statement instead:

> nawk 'BEGIN {
>    command = "your sort command"
>    while( ( command | getline ) > 0 ) {
>       ...
>       # further processing
>       ...
>    }
>  }
>  { # body
>    exit(0)
>  }'

I don't see what that would gain you in your case. The body part
with the lonely "exit(0)" is superfluous.
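
For reference, a stripped-down sketch of that construct with only a BEGIN block. The command string is a hypothetical stand-in for the real sort command, and getline_demo is just a name for the example:

```shell
getline_demo() {
    awk 'BEGIN {
        # Stand-in for "your sort command"; it prints b and a, then sorts them.
        command = "printf \"%s\\n\" b a | sort"
        # Read the command output one line at a time into the variable line.
        while ((command | getline line) > 0) {
            print "got: " line
        }
        close(command)
    }'
}
getline_demo
```

Because the program consists of BEGIN only, awk never tries to read standard input, so no body with exit is needed.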

Quote:
> If I use this way of doing it how do I check for errors individually for
> sort, nawk, sed?

Why do you want to? If something goes wrong, the whole pipe dies.
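
If you do want a separate check per stage, bear in mind that a plain sh pipeline only reports the exit status of its last command. One possible sketch is to run the stages one after another through a temporary file. The function name run_stages, the return codes and the use of mktemp (widely available, but not required by POSIX) are assumptions, and the sort keys are cut down to the dedupe key for brevity:

```shell
# Usage (hypothetical): run_stages "$WK1" "$WK4"
run_stages() {
    src=$1
    dst=$2
    tmp=$(mktemp) || return 1

    # Stage 1: sort, checked on its own.
    if ! sort -t'|' -r -k 6,7 "$src" > "$tmp"; then
        echo "sort failed" >&2
        rm -f "$tmp"
        return 2
    fi

    # Stage 2: dedupe on fields 6 and 7, checked on its own.
    if ! awk -F'|' '$6 != c || $7 != b { c = $6; b = $7; print }' "$tmp" > "$dst"; then
        echo "awk failed" >&2
        rm -f "$tmp"
        return 3
    fi

    rm -f "$tmp"
}
```

The price is an extra copy of the sorted data on disk, which matters for a file of about 1G.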

Quote:
> How do I do the sed commands within nawk?

You don't, at least not if you only want to search and replace.

Quote:
> What is the difference between the begin and body of the nawk command?

The BEGIN part is only executed once, before the first line of the first
file. The "normal body" is executed for each line of each file.
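
A tiny sketch of the difference, on two made-up input lines (begin_demo is just a name for the example; the END block is standard awk and runs once after the last line):

```shell
begin_demo() {
    printf '%s\n' one two |
    awk 'BEGIN { print "start" }                   # once, before any input
         { print NR ": " $0 }                      # once per input line
         END { print "done after " NR " lines" }'  # once, after all input
}
begin_demo
# prints:
# start
# 1: one
# 2: two
# done after 2 lines
```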

Quote:
> The file I am dealing with is about 1G.

Oh, this will probably be much fun with "sort" :-(

Regards...
                Michael



Sun, 09 Nov 2003 03:07:52 GMT  
 awk problem sorting and deduping file
I didn't look at his code much, but I am guessing that "dedupe" in this
instance means keeping one record for each unique combination of fields 6
and 7. Really, he should provide a sample of the data and explain what he
wants done with it, rather than having us analyse his code and figure out
whether there is a better way of doing it.



Sun, 09 Nov 2003 05:27:51 GMT  
 