matching records in a comma delimited file 
 matching records in a comma delimited file


% gawk -F, 'ARGIND == 1 { x[$0];next } $1 in x' smallfile.txt bigfile.txt > filteredrecords.txt

What this does is load the contents of smallfile.txt into an
array, then look up the first field of each record in bigfile.txt
in that array, and print the ones that are found.

Here's a more comprehensive explanation:

An awk program is made up of a series of patterns and actions, with the
actions enclosed in braces. Roughly speaking, each pattern is evaluated
against each record of the files listed after the program, and if it
evaluates to a non-zero number or non-zero-length string, the associated
action is executed. There are two special patterns, BEGIN, which is true
before any of the files listed on the command-line have been processed,
and END, which is true after all the files have been processed.

Either the pattern or the action can be omitted. If the pattern is
omitted, the action is executed for every input record. If the
action is omitted, the default action is to print the input record.
BEGIN and END patterns must have an action; these pattern/action pairs
are sometimes called BEGIN and END blocks.
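
To make the four cases concrete, here is a minimal sketch. The file fruit.txt and its contents are made up for illustration, and the variable name count is arbitrary.

```shell
# Hypothetical input: three lines written to a temporary file.
tmpdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmpdir/fruit.txt"

out=$(awk '
    BEGIN { print "start" }            # runs before any input is read
    /an/                               # pattern only: matching records are printed
    { count++ }                        # action only: runs for every record
    END { print count " records" }     # runs after all input is read
' "$tmpdir/fruit.txt")

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

Only banana matches /an/, so the output should be the line start, then banana, then 3 records.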

There's a variable called ARGC and another variable called ARGV, which
contain, respectively, the number of command-line arguments meant
to be processed by the script (roughly, the list of files), and the
arguments themselves. When all the BEGIN blocks have been executed,
awk starts evaluating the ARGV array from 1 to ARGC-1. Any elements which
have the form

 variable=value

cause the value to be assigned to the named variable. Any other string
which is not zero-length is treated as a file name. The file is opened,
then each record is read and all the patterns are executed against it
as noted above.
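
A small sketch of the variable=value form, assuming two made-up files a.txt and b.txt and a made-up variable called tag. Each assignment takes effect when awk reaches it in ARGV, just before the following file is opened.

```shell
tmpdir=$(mktemp -d)
printf 'one\n' > "$tmpdir/a.txt"
printf 'two\n' > "$tmpdir/b.txt"

# tag=first is assigned just before a.txt is opened,
# and tag=second just before b.txt is opened.
out=$(awk '{ print tag ": " $0 }' \
    tag=first "$tmpdir/a.txt" tag=second "$tmpdir/b.txt")

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

This should print first: one followed by second: two.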

In the example, we'll have
 ARGC = 3        # OK, it's the number of files plus 1.
 ARGV[0] = "awk" # usually. anyway, it's unlikely to be useful
 ARGV[1] = "smallfile.txt"
 ARGV[2] = "bigfile.txt"

awk will open smallfile.txt, read each record (by default, delimited by
a new-line), and apply the tests

 ARGIND == 1
and
 $1 in x

against each record. If the first test is true for some record, these
statements are executed:

  x[$0]
  next

If the second test is true, the default action, printing the input record,
is performed.

To understand most of that, you need to know something about arrays.
Arrays in awk can have non-numeric indices. This style of array is sometimes
called `associative'. Associative arrays are often used to store their
indices rather than the value associated with each index. awk facilitates
this by adding an index to the array any time you refer to it. The
statement

  x[$0]

associates the value "" with the index $0 ($0 is the input record). i.e.,
it adds each input record to the list of indices of the associative array.

The operator `in' tests whether its left argument is defined as an
index of its right argument, which must be an array. So the pattern

  $1 in x

is true for any value of $1 which matches a value of $0 against which
  x[$0]
had been executed.
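
The array behaviour can be checked without any input files. This is only a sketch; the indices "red" and "blue" are invented for the example.

```shell
out=$(awk 'BEGIN {
    x["red"]                        # a bare reference creates the index "red"
    if ("red" in x)     print "red is an index"
    if (!("blue" in x)) print "blue is not an index"
    if (x["red"] == "") print "the value stored under red is empty"
    n = 0
    for (k in x) n++                # the "in" tests above did not create "blue"
    print n " index in total"
}')

printf '%s\n' "$out"
```

Note that testing with `in' never adds an index, whereas referring to x["blue"] directly would.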

awk has several special variables which are assigned values at the whim
of the interpreter. ARGIND is a non-standard special variable which is
set to the index into the ARGV array of the current file being processed.
It works only with gawk, to the best of my knowledge. The test

 ARGIND == 1

will be true only for the first file being processed, i.e., smallfile.txt.
A more portable way to do the same test is

 FILENAME == ARGV[1]

FILENAME is another special variable which contains the name of the file
being processed. It has the benefit of working with any* awk.
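
Here is a sketch of the whole command written with the portable test, comparing FILENAME with the first command-line operand, ARGV[1]. The sample keys and records are invented. One assumption to be aware of: the comparison is by name, so it would misbehave if the same file name were given twice or if the first operand were a variable=value assignment.

```shell
tmpdir=$(mktemp -d)
printf 'k2\nk4\n' > "$tmpdir/smallfile.txt"
printf 'k1,alpha\nk2,beta\nk3,gamma\nk4,delta\n' > "$tmpdir/bigfile.txt"

out=$(awk -F, '
    FILENAME == ARGV[1] { x[$0]; next }   # first file: remember each whole record
    $1 in x                               # later files: print if field 1 was remembered
' "$tmpdir/smallfile.txt" "$tmpdir/bigfile.txt")

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

With this data only the k2 and k4 records of bigfile.txt should be printed.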

next is a control-flow keyword which says to skip the remaining pattern/action
pairs and read the next input record. It's here to ensure the second
pattern is never tested against records from the first file.

Putting all that together, we have

 # make each record in smallfile.txt an index to the array x
 ARGIND == 1 { x[$0]; next }

 # for each record of bigfile.txt, if its first field is an index to the array x,
 # and was therefore a record in smallfile.txt, print it
 $1 in x

I hope this was helpful.

* there are a few significant portability issues with the awk
  language. The first is that the language was significantly
  revised in 1987, leading to `old' awk and `new' awk. Almost
  every system provides `new' awk as its default, but Sun
  persists in shipping `old' awk. The best advice is to put
  /usr/xpg4/bin first in your path when working with Solaris,
  and write a {*filter*} letter to your Sun sales rep. I've also
  seen this on Dynix, but I use a very old version of Dynix.
  The other problems are that gawk, the GNU implementation,
  contains a handful of language extensions, and that POSIX
  introduced some changes in regular expressions.

  All this is to say that by `any' awk, I mean any awk except
  for `old' awk, which may be the default on your platform
  for no good reason.
--

Patrick TJ McPhee
East York  Canada



Sun, 09 Jan 2005 11:27:43 GMT  
 matching records in a comma delimited file


...

Quote:
>* there are a few significant portability issues with the awk
>  language. The first is that the language was significantly
>  revised in 1987, leading to `old' awk and `new' awk. Almost
>  every system provides `new' awk as its default, but Sun
>  persists in shipping `old' awk. The best (**) advice is to put
>  /usr/xpg4/bin first in your path when working with Solaris,
>  and write a {*filter*} letter to your Sun sales rep. I've also
>  seen this on Dynix, but I use a very old version of Dynix.
>  The other problems are that gawk, the GNU implementation,
>  contains a handful of language extensions, and that POSIX
>  introduced some changes in regular expressions.

(**) Better advice is to get (a current version of) GAWK, compile it on all
the platforms you will use (consider it as essential as air and water),
become familiar and happy with the nifty extensions it provides, and then
stop worrying about all this portability nonsense that makes up 90% of the
traffic in this NG (1).  To me, the whole point of GAWK is to provide a
common, cross-platform solution that saves people from playing the Russian
Roulette that is the norm when using vendor-supplied AWKs (2).

(1) It occurred to me the other day that we don't discuss much AWK here;
rather we continually ponder the existential question: "What is AWK?"

(2) Just to make this explicit, consider that the following 3 platforms
probably account for a very large percentage of AWK use (yes, I know, you
are going to say "But I use Dynix", well, fine...):

(presented in no particular order)
        a) Linux
        b) Solaris
        c) DOS and its spawn (I.e., PC OSes)

Now just look at the RR aspects of using "awk" on these platforms.  On
Linux, you will almost certainly get GAWK, although I know of at least one
distro where the default is MAWK, and you have to explicitly install GAWK
to get it.  On Solaris, you get, well, it's been well documented what you
get under Solaris.  And, on the PC, you get God knows what from MicroSludge.

How much simpler life is if you just use GAWK on all these platforms!  It
is, after all, the second best implementation of AWK in the world!

The point is, I really don't see why you should contort your code to
accommodate crappy installations, when you should always be able to get GAWK
for your platform.



Tue, 11 Jan 2005 00:59:50 GMT  
 matching records in a comma delimited file
Thanks a Lot Patrick for the extensive explanation.

I am a little confused about the ARGIND variable. I tried looking for simple
explanations on the web but couldn't find one.

Can you or some Unix Guru please explain what ARGIND is... with a small
simple example if possible.

It would greatly help me in understanding the issue in hand.

Thanks a Lot for all the Help
Ronnie Yours




Tue, 11 Jan 2005 03:05:59 GMT  
 matching records in a comma delimited file

Quote:

>Thanks a Lot Patrick for the extensive explanation.

>I am a little confused about the ARGIND variable. I tried looking for simple
>explanations on the web but couldn't find one.

As Patrick and others have noted, ARGIND is GAWK-specific.  So, you have to
look in a GAWK manual (or do "man gawk") to find out what it does and how
it is used.

Both GAWK & TAWK have direct ways of telling which file you are on (and of
the two, GAWK's method is preferable); various kludges can be used to get
this functionality in other implementations.



Tue, 11 Jan 2005 03:34:00 GMT  
 matching records in a comma delimited file
Another thing which is confusing me is "How does awk know that it has to
check the values in the small file against the values in the FIRST FIELD of
the big file and not the 3rd or the 4th field".

I am sorry about constantly bugging you with questions that might be trivial
to most people but I was basically a windows guy throughout my career and am
in the process of transition to Unix.

The solution, although it works great, is in shorthand, and because of that
it's difficult for me to understand properly. Although I am getting a feel
for what's happening, I am still not fully comfortable.

Is it possible for someone to post the long form of the solution, where
every step is visible to me?
    gawk -F, 'ARGIND == 1 { x[$0];next } $1 in x' smallfile.txt bigfile.txt > filteredrecords.txt

Also, what's the best way to learn awk and shell scripting in general? Any
suggestions for good books or web sites?

Thanks
Ronnie Yours
Oracle DBA



Tue, 11 Jan 2005 04:31:06 GMT  
 matching records in a comma delimited file

[in a previous message, what is ARGIND?]

We have a command-line

 awk -f a b c d

This reads a script from a file called a, and sets up ARGC and ARGV like
this:

  ARGC = 4
  ARGV[0] = "awk" # or something equally useless
  ARGV[1] = "b"
  ARGV[2] = "c"
  ARGV[3] = "d"

Next, it processes all the BEGIN blocks in the script (file a). Having
done that, it loops through ARGV from 1 to ARGC, reads each input
record from the files it finds there, and applies each of the patterns
against each input record. Please note that this is all `roughly speaking',
since you can do various things to affect the flow of control.

In standard awk, there's no way of knowing which command line argument
is being processed at any time. gawk has a special variable called
ARGIND which has the value of the current index into ARGV. You could
think of the pattern/action loop as being like this:

 for (ARGIND = 1; ARGIND < ARGC; ARGIND++) {
    # I can sense right now that this next line is not helping to
    # simplify the concept. Sorry.
    while ((getline < ARGV[ARGIND]) > 0) {
      if (pattern1) action1
      if (pattern2) action2
      ...
    }
 }

% Another thing which is confusing me is "How does awk know that it has to
% check the values in the small file against the values in the FIRST FIELD of
% the big file and not the 3rd or the 4th field".

This is the line from the script

  $1 in x

$1 means `the FIRST FIELD' of the record. We know that it's coming from
the `big file' because the last statement in the ARGIND == 1 statement
block is `next' which says to skip to the next record and start applying
patterns over again (it's like a continue in the while loop I wrote above).

% Is it possible for someone to post the long form/hand of the solution where
% every step is visible to me.

The only real short-hand I see here is the omission of { print } in the
second pattern/action. I suppose `x' could be called `entryfromsmallfile'
to make it explicit what it's doing.
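
As a sketch of that longer form, with the explicit { print } and the descriptive array name, here is a variant that matches on the third field instead of the first, to show that the field number in the pattern is what picks the column. The sample data is invented, and the portable FILENAME == ARGV[1] test stands in for ARGIND == 1.

```shell
tmpdir=$(mktemp -d)
printf 'k2\n' > "$tmpdir/smallfile.txt"
printf 'a,b,k2\nc,d,k9\nk2,e,f\n' > "$tmpdir/bigfile.txt"

out=$(awk -F, '
    # Records from the first file become indices of the array.
    FILENAME == ARGV[1] { entryfromsmallfile[$0]; next }
    # Records from later files are printed when field 3 is an index.
    $3 in entryfromsmallfile { print }
' "$tmpdir/smallfile.txt" "$tmpdir/bigfile.txt")

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

Only a,b,k2 should come out: the last record has k2 in field 1, which this version does not look at.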

When I do something like this, I tend to read in the keyword file in a
BEGIN block. The BEGIN block would be something like this:
 BEGIN {
     # load keywords into array (unless -F was given, FS is still the
     # default here, so $1 is the first whitespace-delimited word)
     while ((getline < ARGV[1]) > 0)
       keywords[$1]
     close(ARGV[1])
     FS = ","
     delete ARGV[1] # the pattern/action loop then starts at ARGV[2], roughly speaking
 }

 # test for keywords from each file on the command-line, except the first one
 $1 in keywords { print }
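
A runnable sketch of the BEGIN-block approach, with two data files to show that every file after the keyword file is tested. The file names and contents are invented. It differs from the block above in one respect, which is my own choice: it reads each keyword line into a variable (here called line) so the whole line is the key regardless of FS.

```shell
tmpdir=$(mktemp -d)
printf 'k2\nk4\n' > "$tmpdir/keys.txt"
printf 'k1,alpha\nk2,beta\n' > "$tmpdir/data1.txt"
printf 'k4,delta\nk5,epsilon\n' > "$tmpdir/data2.txt"

out=$(awk '
    BEGIN {
        FS = ","
        while ((getline line < ARGV[1]) > 0)   # read the keyword file by hand
            keywords[line]
        close(ARGV[1])
        delete ARGV[1]                         # main loop now skips the keyword file
    }
    $1 in keywords { print }
' "$tmpdir/keys.txt" "$tmpdir/data1.txt" "$tmpdir/data2.txt")

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

This should print k2,beta from the first data file and k4,delta from the second.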

% Also whats the best way to learn awk and shell scripting in general. Any
% suggestion for Good Books/ Web sites.

Keep solving real problems. For instance, if there's some processing
you need to do that you'd normally do in pl/sql, you could dump it out
to a file and write some awk scripts instead.

--

Patrick TJ McPhee
East York  Canada



Tue, 11 Jan 2005 09:48:20 GMT  
 matching records in a comma delimited file


Errata?

Quote:
>it loops through ARGV from 1 to ARGC

    s/b
it loops through ARGV from 1 to ARGC-1

Quote:
>there's no way of knowing which...argument is being processed

    s/b
there's no built-in way of knowing which...argument is being processed

ARGIND can be faked reasonably well with an initial (expr){action}
    FNR == 1 {++ARGIND}
or
    FILENAME != lastFILENAME {++ARGIND; lastFILENAME = FILENAME}
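
A sketch of the first form in use, with a differently named counter (fileno, a made-up name) so it cannot clash with gawk's own ARGIND. As far as I can tell this counts files rather than ARGV positions, so it will disagree with the real ARGIND when there are variable=value operands or empty files on the command line.

```shell
tmpdir=$(mktemp -d)
printf 'a1\na2\n' > "$tmpdir/one.txt"
printf 'b1\n' > "$tmpdir/two.txt"

out=$(awk '
    FNR == 1 { fileno++ }        # first record of each file bumps the counter
    { print fileno ": " $0 }
' "$tmpdir/one.txt" "$tmpdir/two.txt")

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

Both records of one.txt should be labelled 1 and the record of two.txt labelled 2.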



Tue, 11 Jan 2005 15:14:02 GMT  
 