matching records in a comma delimited file
Patrick TJ McPhee #16 / 22
% gawk -F, 'ARGIND == 1 { x[$0]; next } $1 in x' smallfile.txt bigfile.txt > filteredrecords.txt

What this does is load the contents of smallfile.txt into an array, then look up the first field of each record in bigfile.txt in that array, and print the ones that are found.

Here's a more comprehensive explanation:

An awk program is made up of a series of patterns and actions, with the actions enclosed in braces. Roughly speaking, each pattern is evaluated against each record of the files listed after the program, and if it evaluates to a non-zero number or non-zero-length string, the associated action is executed. There are two special patterns: BEGIN, which is true before any of the files listed on the command line have been processed, and END, which is true after all the files have been processed.

Either the pattern or the action can be omitted. If the pattern is omitted, the action is executed for every input record. If the action is omitted, the default action is to print the input record. BEGIN and END patterns must have an action; these are sometimes called BEGIN and END blocks.

There's a variable called ARGC and another called ARGV, which contain, respectively, the number of command-line arguments meant to be processed by the script (roughly, the list of files) and the arguments themselves. When all the BEGIN blocks have been executed, awk starts evaluating the ARGV array from 1 to ARGC-1. Any element which has the form

    variable=value

causes the value to be assigned to the named variable. Any other string which is not zero-length is treated as a file name. The file is opened, then each record is read and all the patterns are evaluated against it as noted above.

In the example, we'll have

    ARGC = 3        # OK, it's the number of files plus 1.
    ARGV[0] = "awk" # usually.
                    # anyway, it's unlikely to be useful
    ARGV[1] = "smallfile.txt"
    ARGV[2] = "bigfile.txt"

awk will open smallfile.txt, read each record (by default, delimited by a newline), and apply the tests

    ARGIND == 1

and

    $1 in x

against each record. If the first test is true for some record, these statements are executed:

    x[$0]
    next

If the second test is true, the default action, printing the input record, is performed.

To understand most of that, you need to know something about arrays. Arrays in awk can have non-numeric indices. This style of array is sometimes called `associative'. Associative arrays are often used to store their indices rather than the value associated with each index. awk facilitates this by adding an index to the array any time you refer to it. The statement

    x[$0]

associates the value "" with the index $0 ($0 is the input record), i.e., it adds each input record to the list of indices of the associative array.

The operator `in' tests whether its left argument is defined as an index of its right argument, which must be an array. So the pattern

    $1 in x

is true for any value of $1 which matches a value of $0 against which x[$0] had been executed.

awk has several special variables which are assigned values at the whim of the interpreter. ARGIND is a non-standard special variable which is set to the index into the ARGV array of the current file being processed. It works only with gawk, to the best of my knowledge. The test

    ARGIND == 1

will be true only for the first file being processed, i.e., smallfile.txt. A more portable way to do the same test is

    FILENAME == ARGV[1]

FILENAME is another special variable which contains the name of the file being processed. It has the benefit of working with any* awk.

next is a control-flow keyword which says to skip the remaining pattern/action pairs and read the next input record. It's here to ensure the second pattern is never tested against records from the first file.
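To see the one-liner's logic run end to end, here is a self-contained sketch with invented sample data. It uses the portable FNR == NR test (true only while the first file is being read, since FNR restarts at 1 for each file while NR keeps counting) in place of gawk's ARGIND, so it runs under any new awk:

```shell
# Invented sample data: a keyword file and a comma-delimited data file.
cat > smallfile.txt <<'EOF'
alice
carol
EOF

cat > bigfile.txt <<'EOF'
alice,42,admin
bob,17,user
carol,99,user
dave,3,guest
EOF

# FNR == NR plays the role of ARGIND == 1: load each smallfile.txt record
# as an array index, then print bigfile.txt records whose first field is
# one of those indices.
awk -F, 'FNR == NR { x[$0]; next } $1 in x' smallfile.txt bigfile.txt > filteredrecords.txt

cat filteredrecords.txt
```

Run under any new awk, this prints the alice and carol records and nothing else.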
Putting all that together, we have

    # make each record in smallfile.txt an index to the array x
    ARGIND == 1 { x[$0]; next }
    # for each record of bigfile.txt, if it's an index to the array x,
    # and was therefore in file smallfile.txt, print it
    $1 in x

I hope this was helpful.

* there are a few significant portability issues with the awk language. The first is that the language was significantly revised in 1987, leading to `old' awk and `new' awk. Almost every system provides `new' awk as its default, but Sun persists in shipping `old' awk. The best advice is to put /usr/xpg4/bin first in your path when working with Solaris, and write a {*filter*} letter to your Sun sales rep. I've also seen this on Dynix, but I use a very old version of Dynix. The other problems are that gawk, the GNU implementation, contains a handful of language extensions, and that POSIX introduced some changes in regular expressions.

All this is to say that by `any' awk, I mean any awk except for `old' awk, which may be the default on your platform for no good reason.
--
Patrick TJ McPhee
East York Canada
Sun, 09 Jan 2005 11:27:43 GMT
Kenny McCormack #17 / 22
... Quote:
> * there are a few significant portability issues with the awk
> language. The first is that the language was significantly
> revised in 1987, leading to `old' awk and `new' awk. Almost
> every system provides `new' awk as its default, but Sun
> persists in shipping `old' awk. The best (**) advice is to put
> /usr/xpg4/bin first in your path when working with Solaris,
> and write a {*filter*} letter to your Sun sales rep. I've also
> seen this on Dynix, but I use a very old version of Dynix.
> The other problems are that gawk, the GNU implementation,
> contains a handful of language extensions, and that POSIX
> introduced some changes in regular expressions.
(**) Better advice is to get (a current version of) GAWK, compile it on all the platforms you will use (consider it as essential as air and water), become familiar and happy with the nifty extensions it provides, and then stop worrying about all this portability nonsense that makes up 90% of the traffic in this NG (1). To me, the whole point of GAWK is to provide a common, cross-platform solution that saves people from playing the Russian Roulette that is the norm when using vendor-supplied AWKs (2).

(1) It occurred to me the other day that we don't discuss much AWK here; rather we continually ponder the existential question: "What is AWK?"

(2) Just to make this explicit, consider that the following 3 platforms probably account for a very large percentage of AWK use (yes, I know, you are going to say "But I use Dynix", well, fine...), presented in no particular order:

a) Linux
b) Solaris
c) DOS and its spawn (i.e., PC OSes)

Now just look at the RR aspects of using "awk" on these platforms. On Linux, you will almost certainly get GAWK, although I know of at least one distro where the default is MAWK, and you have to explicitly install GAWK to get it. On Solaris, you get, well, it's been well documented what you get under Solaris. And, on the PC, you get God knows what from MicroSludge. How much simpler life is if you just use GAWK on all these platforms! It is, after all, the second best implementation of AWK in the world!

The point is, I really don't see why you should contort your code to accommodate crappy installations, when you should always be able to get GAWK for your platform.
Tue, 11 Jan 2005 00:59:50 GMT
Ronnie Yours #18 / 22
Thanks a lot, Patrick, for the extensive explanation. I am a little confused about the ARGIND variable. I tried looking for simple explanations on the web but couldn't find one. Can you or some Unix guru please explain what ARGIND is, with a small simple example if possible? It would greatly help me in understanding the issue in hand.

Thanks a lot for all the help,
Ronnie Yours
Quote:
> [Patrick's explanation, quoted in full; snipped]
Tue, 11 Jan 2005 03:05:59 GMT
Kenny McCormack #19 / 22
Quote:
> Thanks a lot, Patrick, for the extensive explanation.
> I am a little confused about the ARGIND variable. I tried looking
> for simple explanations on the web but couldn't find one.
As Patrick and others have noted, ARGIND is GAWK-specific. So, you have to look in a GAWK manual (or do "man gawk") to find out what it does and how it is used. Both GAWK & TAWK have direct ways of telling which file you are on (and of the two, GAWK's method is preferable); various kludges can be used to get this functionality in other implementations.
Tue, 11 Jan 2005 03:34:00 GMT
Ronnie Yours #20 / 22
Another thing which is confusing me is: how does awk know that it has to check the values in the small file against the values in the FIRST FIELD of the big file, and not the 3rd or the 4th field? I am sorry about constantly bugging you with questions that might be trivial to most people, but I was basically a Windows guy throughout my career and am in the process of transitioning to Unix. The solution, although it works great, is in shorthand, and because of that it's difficult for me to understand it properly. Although I am getting a feel for what's happening, I am still not fully comfortable. Is it possible for someone to post the long form of the solution, where every step is visible to me?

    gawk -F, 'ARGIND == 1 { x[$0]; next } $1 in x' smallfile.txt bigfile.txt > filteredrecords.txt
Also, what's the best way to learn awk and shell scripting in general? Any suggestions for good books / web sites?

Thanks,
Ronnie Yours
Oracle DBA
Quote:
> [Ronnie's question and Patrick's explanation, quoted in full; snipped]
Tue, 11 Jan 2005 04:31:06 GMT
Patrick TJ McPhee #21 / 22
[in a previous message, what is ARGIND?]

We have a command-line

    awk -f a b c d

This reads a script from a file called a, and sets up ARGC and ARGV like this:

    ARGC = 4
    ARGV[0] = "awk" # or something equally useless
    ARGV[1] = "b"
    ARGV[2] = "c"
    ARGV[3] = "d"

Next, it processes all the BEGIN blocks in the script (file a). Having done that, it loops through ARGV from 1 to ARGC, reads each input record from the files it finds there, and applies each of the patterns against each input record. Please note that this is all `roughly speaking', since you can do various things to affect the flow of control.

In standard awk, there's no way of knowing which command line argument is being processed at any time. gawk has a special variable called ARGIND which has the value of the current index into ARGV. You could think of the pattern/action loop as being like this:

    for (ARGIND = 1; ARGIND < ARGC; ARGIND++) {
        # I can sense right now that this next line is not helping to
        # simplify the concept. Sorry.
        while ((getline < ARGV[ARGIND]) > 0) {
            if (pattern1) action1
            if (pattern2) action2
            ...
        }
    }

% Another thing which is confusing me is "How does awk know that it has to
% check the values in the small file against the values in the FIRST FIELD of
% the big file and not the 3rd or the 4th field".

This is the line from the script

    $1 in x

$1 means `the FIRST FIELD' of the record. We know that it's coming from the `big file' because the last statement in the ARGIND == 1 statement block is `next', which says to skip to the next record and start applying patterns over again (it's like a continue in the while loop I wrote above).

% Is it possible for someone to post the long form/hand of the solution where
% every step is visible to me.

The only real short-hand I see here is the omission of { print } in the second pattern/action. I suppose `x' could be called `entryfromsmallfile' to make it explicit what it's doing.
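A tiny standalone run (input invented for illustration) shows both pieces at once: $1 selecting the first comma-separated field, and `next' swallowing a record before any later pattern sees it:

```shell
printf 'skipme,1\nkeep,2\n' | awk -F, '
    $1 == "skipme" { next }      # like the ARGIND == 1 block: stop here
    { print "saw first field:", $1 }
'
```

This prints only `saw first field: keep`; the skipme record never reaches the second pattern/action pair.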
When I do something like this, I tend to read in the keyword file in a BEGIN block. The BEGIN would be something like this:

    BEGIN {
        # load keywords into array
        while ((getline < ARGV[1]) > 0)
            keywords[$1]
        FS = ","
        delete ARGV[1] # this makes the pattern/action loops start at 2, r.s.
    }
    # test for keywords from each file on the command-line, except the first one
    $1 in keywords { print }

% Also whats the best way to learn awk and shell scripting in general. Any
% suggestion for Good Books/ Web sites.

Keep solving real problems. For instance, if there's some processing you need to do that you'd normally do in pl/sql, you could dump it out to a file and write some awk scripts instead.
--
Patrick TJ McPhee
East York Canada
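A runnable version of that BEGIN-block approach, lightly adapted (the file names and data are invented; the keyword file is read into a plain variable rather than through $0, so the load step doesn't depend on FS):

```shell
# Invented sample files.
cat > keys.txt <<'EOF'
alice
carol
EOF

cat > data.csv <<'EOF'
alice,42
bob,17
carol,99
EOF

awk '
BEGIN {
    # load keywords into an array, one whole line per index
    while ((getline line < ARGV[1]) > 0)
        keywords[line]
    close(ARGV[1])
    FS = ","
    delete ARGV[1]   # the main pattern/action loop then skips to the data file
}
$1 in keywords { print }
' keys.txt data.csv
```

This prints the alice and carol records. The point of the design is that the keyword file never passes through the main record loop at all, so no ARGIND (or any per-file test) is needed.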
Tue, 11 Jan 2005 09:48:20 GMT
Dan Haygoo #22 / 22
Quote:
> [Patrick's reply, quoted in full; snipped]
Errata?

Quote:
> it loops through ARGV from 1 to ARGC

s/b: it loops through ARGV from 1 to ARGC-1

Quote:
> there's no way of knowing which...argument is being processed

s/b: there's no built-in way of knowing which...argument is being processed

ARGIND can be faked reasonably well with an initial (expr){action}:

    FNR == 1 { ++ARGIND }

or

    FILENAME != lastFILENAME { ++ARGIND; lastFILENAME = FILENAME }
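The first of those fakes can be checked in any new awk (file names and data invented; a lowercase name is used so it can't collide with gawk's real ARGIND):

```shell
cat > one.txt <<'EOF'
a
b
EOF

cat > two.txt <<'EOF'
c
EOF

# FNR resets to 1 at the start of each input file, so this bumps the
# counter exactly once per file, mimicking ARGIND.
awk 'FNR == 1 { ++argind } { print argind, FILENAME, $0 }' one.txt two.txt
```

This prints `1 one.txt a`, `1 one.txt b`, then `2 two.txt c`. Note that neither fake advances past a completely empty file, since no record is ever read from it — one reason "reasonably well" is the right hedge.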
Tue, 11 Jan 2005 15:14:02 GMT