matching records in a comma delimited file
Ronnie Your #1 / 22
 matching records in a comma delimited file
Hi,

I have a comma-delimited file such as

REG,12619,GITT,LOM,52891
GTU,15679211,GTSP,LOSN,82248
TAP1132,16791,JSAEDJH,SSOPA.,129225
SR,17551,AAFSSSA,89114120,AAU
FEP,22576,SSJ,222,DEP

The above file would have around 8-10 million records. There is another file containing the records I need to filter the above file by. It will have around 600-700 records:

REG
E
SR
SPSDDS
WSH
DSSL

I tried grep -f, as in

grep -f smallfile.txt bigfile.txt > filteredrecords.txt

but the problem is that grep matches occurrences of the string anywhere in the record. I would like the records in the small file to match only the values in the first field of the big file, so I would want the result to be

REG,12619,GITT,LOM,52891
SR,17551,AAFSSSA,89114120,AAU

but grep gives me

REG,12619,GITT,LOM,52891
TAP1132,16791,JSAEDJH,SSOPA.,129225
FEP,22576,SSJ,222,DEP

Any pointers on how I go about doing this?

Thanks
Ronnie
Fri, 07 Jan 2005 21:07:06 GMT
Torsten Kirschne #2 / 22
 matching records in a comma delimited file
Quote:
> I have a comma-delimited file as
> REG,12619,GITT,LOM,52891
> [...]
> The above file would have around 8-10 million records.
> [...]
> I would like the records in the small file to match the values in the
> first field of the big file.
Maybe this helps...

1. sort datafile > data.sorted
2. sort patternfile > pattern.sorted
3. join -1 1 -2 1 -t, data.sorted pattern.sorted

I am somewhat uncertain about whether join works with giant files like yours. Besides join, nawk and perl come to mind.
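For example (an untested sketch, GNU sort/join assumed): join wants both inputs sorted on the join field, so sorting the data file on its first comma-separated field, rather than on the whole line, may be safer:

sort -t, -k1,1 datafile > data.sorted
sort patternfile > pattern.sorted
join -t, -1 1 -2 1 data.sorted pattern.sorted > filteredrecords.txt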
Fri, 07 Jan 2005 21:54:17 GMT
Robert Stearn #3 / 22
 matching records in a comma delimited file
A simple awk script (possibly space consumptive, depending on your awk version):

awk 'NR==FNR { wanted[$1]=1; next }; $1 in wanted { print $0)' \
keyfile datafile > outfile

This is untested code, proceed with caution.

Quote:
> I have a comma-delimited file as
> REG,12619,GITT,LOM,52891
> [...]
> I would like the records in the small file to match the values in the
> first field of the big file.
Fri, 07 Jan 2005 22:09:17 GMT
Dan Haygoo #4 / 22
 matching records in a comma delimited file
Quote:
> I have a comma-delimited file as
> [...]
> I would like the records in the small file to match the values in the
> first field of the big file.
Ronnie -

Look at join if you don't want any fancy processing. If you want to do some nifty statistics counting or field calculation, then awk might be better. Here are a couple of awk solutions.

First, you can read the keys into an array, and then, for each data line, see if the key is in the array. Use FS to achieve the comma-delimiting. This solution is kind of bad, since, for every line, you have to search the whole array of keys to find a match. If you have n data lines and m keys, and awk just happens to store the array indices in a tree, this ends up being an n*log(m) solution.

Second, you can sort the data and the keys, and then walk through them together, requiring neither pre-reading and storage, nor a search through a list, making this an n+m solution (after sorting, of course). This way does give sorted output.

- Dan

join1.awk
----------
BEGIN {
    FS = ","
    keyFile = ARGV[1]
    delete ARGV[1]
    while ( (ok = (getline < keyFile)) > 0 )
        keys[$0]
    close(keyFile)
    if (ok < 0) { print "keyFile error"; exit }
}
$1 in keys { print }

keys
----------
DSSL
E
REG
SPSDDS
SR
WSH

data
----------
FEP,22576,SSJ,222,DEP
GTU,15679211,GTSP,LOSN,82248
REG,12619,GITT,LOM,52891
SR,17551,AAFSSSA,89114120,AAU
TAP1132,16791,JSAEDJH,SSOPA.,129225

(output)
----------
>awk -f join1.awk keys data
REG,12619,GITT,LOM,52891
SR,17551,AAFSSSA,89114120,AAU

join2.awk
----------
BEGIN {
    FS = ","
    keyFile = ARGV[1]
    delete ARGV[1]
}
$1 > key {
    while ( (ok = (getline key < keyFile)) > 0 && key < $1 )
        ;
    if (ok < 0) { print "keyFile error"; exit }
}
$1 == key { print }
END { if (!ok) print "getline error" }

(output) (gawk/Win98)
----------
>sort keys > keys.srt
>type keys.srt
DSSL
E
REG
SPSDDS
SR
WSH
>sort data > data.srt
>type data.srt
FEP,22576,SSJ,222,DEP
GTU,15679211,GTSP,LOSN,82248
REG,12619,GITT,LOM,52891
SR,17551,AAFSSSA,89114120,AAU
TAP1132,16791,JSAEDJH,SSOPA.,129225
>awk -f join2.awk keys.srt data.srt
REG,12619,GITT,LOM,52891
SR,17551,AAFSSSA,89114120,AAU
Fri, 07 Jan 2005 22:15:14 GMT
Kenny McCorma #5 / 22
 matching records in a comma delimited file
Quote:
> I have a comma-delimited file as
> REG,12619,GITT,LOM,52891
> [...]
> The above file would have around 8-10 million records.
OK - let's assume away the usual "parsing CSV" issues (that is, we assume that your file has no embedded commas or quotes or other funny stuff).

Quote:
> There is another file containing the records I need to filter the above
> file by. It will have around 600-700 records:
> REG
> E
> SR
> [...]
> I tried grep -f as
> grep -f smallfile.txt bigfile.txt > filteredrecords.txt
gawk -F, 'ARGIND == 1 { x[$0];next } $1 in x' smallfile.txt bigfile.txt > filteredrecords.txt
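ARGIND is a gawk extension; with a plain awk or nawk, the usual NR==FNR idiom should do the same job (untested here, but same idea):

awk -F, 'NR==FNR { x[$0]; next } $1 in x' smallfile.txt bigfile.txt > filteredrecords.txt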
Fri, 07 Jan 2005 22:32:24 GMT
Dan Haygoo #6 / 22
 matching records in a comma delimited file
Quote:
> A simple awk script (possibly space consumptive, depending on your awk
> version):
> awk 'NR==FNR { wanted[$1]=1; next }; $1 in wanted { print $0)' \
> keyfile datafile > outfile
> This is untested code, proceed with caution.
Ronnie, Robert -

Somewhere in here, you'll want the parameter "-F," to awk, or use 'BEGIN {FS = ","}' in your code. Some other general notes:

- "NR == FNR/next" clauses are often more efficient if done in BEGIN with getline. But it's a good idea to shorten this for the command line!
- For the "in" test to work, awk doesn't require the setting of a value, just the reference, so in this case "wanted[$1]" is equivalent to "wanted[$1]=1".
- $0 is the default arg to "print", so "{print}" is equivalent to "{print $0)" [sic].
- And, even though I used the "{print}" construct in my other post, the default action of a pattern is "{print}", so that can be eliminated.

To save typing, I shortened "wanted" to "k". This tested output (gawk/Win98) uses the data I posted elsewhere in this thread.

- Dan

(output)
----------
>awk -F, "NR==FNR{k[$1]; next} $1 in k" keys data
REG,12619,GITT,LOM,52891
SR,17551,AAFSSSA,89114120,AAU
Fri, 07 Jan 2005 22:40:40 GMT
William Par #7 / 22
 matching records in a comma delimited file
Quote:
> I tried grep -f as
> grep -f smallfile.txt bigfile.txt > filteredrecords.txt
> But the problem is that grep matches the whole record for occurrences
> of the string.
> [...]
Convert the search pattern to ^..., format:

sed 's/.*/^&,/' smallfile.txt > s
grep -f s bigfile.txt > filteredrecords.txt
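For example, with the sample small file, s would come out as

^REG,
^E,
^SR,
^SPSDDS,
^WSH,
^DSSL,

so each pattern can only match at the start of a record, up to and including the first comma.

--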
8-CPU Cluster, Hosting, NAS, Linux, LaTeX, python, vim, mutt, tin
Fri, 07 Jan 2005 22:56:37 GMT
Robert Stearn #8 / 22
 matching records in a comma delimited file
I made an obvious error: I left out ' -F, ' in the awk statement. It should immediately follow the 'awk', without the apostrophes.
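That is, something like (still untested):

awk -F, 'NR==FNR { wanted[$1]=1; next }; $1 in wanted { print $0 }' \
keyfile datafile > outfile

(with the stray ')' closing the print corrected to '}' while we're at it)

Quote: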
> A simple awk script (possibly space consumptive, depending on your awk
> version):
> awk 'NR==FNR { wanted[$1]=1; next }; $1 in wanted { print $0)' \
> keyfile datafile > outfile
> This is untested code, proceed with caution.
Fri, 07 Jan 2005 23:13:24 GMT
Ronnie Your #9 / 22
 matching records in a comma delimited file
I tried the join method and it seems to work fine for the most part. But join doesn't seem to work where the patterns are CHAR, CNPZY.

Let's take this example where the records in the data file are

CWEI,1995,NORAM,SECNRPS, -30.350000, -30.350000,YR
CNPZY,1995,CON,SECPDFNC, 1340.277644, 1340.947783,YR
CHAR,1995,US,UDGUUT, 2500.000000, 2500.000000,YR
HHL,1995,WOUSCON,RRCAPREA, -17714.285714, -17714.285714,YR
COHO,1995,REGROLL,UDNPOGDW, 6.400000, 6.400000,YR

The pattern file will have records as

CWEI
CNPZY
CHAR
HHL
COHO

Now the join

join -1 1 -2 1 -t, data pattern

gives the output as

CWEI,1995,NORAM,SECNRPS, -30.350000, -30.350000,YR
HHL,1995,WOUSCON,RRCAPREA, -17714.285714, -17714.285714,YR
COHO,1995,REGROLL,UDNPOGDW, 6.400000, 6.400000,YR

It has omitted the lines with CHAR and CNPZY as the first field. Is there anything I am missing here?

Also, when I sort

sort bigfile > bigsortedfile

it gives me a warning:

Warning: use of redirected output for large sorts is usually slower than specifying the output file name directly to sort.

Please suggest.

Thanks
Ronnie
Quote:
> > I have a comma-delimited file as
> > [...]
> Maybe this helps...
> 1. sort datafile > data.sorted
> 2. sort patternfile > pattern.sorted
> 3. join -1 1 -2 1 -t, data.sorted pattern.sorted
> I am somewhat uncertain about whether join works with
> giant files like yours.
Sat, 08 Jan 2005 00:27:38 GMT
Kenny McCorma #10 / 22
 matching records in a comma delimited file
Quote:
> I tried the join method and it seems to work fine for the most part.
> But join doesn't seem to work where the patterns are
> CHAR, CNPZY
Why are you doing this the hard way, with sort, join, etc., etc., when I and others have posted AWK solutions? Which NG are you following this in? (I.e., if you weren't interested in AWK solutions, why x-post to comp.lang.awk?)
Sat, 08 Jan 2005 01:11:09 GMT
Ronnie Your #11 / 22
 matching records in a comma delimited file
Hi Kenny,

This solution works great for me:

gawk -F, 'ARGIND == 1 { x[$0];next } $1 in x' smallfile.txt bigfile.txt > filteredrecords.txt

Can you please explain what steps it goes through, so that I understand it better?

Thanks a lot
Ronnie
Quote:
> OK - let's assume away the usual "parsing CSV" issues
> [...]
> gawk -F, 'ARGIND == 1 { x[$0];next } $1 in x' smallfile.txt bigfile.txt > filteredrecords.txt
Sat, 08 Jan 2005 05:03:10 GMT
John W. Krah #12 / 22
 matching records in a comma delimited file
Quote:
> I have a comma-delimited file as
> [...]
> I would like the records in the small file to match the values in the
> first field of the big file.
John -- use Perl; program fulfillment
Sat, 08 Jan 2005 05:27:04 GMT
Dan Haygoo #13 / 22
 matching records in a comma delimited file
Quote:
> But join doesn't seem to work where the patterns are
> CHAR, CNPZY
I imagine your data is messed up, then -- look for embedded control characters with a hex editor. Can you think of any other reasonable explanation why, of 10 million records joined to 700 keys, only a few odd cases would misbehave?

Use one of the many awk solutions, or even the perl solution (which looks suspect to me, because the poster didn't indicate any testing), and see if you get the same results.

- Dan
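For example, a rough check (untested; cat -v and od -c just make non-printing characters visible, and output details vary by system):

grep '^CHAR' data | cat -v
grep '^CNPZY' data | od -c | head

As for the sort warning: it is just suggesting you give sort the output file directly, as in

sort -o bigsortedfile bigfile

instead of redirecting with '>'.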
Quote:
> I tried the join method and it seems to work fine for the most part.
> But join doesn't seem to work where the patterns are CHAR, CNPZY
> [...]
> Also, when I sort
> sort bigfile > bigsortedfile
> it gives me a warning:
> Warning: use of redirected output for large sorts is usually slower
> than specifying the output file name directly to sort.
Sat, 08 Jan 2005 11:11:10 GMT
Ronnie Your #14 / 22
 matching records in a comma delimited file
Hi Guys,

This solution works for me:

gawk -F, 'ARGIND == 1 { x[$0];next } $1 in x' smallfile.txt bigfile.txt > filteredrecords.txt

But I am not sure what exactly it does. I am a newbie, and even though I tried looking it up in tutorials and manuals to understand what exactly is happening in the above statement, I couldn't figure it out. Can you please explain it?

Thanks a lot
Ronnie
Sat, 08 Jan 2005 23:55:51 GMT
Robert Stearn #15 / 22
 matching records in a comma delimited file
I'll try. It appears to use one gawk-specific feature, which I am unfamiliar with, and a lot of shorthand:

Quote:
> Hi Guys,
> This solution works for me:
> gawk -F, 'ARGIND == 1 { x[$0];next } $1 in x' smallfile.txt bigfile.txt > filteredrecords.txt
gawk -F, 'ARGIND == 1 { x[$0];next } $1 in x' smallfile.txt bigfile.txt
     111  22222222222   33333 4444   5555555

1 - set the field separator to ','
2 - if we are processing the first file mentioned in the argument list;
    this is the gawk extension I mentioned; for standard awk use NR==FNR
3 - use the record content as a subscript for the array x; the
    unmentioned value of x[$0] is ""
4 - go to the next record in the input
5 - for records in the second file, if their first field is among the
    subscripts of x, take the default action (elided) of printing the
    record

Hope that helps.
Sun, 09 Jan 2005 00:24:09 GMT
Page 1 of 2  [ 22 posts ]