Newbie help needed - using awk to create a CSV file 
Author Message
 Newbie help needed - using awk to create a CSV file

Hi,

Sorry if this has been dealt with previously but I've had a look
through the posts here and can't see anything which approximates to
what I want to do. I've never used 'awk' for anything over than the
simplest actions before now, but I feel like it ought to be possible
to use 'awk' for the more complicated file processing requirement
detailed below. Unfortunately, I am slowly driving myself insane
trying (and failing) to do it!

I can most easily explain what I'm trying to do by way of an example.
I want to process a file containing data (in this example having 9
fields per record, each field on a separate line, each record followed
by a blank line) like this:

1
19891214
1
m41r.
1
pack
a3
apply three times a day
11

1
19930304
1
e319.
75
mls

5ml spoon, 3 times a day
11

1
19910514
1
e155.
100
mls
s4
5ml spoon, 4 times a day
54

1
19910807
1
la1a.
1
o.p.

apply twice a day
11

etc. etc.

and produce a standard CSV file like this:

1,19891214,1,"m41r.",1,"pack","a3","apply three times a day",11
1,19930304,1,"e319.",75,"mls","","5ml spoon, 3 times a day",11
1,19910514,1,"e155.",100,"mls","s4","5ml spoon, 4 times a day",54
1,19910807,1,"la1a.",1,"o.p.","",apply twice a day",11
etc.etc.

I don't see how I can use multi-line processing because 'awk' wrongly
thinks the empty fields (which do occur in the data, as illustrated in
the second and fourth records above) mark the end of the record.

Things are further complicated by needing to put double quotes, which
aren't in the original file, around all the alphanumeric fields.

The "best" I've been able to achieve is:

1,1,1,"1",1,"1","1","1",1
19891214,19891214,19891214,"19891214",19891214,"19891214","19891214","19891214",19891214
1,1,1,"1",1,"1","1","1",1
m41r.,m41r.,m41r.,"m41r.",m41r.,"m41r.","m41r".,"m41r.",m41r.
etc etc.

which isn't terribly useful! No matter what I do, I can't stop 'awk'
from duplicating each field like this in the output.

I'd use 'C', but I'm not a 'C' programmer. The data files to be
processed will be pretty large (20-200mb), so the processing would
need to be reasonably quick.

Any help would be much appreciated!

Cheers

Chris Wright



Tue, 11 Feb 2003 03:00:00 GMT  
 Newbie help needed - using awk to create a CSV file


Quote:
>Hi,

>Sorry if this has been dealt with previously but I've had a look
>through the posts here and can't see anything which approximates to
>what I want to do. I've never used 'awk' for anything over than the
>simplest actions before now, but I feel like it ought to be possible
>to use 'awk' for the more complicated file processing requirement
>detailed below. Unfortunately, I am slowly driving myself insane
>trying (and failing) to do it!

>I can most easily explain what I'm trying to do by way of an example.
>I want to process a file containing data (in this example having 9
>fields per record, each field on a separate line, each record followed
>by a blank line) like this:

>1
>19891214
>1
>m41r.
>1
>pack
>a3
>apply three times a day
>11

BEGIN { ORS="," }
NR % 9 { print p();next }
{ printf("%s\n",p()) }
function p() { return /^[0-9]+$/ ? $0 : ("\""$0"\"") }


Tue, 11 Feb 2003 03:00:00 GMT  
 Newbie help needed - using awk to create a CSV file
This is my attempt.  I thought Kenny McCormack's use of ORS really neat, and this is longer,
but here goes:
Usage: gawk -f cdata.awk cdata.txt > cdata.out
where cdata.awk contains:
BEGIN {
 do{
   for(i = 1; i <=10; i++){
    if((neof = getline line[i]) == 0) {exit;}
    if(i<10) {printf("\"%s\"%s", line[i], (i<9) ? "," : "\n" );}
   }
 }while(neof)

Quote:
}

Philip


Tue, 11 Feb 2003 03:00:00 GMT  
 Newbie help needed - using awk to create a CSV file

Quote:



>>Hi,

>>Sorry if this has been dealt with previously but I've had a look
>>through the posts here and can't see anything which approximates to
>>what I want to do. I've never used 'awk' for anything over than the
>>simplest actions before now, but I feel like it ought to be possible
>>to use 'awk' for the more complicated file processing requirement
>>detailed below. Unfortunately, I am slowly driving myself insane
>>trying (and failing) to do it!

>>I can most easily explain what I'm trying to do by way of an example.
>>I want to process a file containing data (in this example having 9
>>fields per record, each field on a separate line, each record followed
>>by a blank line) like this:

>>1
>>19891214
>>1
>>m41r.
>>1
>>pack
>>a3
>>apply three times a day
>>11

>BEGIN { ORS="," }
>NR % 9 { print p();next }
>{ printf("%s\n",p()) }
>function p() { return /^[0-9]+$/ ? $0 : ("\""$0"\"") }

Oops - I missed that there was a blank line after each block of 9.
So, it should be:

BEGIN { ORS="," }
NR % 9 { print p();next }
{ printf("%s\n",p());getline;NR-- }
function p() { return /^[0-9]+$/ ? $0 : ("\""$0"\"") }



Tue, 11 Feb 2003 03:00:00 GMT  
 Newbie help needed - using awk to create a CSV file
And another attempt, this time omitting quotation marks around numbers:
Usage: gawk -f cdata.awk cdata.txt > cdata.out
where cdata.awk contains:
BEGIN {
 do{
   for(i = 1; i <=10; i++){
    if((neof = getline line[i]) == 0) {exit;}
    if(i==10) { continue;}
      if(line[i] ~ /^[0-9][0-9]*$/) {
       printf("%s%s", line[i], (i<9) ? "," : "\n" );
      } else {
       printf("\"%s\"%s", line[i], (i<9) ? "," : "\n" );
      }
    }
 }while(neof)

Quote:
}

Philip


Tue, 11 Feb 2003 03:00:00 GMT  
 Newbie help needed - using awk to create a CSV file

Quote:

>BEGIN { ORS="," }
>NR % 9 { print p();next }
>{ printf("%s\n",p()) }
>function p() { return /^[0-9]+$/ ? $0 : ("\""$0"\"") }

Thanks, Kenny. I don't understand it, but I have tried it!
Unfortunately, my implementation of 'awk' (SCO Open Server) complains
that the first '/' after "return" is a syntax error.

Chris W



Tue, 11 Feb 2003 03:00:00 GMT  
 Newbie help needed - using awk to create a CSV file


Quote:


>>BEGIN { ORS="," }
>>NR % 9 { print p();next }
>>{ printf("%s\n",p()) }
>>function p() { return /^[0-9]+$/ ? $0 : ("\""$0"\"") }

>Thanks, Kenny. I don't understand it, but I have tried it!
>Unfortunately, my implementation of 'awk' (SCO Open Server) complains
>that the first '/' after "return" is a syntax error.

Try replacing /.../ with $0 ~ /.../

I knew that part would be implementation defined and problematic.  Bottom line
is: it should work, and, in GAWK, it does.  Remember: "In GAWK, we trust".

The following is a quick test:

gawk 'function foo(){ return /foo/ ? "Yes" : "No" }{print foo()}'



Tue, 11 Feb 2003 03:00:00 GMT  
 Newbie help needed - using awk to create a CSV file
On Fri, 25 Aug 2000 19:33:17 +0100, Philip Smith

Quote:

>And another attempt, this time omitting quotation marks around numbers:
>Usage: gawk -f cdata.awk cdata.txt > cdata.out
>where cdata.awk contains:
>BEGIN {
> do{
>   for(i = 1; i <=10; i++){
>    if((neof = getline line[i]) == 0) {exit;}
>    if(i==10) { continue;}
>      if(line[i] ~ /^[0-9][0-9]*$/) {
>       printf("%s%s", line[i], (i<9) ? "," : "\n" );
>      } else {
>       printf("\"%s\"%s", line[i], (i<9) ? "," : "\n" );
>      }
>    }
> }while(neof)
>}

Great stuff - this works! At least it does if some of the numbers are
altered as follows:

BEGIN {
 do{
   for(i = 1; i <=12; i++){
    if((neof = getline line[i]) == 0) {exit;}
    if(i==12) { continue;}
      if(line[i] ~ /^[0-9][0-9]*$/) {
       printf("%s%s", line[i], (i<11) ? "," : "\n" );
      } else {
       printf("\"%s\"%s", line[i], (i<11) ? "," : "\n" );
      }
    }
 }while(neof)

Quote:
}

*Very* much appreciated, Phil. Can you stick some comments in to
explain what it does? I might as well learn something as well as
picking peoples' brains!

It took 16 minutes to process a 57mb file. I could really do with the
files being processed quicker than that. Is Kenny's script likely to
run quicker if it can be persuaded to work?

Chris W



Tue, 11 Feb 2003 03:00:00 GMT  
 Newbie help needed - using awk to create a CSV file

Quote:

>Try replacing /.../ with $0 ~ /.../

Now we're getting somewhere! It's happy now, although I needed to
change this line:

Quote:
>NR % 9 { print p();next }

to

NR % 12 { print p();next }

to get it putting the newline in the right place. I've just realised
why this is! I took 2 fields per record out of the example I posted
earlier in order to make the CSV examples fit on one line, so these
adjustments I'm having to make are down to me! Doh!

Anyway, Kenny's code prints a null string for the blank line at the
end of each record (there are only 11 actual fields). I want to ignore
the blank line rather than be print it, with the newline coming after
the 11th field in the output.

I'm just running it now to see how long it takes on the 57mb file.
It's just finished whilst I was sorting out the rest of this message
and it took just 6 minutes! Amazing how much difference using a few
less lines of code can make!

Thanks for taking the trouble to go through several iterations with
this, Kenny.

Quote:
>I knew that part would be implementation defined and problematic.  Bottom line
>is: it should work, and, in GAWK, it does.  Remember: "In GAWK, we trust".

Well, SCO doesn't have 'gawk', just 'awk'!

Quote:
>The following is a quick test:
>gawk 'function foo(){ return /foo/ ? "Yes" : "No" }{print foo()}'

It doesn't like that either :-(

Cheers

Chris W



Tue, 11 Feb 2003 03:00:00 GMT  
 Newbie help needed - using awk to create a CSV file


Quote:


>>Try replacing /.../ with $0 ~ /.../

>Now we're getting somewhere! It's happy now, although I needed to
>change this line:

>>NR % 9 { print p();next }

>to

>NR % 12 { print p();next }

>to get it putting the newline in the right place. I've just realised
>why this is! I took 2 fields per record out of the example I posted
>earlier in order to make the CSV examples fit on one line, so these
>adjustments I'm having to make are down to me! Doh!

>Anyway, Kenny's code prints a null string for the blank line at the
>end of each record (there are only 11 actual fields). I want to ignore
>the blank line rather than be print it, with the newline coming after
>the 11th field in the output.

It looks like you didn't see my (almost immediate) followup to that post.

I noticed shortly after posting that I had not accounted for the blank line
between the "records".  So, I posted a correction (basically, all it does is
throw the extra line away, and adjust NR accordingly so that the rest of the
algorithm works).

BTW, if at all possible, you should get and compile a copy of GAWK for
yourself.  You'll be glad you did.



Tue, 11 Feb 2003 03:00:00 GMT  
 Newbie help needed - using awk to create a CSV file

Quote:

>It looks like you didn't see my (almost immediate) followup to that post.

Thanks, Kenny. I missed it first time around but it is here. I'm
having to shut my systems down now due to a thunderstorm!

Chris W



Tue, 11 Feb 2003 03:00:00 GMT  
 Newbie help needed - using awk to create a CSV file

Quote:

> Great stuff - this works! At least it does if some of the numbers are
> altered as follows:
> Can you stick some comments in to explain what it does? I might as well
> learn something as well as picking peoples' brains!   Chris W

Here are my comments on the script:

# Awk scripts can have any number of sections, which are matched up against
each input line. # Also, there is a BEGIN{} section processed first, and and
END{} section processed last. This # script uses only a BEGIN{} section and
quits when it reaches the end of the file, calling {exit;}

Quote:
> BEGIN {

# This loop --  do { ... } while(neof) -- runs until neof (not end of file)
equals zero at the end of
# the data file

Quote:

>  do{

# the for() loop reads in 12 lines using getline and stores the line number
in i (1 to 12)
# and its data in an array element line[i] until the end of the file,
# when getline returns a value of zero which is assigned to neof
# and this is compared to zero and if zero calls exit to leave the for()
loop, still within do..while

Quote:

>    for(i = 1; i <=12; i++){

# get next line and store in array element line[i]

Quote:

>     if((neof = getline line[i]) == 0) {exit;}

# do not print anything for line 12 which is blank, continue for loop
# -- do not replace it by {break;} to leave for loop immediately (untested)
# -- as the last record might then not be printed

Quote:

>     if(i==12) { continue;}

# if the value of the element of matches zero or more digits print first
pattern else print second
# -- change the asterisk to a plus to ensure a match with one or more digits

Quote:

>       if(line[i] ~ /^[0-9][0-9]*$/) {

# printf()  is a formatted print which uses a format string as the first
item with "%s"
# matching each subsequent string (%d for integer, %f for float).
# the ? operator is borrowed from C -- test-condition ? action if true :
action if false
# - gives a comma if line number i is less than 11 or else a special string
indicating newline "\n"
# the second printf() has quotation marks added with backslash escapes so
that they are
# printed and not regarded as string delimiters, used for alphabetic
(non-numeric) lines.

Quote:

>        printf("%s%s", line[i], (i<11) ? "," : "\n" );
>       } else {
>        printf("\"%s\"%s", line[i], (i<11) ? "," : "\n" );
>       }
>     }

# end of for(){} loop

Quote:

>  }while(neof)

# end of do{} loop

Quote:

> }

# end of program - program ends at {exit;} when end of file is reached.

I hope that helps.  Kenny's script is really neat, and should be more
efficient - this
script has a lot of loops and tests.

Philip



Tue, 11 Feb 2003 03:00:00 GMT  
 Newbie help needed - using awk to create a CSV file
On Fri, 25 Aug 2000 22:53:56 +0100, Philip Smith

Quote:

>>     if((neof = getline line[i]) == 0) {exit;}

One of my attempts had getline within a for loop in it but, for some
reason, I found that it didn't get the next line, so it went round the
for loop printing the current line over and over again. I didn't get
the next line read until control returned to the outer while loop.

Quote:
># -- change the asterisk to a plus to ensure a match with one or more digits
>>       if(line[i] ~ /^[0-9][0-9]*$/) {

The only thing I'm not sure of here (and also the equivalent in
Kenny's script) is what "*$" and "+$" signify. My understanding, if
"*$" was left out, is that it would match any pattern which started
with 2 digits. This wouldn't be sufficient as an alphanumeric might
start with 2 digits. I think the "$" is the equivalent of "^" but
applies to the end of the string rather than the beginning. The "*"
must be a wildcard then. Presumably, therefore, it refers to
everything else in the string after the first 2 digits. If this is the
case, why is the "$" necessary? Does the "*" mean to repeat the
previous [0-9][0-9] pattern match for the rest of the string? If so,
does that mean it will only work for pairs of digits (and, maybe, fall
foul if there are an odd number of characters in the string), or will
it only match on the second [0-9] as that was the last pattern match
specified before the "*"? How is "+$" different to "*$"?

Quote:
>>        printf("%s%s", line[i], (i<11) ? "," : "\n" );
>>       } else {
>>        printf("\"%s\"%s", line[i], (i<11) ? "," : "\n" );
>>       }
>>     }

Did you use printf here rather then print because printf allows the
use of the conditional ? operator whereas print wouldn't?

Quote:
>I hope that helps.

Thanks to both yourself and Kenny for taking the trouble (and quickly)
to help me out, and for both being so patient! Thanks to you, Philip,
for the the in-depth explanation of your script.

I think what I need to come to understand is what it is about my
script which caused getline not to read a new line within a for loop,
when it works OK in Philip's script. I'm think I'm too embarrassed to
share my clumsy, ignorant attempt with you given the elegance of the 2
solutions posted here! You must be writing in awk in your sleep! :-)

Thanks again

Chris



Wed, 12 Feb 2003 03:00:00 GMT  
 Newbie help needed - using awk to create a CSV file

Quote:

> ># -- change the asterisk to a plus to ensure a match with one or more digits
> >>       if(line[i] ~  /^[0-9][0-9]*$/) {

> The only thing I'm not sure of here (and also the equivalent in
> Kenny's script) is what "*$" and "+$" signify.

You're fairly close. $ does indeed refer to the end of the string, but the *
character is not a wild card as such, but means that the proceeding regexp
([0-9] in this case) may be matched multiple times. So in effect what the
pattern /^[0-9][0-9]*$/ means, is that the entire string must be composed
of the digits 0-9 from start (^) to end ($).

The + means something very similar, except that it must match the pattern
at least once. Hence the patterns /^[0-9][0-9]*$/ and /^[0-9]+$/ are
equivalent.

Clear (as mud probably)?

Tristan.



Wed, 12 Feb 2003 03:00:00 GMT  
 Newbie help needed - using awk to create a CSV file
On Sat, 26 Aug 2000 12:51:17 +0100, Tristan Quaife

Quote:

>You're fairly close. $ does indeed refer to the end of the string, but the *
>character is not a wild card as such, but means that the proceeding regexp
>([0-9] in this case) may be matched multiple times. So in effect what the
>pattern /^[0-9][0-9]*$/ means, is that the entire string must be composed
>of the digits 0-9 from start (^) to end ($).

>The + means something very similar, except that it must match the pattern
>at least once. Hence the patterns /^[0-9][0-9]*$/ and /^[0-9]+$/ are
>equivalent.

Thanks, Tristan. I understand it much better now.

Cheers

Chris



Sat, 15 Feb 2003 03:00:00 GMT  
 
 [ 15 post ] 

 Relevant Pages 

1. How to handle comma in the fields winthin a csv file using awk

2. Help with creating a CSV file

3. need help reading csv file into lisp

4. awk newbie needs to parse xml file

5. Importing *.csv files with AWK ???

6. newbie needs help with multiple expressions in awk

7. awk newbie needs help

8. AWK newbie is looking for a AWK help with his 1st program

9. creating a CSV file

10. Newbie Needs Help With Creating DLL

11. Q:creating a list from a windows .csv file

12. need help using AWK

 

 
Powered by phpBB® Forum Software