csv-file processing. 
Author Message
 csv-file processing.

Hello,

To process csv files I can define a separator like this in awk:
awk 'BEGIN { FS = "," } { print $2 }'

csv files are separated with a "," which is not the same as the awk-line above:
echo "field-one,field1","field-two,field2" | awk 'BEGIN { FS = "," } { print $2 }'

I was hoping for the result:
field-two,field2

instead I got:
field1

Which is correct if the separator was *only* a comma, but with lots of csv files, the separator is "," ( including the beginning " and ending " )

Now, my question: How can I tell awk to use "," as a separator, so csv files are processed the way I want/expect?

Tia,
Robert de Bock.



Tue, 26 Apr 2005 17:18:06 GMT  
 csv-file processing.

Quote:
> Hello,

> To process csv files I can define a separator like this in awk:
> awk 'BEGIN { FS = "," } { print $2 }'

> csv files are separated with a "," which is not the same as the awk-line above:
> echo "field-one,field1","field-two,field2" | awk 'BEGIN { FS = "," } { print $2 }'

> I was hoping for the result:
> field-two,field2

> instead I got:
> field1

> Which is correct if the separator was *only* a comma, but with lots of csv files, the separator is "," ( including the beginning "
and ending " )

> Now, my question: How can I tell awk to use "," as a separator, so csv files are processed the way I want/expect?

The short answer is that parsing csv files is hard, and if
you do a search in this NG, you will see how to do it.

The longer answer is there is an anomaly that might be
a bug in gnu awk, but this (below) works in Solaris nawk.

You are asking about a simple subset of csv files,
where each field is wrapped in quotes: "a","b","c","d"
So this ought to be a case for simply setting FS appropriately.

FS="\","\" works for all bar the first and last fields.
So to take care of the first one, we can add quote marks
at the start of line to FS: FS="^\"|\",\""
(Note that this increases the number of fields by one, as
there is now an empty field at the start of the line before
the initial quotation mark.)

Now that works in Solaris nawk, but not with gawk.
The reason is that the ^ in FS ties what follows to
the start of the line in nawk but to the start of the field
(separator) in Gnu awk. (Time to rtfm, methinks.)

So if you have Gnu awk, search the NG for a comprehensive
solution, and if you have nawk (or something that works
similarly), then you also need to deal with the final quote:
FS="^\"|\",\"|\"$"

John.



Tue, 26 Apr 2005 19:03:37 GMT  
 csv-file processing.

Quote:

> The longer answer is there is an anomaly that might be
> a bug in gnu awk, but this (below) works in Solaris nawk.

> You are asking about a simple subset of csv files,
> where each field is wrapped in quotes: "a","b","c","d"
> So this ought to be a case for simply setting FS appropriately.

> FS="\","\" works for all bar the first and last fields.
> So to take care of the first one, we can add quote marks
> at the start of line to FS: FS="^\"|\",\""
> (Note that this increases the number of fields by one, as
> there is now an empty field at the start of the line before
> the initial quotation mark.)

> Now that works in Solaris nawk, but not with gawk.
> The reason is that the ^ in FS ties what follows to
> the start of the line in nawk but to the start of the field
> (separator) in Gnu awk. (Time to rtfm, methinks.)

Oops, forgot to add an example that shows the different
behaviour of the two awk implementations. Take the line:
"a","b",""c"","d"
where the fields are: a, b, "c" (including quotes) and d
with nawk, whereas gawk has two (extra) empty fields
delimited by the adjacent quotation marks either side of
the c. (Ignoring empty fields at the start and end of line.)

Which is "correct", I cannot say.

John.



Tue, 26 Apr 2005 19:34:24 GMT  
 csv-file processing.
Hi,

On Fri, 8 Nov 2002 11:34:24 -0000

Quote:


> > The longer answer is there is an anomaly that might be
> > a bug in gnu awk, but this (below) works in Solaris nawk.

> > You are asking about a simple subset of csv files,
> > where each field is wrapped in quotes: "a","b","c","d"
> > So this ought to be a case for simply setting FS appropriately.

> > FS="\","\" works for all bar the first and last fields.
> > So to take care of the first one, we can add quote marks
> > at the start of line to FS: FS="^\"|\",\""
> > (Note that this increases the number of fields by one, as
> > there is now an empty field at the start of the line before
> > the initial quotation mark.)

You brougt me on an idea! (or maybe you made me understood the real issue...) As a solution, I now have this line:

echo "one,1","two,2","three,3" | sed -e 's/^"//;s/"$//' | awk -v d=\",\" 'BEGIN { FS = d ; OFS = FS } { print $2 }'

which substitute the leading and ending quote (") and results in:

---

one,1

two,2

three,3
---

Which is what I am looking for.

Quote:

> > Now that works in Solaris nawk, but not with gawk.
> > The reason is that the ^ in FS ties what follows to
> > the start of the line in nawk but to the start of the field
> > (separator) in Gnu awk. (Time to rtfm, methinks.)

> Oops, forgot to add an example that shows the different
> behaviour of the two awk implementations. Take the line:
> "a","b",""c"","d"
> where the fields are: a, b, "c" (including quotes) and d
> with nawk, whereas gawk has two (extra) empty fields
> delimited by the adjacent quotation marks either side of
> the c. (Ignoring empty fields at the start and end of line.)

This also seems to work with the lines above.

Thanks!

Robert de Bock.



Tue, 26 Apr 2005 20:37:41 GMT  
 csv-file processing.
Hi Tia,

Sorry, it'll be a little bit more complex.

If $1 has symbol "  inside need use regular expression to get everything
between two symbols ". Better not use Awk base fields parsing, but parse
everything inside Awk.

E.g.

  s=$0;

  /* remove everything before first double quote */
 i=match(s,"^[^\"]*\"");

 /* s1 in starts after 1st  */
 s1=substr(s,RSTART+RLENGTH);

 /* find second */
 i=match(s1,"[^\"]*\"");

/* get context between 1st & 2nd */
 field_one=substr(s1,1,RLENGTH-1);

/* next field: */
s=substr(s1,RSTART+RLENGTH);

 /* remove everything before first double quote of second field*/
 i=match(s,"^[^\"]*\"");
 /* s1 in starts after 1st  */
 s1=substr(s,RSTART+RLENGTH);
...

It should be not too complex to write loop around similar Awk fragment to
get all fields one by one. Need just remember that need get rid not only
from " but also from commas Awk regular expressions, function 'match' with
RSTART/RLENGTH are very power .

Regards,
-Joseph



Quote:
> Hello,

> To process csv files I can define a separator like this in awk:
> awk 'BEGIN { FS = "," } { print $2 }'

> csv files are separated with a "," which is not the same as the awk-line
above:
> echo "field-one,field1","field-two,field2" | awk 'BEGIN { FS = "," } {
print $2 }'

> I was hoping for the result:
> field-two,field2

> instead I got:
> field1

> Which is correct if the separator was *only* a comma, but with lots of csv

files, the separator is "," ( including the beginning " and ending " )
Quote:

> Now, my question: How can I tell awk to use "," as a separator, so csv

files are processed the way I want/expect?

- Show quoted text -

Quote:

> Tia,
> Robert de Bock.



Tue, 26 Apr 2005 23:47:13 GMT  
 csv-file processing.

Quote:
> Hello,

> To process csv files I can define a separator like this in awk:
> awk 'BEGIN { FS = "," } { print $2 }'

> csv files are separated with a "," which is not the same as the awk-line above:
> echo "field-one,field1","field-two,field2" | awk 'BEGIN { FS = "," } { print $2 }'

> I was hoping for the result:
> field-two,field2

> instead I got:
> field1

> Which is correct if the separator was *only* a comma, but with lots of csv files, the separator is "," ( including the beginning " and ending " )

> Now, my question: How can I tell awk to use "," as a separator, so csv files are processed the way I want/expect?

If you are interested in parsing CSV files in AWK I suggest you check
out "setcsv" which can be found at:-

http://www.deja.com/=dnc/getdoc.xp?AN=603309980

...it will even parse CSV files produced by Micosoft Excell, which
allows for newlines to be embedded in quoted fields!!

Regards,
   =Adrian=



Fri, 29 Apr 2005 16:40:27 GMT  
 
 [ 6 post ] 

 Relevant Pages 

1. (CSV) text file processing...

2. PEP305 csv package: from csv import csv?

3. PEP305 csv package: from csv import csv?

4. Parsing CSV file outputting desired information into multiple unique files

5. Revisting the CSV Challange -- what is the most eloquent way to parse a CSV

6. Importing .csv files into APL

7. Importing .csv file into APL

8. editing address data in a .csv file...

9. splitting fields in .csv files...

10. How to handle comma in the fields winthin a csv file using awk

11. Parsing CSV files

12. Problem parsing CSV file

 

 
Powered by phpBB® Forum Software