Dealing with commas as a field separator AND possibly within a field 
Author Message
 Dealing with commas as a field separator AND possibly within a field

I need to parse a comma delimited file of names and addresses to import
into our mail server.  I wrote a quick gawk script to do it.  The only
problem is that some of the fields have commas within them, i.e.
"University of Alaska, Anchorage".  Of course, I'm getting bad output on
these entries because I'm using the comma as the field separator.  The
fields that do have the extra commas are enclosed in quotes.
Anyone know an easy way to get around this?  Can I make quotes a
special character somehow?  Am I just looking for an easy way out when
there is none? :-)
I'm afraid I'm going to have to getline the whole file and parse each
line to account for this one case.
--
Jeff
Remove NOSPAM to email me.


Sat, 25 May 2002 03:00:00 GMT  
 Dealing with commas as a field separator AND possibly within a field


Quote:
>I need to parse a comma delimited file of names and addresses to import
>into our mail server.  I wrote a quick gawk script to do it.  The only
>problem is that some of the fields have commas within them, i.e.
>"University of Alaska, Anchorage".  Of course, I'm getting bad output on
>these entries because I'm using the comma as the field separator.  The
>fields that do have the extra commas are enclosed in quotes.
>Anyone know an easy way to get around this?  Can I make quotes a
>special character somehow?  Am I just looking for an easy way out when
>there is none? :-)
>I'm afraid I'm going to have to getline the whole file and parse each
>line to account for this one case.

If all your fields are enclosed in quotes and sepatated by commas,
then just change the separators to something else, such as ###
before piping it to gawk using "###" as the field separator.

Something like this:

sed -e 's/","/"###"/g' infile |gawk 'BEGIN{FS="###"}{print $2, $3, $1}'

for this infile:


it outputs:


If your data is different, i.e. not all fields are quoted, then
the problem is different, and I addressed something similar in one
of my posts to this newsgroup or a comp.unix newsgroup recently.  
Basicly, the idea was to change the commas to something else only
if they followed an even number (zero considered even) of double
quotes.

If you need something like that, followup in this newsgroup.

Chuck Demas
Needham, Mass.

--
  Eat Healthy    |   _ _   | Nothing would be done at all,

  Die Anyway     |    v    | That no one could find fault with it.



Sat, 25 May 2002 03:00:00 GMT  
 Dealing with commas as a field separator AND possibly within a field

Quote:

>I need to parse a comma delimited file of names and
>addresses to import into our mail server.  I wrote a quick
>gawk script to do it.  The only problem is that some of
>the fields have commas within them, . . .
..
> . . . The fields that do have the extra commas are
>enclosed in quotes.

The format is called CSV, though there are several
variations on the theme. First, it's not possible to
construct FS in such a way that it handles this, so you'd
need a function to do this in awk.

CSV file parsing has been dealt with several times in
comp.lang.awk. You can search through Deja.com (as in
http://www.deja.com) for CSV in comp.lang.awk. You'll get
several hits. Concentrate on the most recent postings from
Jim Monty and me.

Which to choose? If you have relatively few double quoted
fields, my approach is faster. If you have relatively many
double quoted fields, Jim Monty's is faster. Both are much
faster than the character-by-character state machines
you'll also find.

* Sent from AltaVista http://www.altavista.com Where you can also find related Web Pages, Images, Audios, Videos, News, and Shopping.  Smart is Beautiful



Sat, 25 May 2002 03:00:00 GMT  
 Dealing with commas as a field separator AND possibly within a field

says...

Chuck,
Thanks for the reply, that will do the trick.  I don't know why I was
overlooking the obvious solution of changing the delimiter in the file.  
I'll have to plead stupidity on this one :-)
I was actually able to save the file with a tab delimiter instead and
parse it.

Jeff

Quote:

> If all your fields are enclosed in quotes and sepatated by commas,
> then just change the separators to something else, such as ###
> before piping it to gawk using "###" as the field separator.

> Something like this:

> sed -e 's/","/"###"/g' infile |gawk 'BEGIN{FS="###"}{print $2, $3, $1}'

> for this infile:


> it outputs:


> If your data is different, i.e. not all fields are quoted, then
> the problem is different, and I addressed something similar in one
> of my posts to this newsgroup or a comp.unix newsgroup recently.  
> Basicly, the idea was to change the commas to something else only
> if they followed an even number (zero considered even) of double
> quotes.

> If you need something like that, followup in this newsgroup.

> Chuck Demas
> Needham, Mass.

> --
>   Eat Healthy    |   _ _   | Nothing would be done at all,

>   Die Anyway     |    v    | That no one could find fault with it.


--
Jeff
Remove NOSPAM to email me.


Sat, 25 May 2002 03:00:00 GMT  
 Dealing with commas as a field separator AND possibly within a field

Quote:

> I need to parse a comma delimited file of names and addresses to import
> into our mail server.  I wrote a quick gawk script to do it.  The only
> problem is that some of the fields have commas within them, i.e.
> "University of Alaska, Anchorage".  Of course, I'm getting bad output on
> these entries because I'm using the comma as the field separator.  The
> fields that do have the extra commas are enclosed in quotes.
> Anyone know an easy way to get around this?  Can I make quotes a
> special character somehow?  Am I just looking for an easy way out when
> there is none? :-)
> I'm afraid I'm going to have to getline the whole file and parse each
> line to account for this one case.
> --
> Jeff
> Remove NOSPAM to email me.

Are there any other things you can take advantage of like:

 - spaces around the delimiting commas ...

file junk:
"a,b" , c , d
e , f , g

awk 'BEGIN{FS=" , "}{print $2}' junk

 - is the number of fields always constant ...

if so you could check the number of fields and work from the end of the
line back using NF-n

Just a couple of suggestions ... a sample of the file would be nice ;-)

Cheers,
Ted



Sat, 25 May 2002 03:00:00 GMT  
 Dealing with commas as a field separator AND possibly within a field


Quote:

>says...

>Chuck,
>Thanks for the reply, that will do the trick.  I don't know why I was
>overlooking the obvious solution of changing the delimiter in the file.
>I'll have to plead stupidity on this one :-)
>I was actually able to save the file with a tab delimiter instead and
>parse it.

Amazingly enough, that's actually the best solution to the whole CSV
problem - assuming you can do it (i.e., that you have control over the data
source and production).

In a former job, we changed from using CSV to BSV, by switching to
using "|" as the delimiter on all of our files.  The standard line
(comment) was that, "When you think about it, comma is just about the
worst possible choice for a delimiter."

No wonder, Microsoft endorses it (*).

(*) If you actually go back and do a deja on the old threads, you will find
much talk about "We must do it this way, because that's the way Microsoft
does it."



Sat, 25 May 2002 03:00:00 GMT  
 Dealing with commas as a field separator AND possibly within a field

..

Quote:
>(*) If you actually go back and do a deja on the old
>threads, you will find much talk about "We must do it this
>way, because that's the way Microsoft does it."

One interpretation. Another interpretation is, "We have to
do it the Microsoft way if it's coming from or going to a
Microsoft application." Not always the case, but when it is
(and it's highly likely on Windows machines) fewer problems
are likely when you adopt/adapt to Microsoft specifications.

Also, much of the Microsoft-or-not argument centered around
how to handle ill-formed records like

1,2,a"bc,"def",3

as well as whether or not to support newlines within quoted
fields - a necessary consideration if data is being written
by spreadsheets or some databases.

Ill-formed records lead to two specification questions in
sequence. Should they generate runtime errors? If not, how
should any record, no matter how ill-formed, be handled?
It's this second question that led to my posts recommending
the Microsoft CSV spec. You can ignore it if you answer
question 1 affirmatively; else, you need to follow some
spec. Since I work on WinTel machines for the most part,
I'm biased toward following Microsoft specs (a bias based
not on love for Microsoft but on the likelihood that either
the input is coming from or output heading to a Microsoft
application). If you have a practical alternative, fine.

BTW, Lotus foisted the original CSV file format on computer
users - as in 'formatted' text files that could be imported
'as numbers' into separate cells (fields) on each row
(record). Lotus spreadsheets could only read such files
directly; they couldn't save worksheets in CSV format
directly. Microsoft's only addition was supporting CSV as a
format in which worksheets could be saved. So blame Lotus
Development Corp for the CSV format.

* Sent from AltaVista http://www.altavista.com Where you can also find related Web Pages, Images, Audios, Videos, News, and Shopping.  Smart is Beautiful



Sat, 25 May 2002 03:00:00 GMT  
 Dealing with commas as a field separator AND possibly within a field


Quote:
> BTW, Lotus foisted the original CSV file format on computer
> users - as in 'formatted' text files that could be imported
> 'as numbers' into separate cells (fields) on each row
> (record). Lotus spreadsheets could only read such files
> directly; they couldn't save worksheets in CSV format
> directly. Microsoft's only addition was supporting CSV as a
> format in which worksheets could be saved. So blame Lotus
> Development Corp for the CSV format.

I stand subject to correction by those more knowledgeable than I, but I
thought CSV preceded Lotus by many years. I have some vague recollections of
using commas to do "comma delimited fields" (a misnomer, CSV is much better)
from some time before Lotus Development Corporation was even formed.

Sam



Sat, 25 May 2002 03:00:00 GMT  
 Dealing with commas as a field separator AND possibly within a field

Quote:



>>BTW, Lotus foisted the original CSV file format on computer
>>users - as in 'formatted' text files that could be imported
>>'as numbers' into separate cells (fields) on each row
>>(record). Lotus spreadsheets could only read such files
>>directly; they couldn't save worksheets in CSV format
>>directly. Microsoft's only addition was supporting CSV as a
>>format in which worksheets could be saved. So blame Lotus
>>Development Corp for the CSV format.

>I stand subject to correction by those more knowledgeable than I, but I
>thought CSV preceded Lotus by many years. I have some vague recollections
of
>using commas to do "comma delimited fields" (a misnomer, CSV is much
better)
>from some time before Lotus Development Corporation was even formed.

I could be wrong too. The Lotus 123 format was commas as field separator,
numbers were unquoted and could be in any %*.*g format, empty fields were
represented by /, *,/, and everything else had to be inside double quotes
(string delimiters) with double quotes inside strings represented by pairs
of double quotes. Maybe SAS used this as a standard I/O format, but it's my
impression unix generally used other field separator characters rather than
comma.


Sat, 25 May 2002 03:00:00 GMT  
 Dealing with commas as a field separator AND possibly within a field


Quote:
> I need to parse a comma delimited file of names and addresses to
import
> into our mail server.  I wrote a quick gawk script to do it.  The
only
> problem is that some of the fields have commas within them, i.e.
> "University of Alaska, Anchorage".  Of course, I'm getting bad output
on
> these entries because I'm using the comma as the field separator.
The
> fields that do have the extra commas are enclosed in quotes.
> Anyone know an easy way to get around this?  Can I make quotes a
> special character somehow?  Am I just looking for an easy way out
when
> there is none? :-)
> I'm afraid I'm going to have to getline the whole file and parse each
> line to account for this one case.

Yes, I think so too. The format of the file you describe is commonly
known as CSV, comma separated values. It's not a format that awk or any
regular expression based tool is very suited to deal with. You can't
construct a regular expression that handles it. Or not all of it
anyway. You need a parser as well. Luckily you are not alone with this
problem. Search the archives for comp.lang.awk and you'll find that
there have been posted full blown solutions. I think it was Harlan that
posted the most comprehensive suggestion. But if you can. Try change
your input format. Maybe you can get it with tab separators instead?

Regards,
/Peter
--
-= Spam safe(?) e-mail address: pez68 at netscape.net =-

Sent via Deja.com http://www.deja.com/
Before you buy.



Sun, 26 May 2002 03:00:00 GMT  
 Dealing with commas as a field separator AND possibly within a field
Thanks for everyone's suggestions, they were most helpful.  I'll check
out some of the archives too as suggested.  I'm working in a Windows
environment right now and haven't used awk in several years and had
forgotten a lot of what I used to know.

One more question:
I'm trying to use the "sub" function to remove some quotes and commas and
it is not working as I expected.

Here's what I tried:

sub(/"/,"",$1)
print $1

The input was:
"University of Alaska, Anchorage"
The output was:
"University of Alaska, Anchora"University of Alaska, Anchorage"

I thought perhaps a null string replacement wasn't valid, but using a
space instead yielded the same results.  Am I trying to do something
invalid?

I'm using gawk 3.0.3 under NT4 btw.

Jeff



Sun, 26 May 2002 03:00:00 GMT  
 Dealing with commas as a field separator AND possibly within a field
Quote:

> Thanks for everyone's suggestions, they were most helpful.  I'll check
> out some of the archives too as suggested.  I'm working in a Windows
> environment right now and haven't used awk in several years and had
> forgotten a lot of what I used to know.

> One more question:
> I'm trying to use the "sub" function to remove some quotes and commas and
> it is not working as I expected.

> Here's what I tried:

> sub(/"/,"",$1)
> print $1

> The input was:
> "University of Alaska, Anchorage"
> The output was:
> "University of Alaska, Anchora"University of Alaska, Anchorage"

it works for me (gawk-3.0.3/IRIX)
in : "University of Alaska, Anchorage"
out: University

perhaps your awk is treating /"/,"" together as a string, which somehow
(no idea how) matches the 'ge"' at the end of your input. it would
mean that before the sub(), $1 is the whole sentence, not just the first word,
is that right?

it might be safer to use sub(/\"/,"",$1) or sub("\"","",$1)
although I think it should not be necessary.

- Show quoted text -

Quote:
> I thought perhaps a null string replacement wasn't valid, but using a
> space instead yielded the same results.  Am I trying to do something
> invalid?

> I'm using gawk 3.0.3 under NT4 btw.

> Jeff



Sun, 26 May 2002 03:00:00 GMT  
 Dealing with commas as a field separator AND possibly within a field

...

Quote:
>> Here's what I tried:

>> sub(/"/,"",$1)
>> print $1

>> The input was:
>> "University of Alaska, Anchorage"
>> The output was:
>> "University of Alaska, Anchora"University of Alaska, Anchorage"
>it works for me (gawk-3.0.3/IRIX)
>in : "University of Alaska, Anchorage"
>out: University

It certainly sounds like a shell-quoting problem - made worse by the fact
that he is on NT (Blech!) and doesn't even have a shell, per se.

Here are a couple of tricks that can help you avoid using quotes on DOS/NT
command lines:
        1) Use \042 for " (in the reg exp for sub())
        2) Use x (where x is an otherwise unused variable name) for ""



Sun, 26 May 2002 03:00:00 GMT  
 Dealing with commas as a field separator AND possibly within a field


Quote:

> ...
> >> Here's what I tried:

> >> sub(/"/,"",$1)
> >> print $1

> >> The input was:
> >> "University of Alaska, Anchorage"
> >> The output was:
> >> "University of Alaska, Anchora"University of Alaska, Anchorage"
> >it works for me (gawk-3.0.3/IRIX)
> >in : "University of Alaska, Anchorage"
> >out: University

> It certainly sounds like a shell-quoting problem - made worse by the fact
> that he is on NT (Blech!) and doesn't even have a shell, per se.

> Here are a couple of tricks that can help you avoid using quotes on DOS/NT
> command lines:
>    1) Use \042 for " (in the reg exp for sub())
>    2) Use x (where x is an otherwise unused variable name) for ""

Actually, I neglected to include a BEGIN FS="\t" in the original posted
code, although that was irrelevent to the problem.  Without the tab FS I
still get:
"University
with every combination I try: /"/ ,  /\034/ , /\"/ or whatever.
  I verified the file wasn't unicode and also tried it on a Win98 machine
too.  I can only conclude there is a bug with the sub function in the
win32 implementation I have.  I downloaded this package of gnu utilities
in win32 binary format some time ago and forgot where I got it from.  

If anyone knows of a compiled win32 awk I could grab somewhere it would
be appreciated.  I know there used to be a gnuish project, but I checked
wuarchive, oakland, and another site or two and don't see the files
around any more.

--
Jeff
Remove NOSPAM to email me.



Sun, 26 May 2002 03:00:00 GMT  
 
 [ 14 post ] 

 Relevant Pages 

1. getting fields NOT comma delimited with commas inside

2. Hexadecimal Field Separator

3. Multiple field separators for paranthesis not working

4. multiple field separators in AWK

5. Field Separator

6. Field Separator larger than one character

7. Field Separator not working

8. Quoted field separators

9. backslash as field separator

10. field separator

11. Problem with Field Separators

12. AWK with 2 field separators

 

 
Powered by phpBB® Forum Software