How can I separate each field in a 'csv' files ? 
Author Message
 How can I separate each field in a 'csv' files ?

I would like to separate lines in a comma-separated csv files into fields.
However, some of the fields in the line are double quoted, and I want to
treat all the things within the double quote as one field. How can I do this
? Some of those double quoted fields also contain commas.
For example,
aaa,bbb,ccc,dddd,"eee, fff, ggg",hhhh
Fields:
1. aaa
2. bbb
3. ccc
4. ddd
5. eee,fff,ggg
6. hhh

I try to set the FS = { ,"}, but it doesn't work. It just separate the words
within the double quote. Would some body give me advice for this ? Thanks.




Thu, 01 Nov 2001 03:00:00 GMT  
 How can I separate each field in a 'csv' files ?
this should work as long as the quoted text has at least one comma, so
it won't handle, for example,

aaa,"bbb",ccc

stasinos

#!/usr/bin/awk -f

{
  # real field counter
  fieldcounter = 0;

  # n is the number of `fields' split finds
  n = split( $0, arr, "," );

  # flag is 1 when inside a quoted field, 0 otherwise
  flag = 0;

  # str accumulates quoted `fields' and is printed
  # when the closing quotes are encountered

  # go through comma-separated `fields'
  for( i=1; i<=n; ++i ) {
    # look for quotes
    if( gsub( "\"", "", arr[i] ) ) {
      # opening quote, initialise str
      if( !flag ) {
        flag = 1;
        str = arr[i];
      }
      # closing quote, print str as well as current `field'
      else {
        flag = 0;
        ++fieldcounter;
        print fieldcounter ". " str "," arr[i];
      }
    }
    else {
      # normal field, just print
      if( !flag ) {
        ++fieldcounter;
        print fieldcounter ". " arr[i];
      }
      # quoted `field', append to str
      else str=str "," arr[i];
    }
  }

  #separate records with an empty line
  print "";

Quote:
}

> I would like to separate lines in a comma-separated csv files into fields.
> However, some of the fields in the line are double quoted, and I want to
> treat all the things within the double quote as one field. How can I do this
> ? Some of those double quoted fields also contain commas.
> For example,
> aaa,bbb,ccc,dddd,"eee, fff, ggg",hhhh
> Fields:
> 1. aaa
> 2. bbb
> 3. ccc
> 4. ddd
> 5. eee,fff,ggg
> 6. hhh
> I try to set the FS = { ,"}, but it doesn't work. It just separate the words
> within the double quote. Would some body give me advice for this ? Thanks.




Fri, 02 Nov 2001 03:00:00 GMT  
 How can I separate each field in a 'csv' files ?

Quote:

>I would like to separate lines in a comma-separated csv files into fields.
>However, some of the fields in the line are double quoted, and I want to
>treat all the things within the double quote as one field. How can I do this
>? Some of those double quoted fields also contain commas.
>For example,
>aaa,bbb,ccc,dddd,"eee, fff, ggg",hhhh
>Fields:
>1. aaa
>2. bbb
>3. ccc
>4. ddd
>5. eee,fff,ggg
>6. hhh

>I try to set the FS = { ,"}, but it doesn't work. It just separate the words
>within the double quote. Would some body give me advice for this ?

This was discussed ad nauseum last Fall, and should be available from the
archives at Deja News (now http://www.deja.com ) under the subject "From a Perl
Writer".

My own contribution was to suggest that you let awk implicitly split records
into fields at each comma, then iterate through all fields in each record. For
each field beginning with an odd number of double quotes, concatenate fields
(inserting commas between them) until the result ends in an odd number of
double quotes.



Fri, 02 Nov 2001 03:00:00 GMT  
 How can I separate each field in a 'csv' files ?
Hello!
This a rather brute force approach, but it solves the problem, as far as I
could test it. The awk script bellow scans each input record char by char,
looking for commas to do the splitting. However, if it finds a double quote
char, it interrups that procedure until it finds the matching double quote
char. No splitting is then allowed inside quoted text, as desired. The
splitted string is stored in the array named "output".
It's quite easy to turn this script into a function that returns the number of
fields and the output array, like the built-in split function.
I'm posting this in spite of HrlnGrv message, because I couldn't find the
archives he refers to. Anyway, I'm pretty sure there are much better ways of
solving this problem, including the one he suggests.
Hope it helps!
Ze amoreira

---------------------------------------------------------------------

# This function returns the n-th character in the input string
# Lookout, no error checking is done!

function char(string,pos)
{
   return substr(string,pos,1)

Quote:
}

    {
        start = 1
        cnt = 0
        input = $0

#Start scaning the input record and split it at commas
        for (pos = 1; pos <= length(input); ++pos)
        {
           cc = char(input,pos)
#          But note that quoted commas should not be used in splitting.
#          If you find a \" do not try to find splitting commas before the next
#               \" char.
           if (cc == "\"")
           {
              restoi = substr(input,pos+1)
              endoq = index(restoi,"\"")
              pos += endoq
           }
#          Now, do your stuff if you find a splitting comma
           else if (cc == ",")
           {
              ++cnt
              output[cnt] = substr(input,start,pos-start)
              start = pos+2
           }
        }
#       Don't forget the last field, the one that probably doesn't end with a
comma:
        if (char(input,length(input)) !=",")
        {
           ++cnt
           output[cnt] = substr(input,start,length(input)-start+1)
        }
    }
----------------------------------------------------------------------

Quote:

> I would like to separate lines in a comma-separated csv files into fields.
> However, some of the fields in the line are double quoted, and I want to
> treat all the things within the double quote as one field. How can I do this
> ? Some of those double quoted fields also contain commas.
> For example,
> aaa,bbb,ccc,dddd,"eee, fff, ggg",hhhh
> Fields:
> 1. aaa
> 2. bbb
> 3. ccc
> 4. ddd
> 5. eee,fff,ggg
> 6. hhh

> I try to set the FS = { ,"}, but it doesn't work. It just separate the words
> within the double quote. Would some body give me advice for this ? Thanks.





Fri, 02 Nov 2001 03:00:00 GMT  
 How can I separate each field in a 'csv' files ?

writes:

Quote:
>This a rather brute force approach, but it solves the problem, as far as I
>could test it. The awk script bellow scans each input record char by char,
>looking for commas to do the splitting. However, if it finds a double quote
>char, it interrups that procedure until it finds the matching double quote
>char. No splitting is then allowed inside quoted text, as desired. The
>splitted string is stored in the array named "output".

Your code doesn't handle the situation of embedded double quotes inside quoted
strings. Given the text:

'He said, "What the hell?"'

This could be rendered into a double quoted field as

"He said, \"What the hell?\"" (unix-like)

or

"He said, ""What the hell?""" (DOS/Windows-like)

This may be very rare, but the source of the CSV file may require that embedded
double quotes be handled properly. Only the original poster knows for sure.

Quote:
>I'm posting this in spite of HrlnGrv message, because I couldn't find the
>archives he refers to.

Here's a url to a more recent posting of mine on this subject. Going through
records char by char in awk is painful and inefficient. Awk is going to break
each record into fields anyway, so you might as well let awk do some of the
work for you.

http://x34.deja.com/[ST_rn=ps]/getdoc.xp?AN=469017025&CONTEXT=926998656.11
0034948&hitnum=0



Sat, 03 Nov 2001 03:00:00 GMT  
 How can I separate each field in a 'csv' files ?
I recently ran into this and simply used sed to pull out all commas in
between double quotes and then piped the output into awk.  I don't
have the code on this computer, but I think it looked something like:

sed -e '/"*"/s/,//g' <filename> | awk

Quote:
>I would like to separate lines in a comma-separated csv files into fields.
>However, some of the fields in the line are double quoted, and I want to
>treat all the things within the double quote as one field. How can I do this
>? Some of those double quoted fields also contain commas.
>For example,
>aaa,bbb,ccc,dddd,"eee, fff, ggg",hhhh
>Fields:
>1. aaa
>2. bbb
>3. ccc
>4. ddd
>5. eee,fff,ggg
>6. hhh

>I try to set the FS = { ,"}, but it doesn't work. It just separate the words
>within the double quote. Would some body give me advice for this ? Thanks.





Sun, 04 Nov 2001 03:00:00 GMT  
 
 [ 6 post ] 

 Relevant Pages 

1. Using CGI module with 'canned queries'

2. Excel users: here's a proc to convert XLS files to CSV format (correction)

3. Excel users: here's a proc to convert XLS files to CSV format

4. It's not bad canned meat...

5. It's not bad canned meat...

6. It's not bad canned meat...

7. It's not bad canned meat...

8. stream('file','c','seek ='x) problem

9. splitting fields in .csv files...

10. How to handle comma in the fields winthin a csv file using awk

11. Memo field problem in csv file

12. Declaring Btrieve file structure with OVER'ed key fields

 

 
Powered by phpBB® Forum Software