Parsing a CSV file containing quoted strings (themselves containing commas) 

Hello,

I have a file with lines like this :
1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf", "klkmmlk",
"jiuiieziure, ioiooii"

As you can see the fields are separated by commas. I simply need to extract
some columns to make another file, which is one of the most usual tasks for
awk. However I face two problems :
- Some columns are strings surrounded by speech marks. I want to get rid of
them.
- Some of these strings may contain commas, which confuse awk because it
takes them as separators without knowing that they are part of the field. I
didn't manage to make awk understand that in this case a comma must not be
used as a separator.
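To make the problem concrete (the values here are invented for the
demonstration), this is what a naive comma separator does to a quoted field:

```shell
# A quoted field containing a comma is cut in two by a plain comma FS
printf '%s\n' '1, 45, "abc,def", 78' |
awk -F',' '{ print "NF=" NF "; $3=" $3 "; $4=" $4 }'
# prints: NF=5; $3= "abc; $4=def"
```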

The only solution I found is the following :
- Parse the file a first time with awk, character by character, finding
myself the commas in strings and the speech marks to eliminate them (I don't
need the speech marks, nor the commas in the strings). I keep all the
other characters.
- Parse the result again to grab the columns I want.

The first parsing takes a lot of time because I must go character by
character...

Is there a more elegant way to do what I need ? Maybe awk cannot do it
better. In this case I can switch to another tool if you have a suggestion.

Best regards.



Fri, 13 Sep 2002 03:00:00 GMT  
Awk cannot do it?!?!  NEVER!

Run this command:    awk -f ref.awk input.dat | awk -f ref2.awk > output.dat

ref.awk reformats the file with | as the field separator; ref2.awk is just a
sample showing how to handle the new field separator via FS, so you can
write your own awk to grab the columns you want.

It should run reasonably fast.
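For instance, once ref.awk has turned a line into |-separated fields,
grabbing columns is one-liner work (the column numbers below are just an
example):

```shell
# Pick columns 1 and 8 from an already reformatted line
printf '%s\n' '1|45|789|41|popo|klrpkretp|rrtt|klklkllk, djkdjf|klkmmlk|jiuiieziure, ioiooii' |
awk -F'|' '{ print $1, $8 }'
# prints: 1 klklkllk, djkdjf
```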

Robert L.

======start of ref.awk======
#
# Assumption: records do NOT contain | character!!!!
# Assumption: quoted fields contain either 0 or 1 commas
#
# take line apart and build table of fields
{gsub(/,/,"|",$0)               # change commas to bars
 gsub(/ \|/,"|",$0)             # get rid of spaces before and after bars
 gsub(/\| /,"|",$0)
 n = split($0,pad,"|")          # split the line into table pad[]
 for (x = 1; x <= n; x++) {     # rebuild and deal with embedded commas
        rq = substr(pad[x],1,1)
        q = gsub(/\"/,"",pad[x])
        if (q == 1 && rq == "\"") {
                gsub(/\"/,"",pad[x+1])
                pad[x] = pad[x] ", " pad[x+1]
                pad[x+1] = "" }
        }
}

# rebuild the line with the new field separator and print it
{ rec = ""
 for (x = 1; x <= n; x++) {
        if (rec == "") {
                rec = pad[x] }
        else {
                if (pad[x] != "") rec = rec "|" pad[x] }        # skip null fields
        }
 print rec
}

======end  of ref.awk======

======start of ref2.awk======
BEGIN { FS="|" }

{ print "Record: " NR
 print "Field  1: " $1
 print "Field  2: " $2
 print "Field  3: " $3
 print "Field  4: " $4
 print "Field  5: " $5
 print "Field  6: " $6
 print "Field  7: " $7
 print "Field  8: " $8
 print "Field  9: " $9
 print "Field 10: " $10 }
======end  of ref2.awk======

======start of input.dat======
1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf",
"klkmmlk","jiuiieziure, ioiooii"
1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf",
"klkmmlk","jiuiieziure, ioiooii"
1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf",
"klkmmlk","jiuiieziure, ioiooii"
1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf",
"klkmmlk","jiuiieziure, ioiooii"
======end  of input.dat======

======start of my output.dat==
Record: 1
Field  1: 1
Field  2: 45
Field  3: 789
Field  4: 41
Field  5: popo
Field  6: klrpkretp
Field  7: rrtt
Field  8: klklkllk, djkdjf
Field  9: klkmmlk
Field 10: jiuiieziure, ioiooii
Record: 2
Field  1: 1
Field  2: 45
Field  3: 789
Field  4: 41
Field  5: popo
Field  6: klrpkretp
Field  7: rrtt
Field  8: klklkllk, djkdjf
Field  9: klkmmlk
Field 10: jiuiieziure, ioiooii
Record: 3
Field  1: 1
Field  2: 45
Field  3: 789
Field  4: 41
Field  5: popo
Field  6: klrpkretp
Field  7: rrtt
Field  8: klklkllk, djkdjf
Field  9: klkmmlk
Field 10: jiuiieziure, ioiooii
Record: 4
Field  1: 1
Field  2: 45
Field  3: 789
Field  4: 41
Field  5: popo
Field  6: klrpkretp
Field  7: rrtt
Field  8: klklkllk, djkdjf
Field  9: klkmmlk
Field 10: jiuiieziure, ioiooii
======end  of my output.dat==


Quote:
> I have a file with lines like this :
> 1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf", "klkmmlk",
> "jiuiieziure, ioiooii"
> [snip]



Sat, 14 Sep 2002 03:00:00 GMT  
Quote:

> I have a file with lines like this :
> 1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf", "klkmmlk",
> "jiuiieziure, ioiooii"
> [snip]

As someone said, the comma must be the worst possible field separator to
choose. Anyway, you should search the archives for "Parsing CSV" and
you'll find lots and lots of input. A few months ago we took this through
several rounds, and here's my suggestion from that thread:

#!/usr/bin/awk -f
BEGIN { FS=SUBSEP; OFS="|" }

{
  result = setcsv($0, ",")
  if (result) print
}

# setcsv(str, sep) - parse CSV (MS specification) input
# str, the string to be parsed. (Most likely $0.)
# sep, the separator between the values.
#
# After a call to setcsv the parsed fields are found in $1 to $NF.
# setcsv returns 1 on success and 0 on failure.
#
# By Peter Strömberg aka PEZ.
# Based on setcsv by Adrian Davis. Modified to handle a separator
# of choice and embedded newlines. The basic approach is to take the
# burden off of the regular expression matching by replacing ambiguous
# characters with characters unlikely to be found in the input. For
# this the character "\035" is used.
#
# Note 1. Prior to calling setcsv you must set FS to a character which
#         can never be found in the input. (Consider SUBSEP.)
# Note 2. If setcsv can't find the closing double quote for the string
#         in str it will consume the next line of input by calling
#         getline and call itself until it finds the closing double
#         quote or no more input is available (considered a failure).
# Note 3. Only the "" representation of a literal quote is supported.
# Note 4. setcsv will probably misbehave if sep, used as a regular
#         expression, can match anything other than what a call to
#         index() would match.
#
function setcsv(str, sep, i) {
  gsub(/""/, "\035", str)
  gsub(sep, FS, str)

  while (match(str, /"[^"]*"/)) {
    middle = substr(str, RSTART+1, RLENGTH-2)
    gsub(FS, sep, middle)
    str = sprintf("%.*s%s%s", RSTART-1, str, middle,
      substr(str, RSTART+RLENGTH))
  }

  if (index(str, "\"")) {
    return ((getline) > 0) ? setcsv(str (RT != "" ? RT : RS) $0, sep) : !setcsv(str "\"", sep)
  } else {
    gsub(/\035/, "\"", str)
    $0 = str

    for (i = 1; i <= NF; i++)
      if (match($i, /^"+$/))
        $i = substr($i, 2)

    $1 = $1 ""
    return 1
  }

}

Deja will probably mangle the indents of the script. Use = in VIM or the
equivalent in whatever editor you are using to fix it.
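The core replace-and-restore trick can also be tried stand-alone. This is a
condensed sketch of the same idea, not the full parser, and the sample line
is invented:

```shell
# Hide commas behind a temporary separator, then restore the real ones
# inside each quoted run before splitting.
printf '%s\n' '1, "a,b", 2, "c"' |
awk 'BEGIN { FS = SUBSEP; OFS = "|" }
{
  str = $0
  gsub(/,/, FS, str)                # every comma -> temporary separator
  while (match(str, /"[^"]*"/)) {   # for each quoted run...
    mid = substr(str, RSTART + 1, RLENGTH - 2)
    gsub(FS, ",", mid)              # ...put the real commas back
    str = substr(str, 1, RSTART - 1) mid substr(str, RSTART + RLENGTH)
  }
  $0 = str                          # re-split on the temporary separator
  $1 = $1                          # force a rebuild with OFS
  print
}'
# prints: 1| a,b| 2| c
```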

Regards,
/Peter
--
-= Spam safe(?) e-mail address: pez68 at netscape.net =-




Sat, 14 Sep 2002 03:00:00 GMT  

Quote:

>I have a file with lines like this :
>1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf", "klkmmlk",
>"jiuiieziure, ioiooii"
>[snip]

If you assume that each line will have complete quoted strings, then with
" as the FS the even-numbered fields would be the ones within quotes, so to
replace the commas in such fields with a space one could do the following:

awk -F'"' '/"/ {for(i=2;i<=NF;i+=2){gsub(/,/, " ", $i)}}{print}' infile

The above replaces the double quotes and the commas with a space.

To eliminate the double quotes, make the OFS="" like this:

awk -F'"' 'BEGIN{OFS=""}
           /"/ {for(i=2;i<=NF;i+=2){gsub(/,/, " ", $i)}}{print}' infile

for your infile:

1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf", "klkmmlk",
"jiuiieziure, ioiooii"

this last script produces:

1, 45, 789, 41, popo, klrpkretp, rrtt, klklkllk djkdjf, klkmmlk,
jiuiieziure  ioiooii

The assumption of matched double quotes on each and every line might not be
valid for your data, but if it is, this approach would work.

Some CSV files can be screwed up and have badly quoted strings.

You might want to check if there are any lines without matched quotes
first, something like this should tell you how many lines have
unmatched double quotes:

sed -e 's/[^"]//g;s/""//g' infile | grep '"' | wc -l

You could check with awk too.  Perhaps something more complex:

awk '/"/{a=$0;gsub(/[^"]/,"",a); if(length(a)%2==1){cnt++;
     print cnt ". Record", NR, "had unmatched quotes:"; print}}
     END{if(cnt==0){print "No unmatched double quotes"}
         if(cnt>1){print "There were",cnt,"problem lines"}}' infile
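If it helps, here is the same checker run on a throw-away two-line sample
(the data is invented), one balanced line and one with an unmatched quote:

```shell
# Feed one balanced and one unbalanced line through the quote counter
printf '%s\n' 'good, "ok"' 'bad, "oops' |
awk '/"/{a=$0;gsub(/[^"]/,"",a); if(length(a)%2==1){cnt++;
     print cnt ". Record", NR, "had unmatched quotes:"; print}}
     END{if(cnt==0){print "No unmatched double quotes"}}'
# prints: 1. Record 2 had unmatched quotes:
#         bad, "oops
```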

Chuck Demas
Needham, Mass.



Sat, 14 Sep 2002 03:00:00 GMT  
 