Regular Expressions in Rexx to crack a comma delimited file 

Patrick
I started another thread on this topic so it's not buried so deep.  I'm
excited by your contribution on RE's and Rexx.  This will be a great
contribution to the language.
I hope you take my comments as constructive, not as criticism.

I currently use comma-delimited (.csv) files to communicate back and
forth between my programs and Excel.  Hence I don't see every possible
permutation of a comma-delimited file.
There will generally be:
1. text strings potentially with spaces and commas.  ex: "1,2-Dichlorobenzene QA Flag"
2. numbers separated by commas.  ex: 1.0235,22.3,4.5
3. blank fields specified by just commas (even at the end). ex: 1.0235,,4.5,

There are other possibilities that I don't generally deal with, such as:
4. A quoted field within quotes.  ex: "this "code " is silly"
5. Rexx's ability to use single and double quotes.  ex: 'this "code" isn'"'"'t silly'
And finally, I'm not deeply into the Unix world, hence I don't have to mess with:
6. Dealing with the escape "\" symbol.  ex:  "this is \" a quote symbol"

So, getting down to business
Can one use a RE in your new code to crack the string with seven fields
(note the last one is blank):

string = ' Chemical:, "1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'

I've tried:
1. fpat.1 is the RE proposed by an AWK book I have; it does not work in the
code below.
2. fpat.2 is the RE proposed by the same AWK book for embedded quotes; it
doesn't work either.
and finally
3. fpat.3, which is what you proposed in the last thread.
Now fpat.3 works great on the hard stuff but fails on the seemingly easy
part.
It fails to send out blank fields.  From the defined string it finds:

#1 Chemical:
#2 1,2-Dichlorobenzene Value
#3 1,2-Dichlorobenzene QA Flag
#4 1.2345
it misses three fields: the two blank fields before the 1.2345 and the one
at the end.

Blank fields are an important issue for me because my Excel sheets are rarely
fully populated.

--- regina code:

trace '?r'
rcc = rxfuncadd('reloadfuncs', 'rexxre', 'reloadfuncs')
if rcc then fail('load' rxfuncerrmsg())
call reloadfuncs
fpat.1 = ReComp('"[^"]*"|[^,"]*', 'x')
fpat.2 = ReComp('"([^"]|"")*"|[^,"]*', 'x')
fpat.3 = ReComp('"[^"]*"|[^,"]+', 'x')
string = ' Chemical:,"1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'
matched = ReParse(fpat.3, string, 'vt', 'FIELDS')
  say fields.0 'fields'
  do i = 1 to fields.0
    say "#"i fields.i
    end

REX


% I downloaded Regina and then your rexxre.dll.

Nothing to do with your problem, but since you're using object rexx, I'd
appreciate it if you'd download rexx/trans and the rexx/trans version of
the library, since I have no idea whether it really works with object
rexx....

% trace '?'r
% rcc = rxfuncadd('reloadfuncs', 'rexxre', 'reloadfuncs')
% if rcc then fail('load' rxfuncerrmsg())
% call reloadfuncs
% fpat = ReComp('"[^"]*"|[^,"]*', 'x')
%   somestring = 'alpha,"bravo, which has a comma and ""quotes""", charlie'
%   matched = ReParse(fpat, somestring, 'vt', 'FIELDS')
%    say fields.0 'fields'
%   do i = 1 to fields.0
%     say fields.i
%     end
%
% when I run the program it just freezes .. the trace shows that the code gets
% to

It doesn't `just freeze' -- it's burning up as much CPU as it possibly can.

[...]

% Any ideas where I'm going wrong.

You followed my advice and tried it? The problem is that the regular
expression can match a zero-length string, and ReParse will not advance
in that case. It does something like this:

   match against 'alpha,"bravo, which has a comma and ""quotes""", charlie'
   fields.1 <- 'alpha'
   match against ',"bravo, which has a comma and ""quotes""", charlie'
   fields.2 <- ''
   match against ',"bravo, which has a comma and ""quotes""", charlie'
   fields.3 <- ''
   ...

If you change the second * to +, you get

   match against 'alpha,"bravo, which has a comma and ""quotes""", charlie'
   fields.1 <- 'alpha'
   match against ',"bravo, which has a comma and ""quotes""", charlie'
   fields.2 <- '"bravo, which has a comma and "'
   match against '"quotes""", charlie'
   fields.3 <- '"quotes"'
   match against '"", charlie'
   fields.4 <- '""'
   match against ', charlie'
   fields.5 <- ' charlie'

I'm not sure exactly what to do in this case. Obviously, looping continuously
is not acceptable, but on the other hand, I'd like to be able to say

  ref = reComp(re_for_field)
  red = reComp(re_for_delimiter)
  reParse(ref, somestring, 'v', 'fieldvar1', 'somestring')
  reParse(red, somestring, 'v', 'delimvar1', 'somestring')
  reParse(ref, somestring, 'v', 'fieldvar2', 'somestring')
  reParse(red, somestring, 'v', 'delimvar2', 'somestring')
  ...

so I can't always blindly advance over the delimiter (whatever it is)
when matching against fields. I probably have to in the case when I'm
populating stems, though.

I'm going to think about this a bit. It's not a test case I would have
come up with on my own, but it's important.
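
The stall described above is easy to reproduce outside RexxRE. What follows is a minimal Python sketch, not the library's code: `naive_fields` and its `max_steps` cap are hypothetical names, and the loop assumes only the "match, then cut off everything up to the end of the match" behaviour described in this post. Python's `re` picks the first alternative that matches rather than the longest match, but for these two patterns the results coincide.

```python
import re

def naive_fields(pattern, text, max_steps=10):
    """Match, record the match, drop everything up to its end, repeat.

    max_steps is a safety cap (an assumption added for this sketch) so that
    a pattern which matches the empty string cannot spin forever.
    """
    fields = []
    rest = text
    for _ in range(max_steps):
        m = re.search(pattern, rest)
        if m is None:
            break
        fields.append(m.group(0))
        # If the match is empty and sits at offset 0, rest does not shrink.
        rest = rest[m.end():]
        if not rest:
            break
    return fields

somestring = 'alpha,"bravo, which has a comma and ""quotes""", charlie'

# With * the second search matches '' in front of the comma and never moves on.
print(naive_fields(r'"[^"]*"|[^,"]*', somestring, max_steps=5))

# With + every match consumes at least one character, so the loop ends.
print(naive_fields(r'"[^"]*"|[^,"]+', somestring))
```

The first call returns 'alpha' followed by empty strings until the cap is hit; the second returns the five fields traced above.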
--

Patrick TJ McPhee
East York  Canada



Wed, 19 Oct 2005 01:02:34 GMT  
This is my latest effort in comparing RE in ORexx, Awk and Regina.  There
are most likely better RE's, and obviously better coding can be done.
The Awk string, when declared within the program, needs the backslash
escape symbol to allow double quotes to be expressed as part of the string.

REX

Code and output follow:
------------------------------------------
#Awk
BEGIN{
string = " Chemical:,\"1,2-Dichlorobenzene Value\",\"1,2-Dichlorobenzene QA Flag\",,,1.2345,"
fpat = /"[^"]*"|[^,"]*/
splitp(string,x,fpat)
for (i in x) print "#" i , x[i]
}

Awk Output:  Gold standard for interpreting RE (?)...all fields accounted for.
#1  Chemical:
#2 "1,2-Dichlorobenzene Value"
#3 "1,2-Dichlorobenzene QA Flag"
#4
#5
#6 1.2345
#7

----------------------------------------
/*ORexx*/
csv = .array~of()
fpat = .RegularExpression~new('"[^"]*"|[^,]*')
str=' Chemical:,"1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'
i = 0
do while str~length() \= 0
   start = fpat~pos(str)
   end   = fpat~position
   i=i+1
   csv[i] = str~substr(start,end)
   say "#"i   csv[i]
   str = str~substr(end+2)
   end
::requires "rxregexp.cls"

Output:  Note that #7 at end is missing (mostly not a critical issue)...
the code seems labored.
#1  Chemical:
#2 "1,2-Dichlorobenzene Value"
#3 "1,2-Dichlorobenzene QA Flag"
#4
#5
#6 1.2345
--------------------------------------------
/*Regina*/
rcc = rxfuncadd('reloadfuncs', 'rexxre', 'reloadfuncs')
if rcc then fail('load' rxfuncerrmsg())
call reloadfuncs
fpat = ReComp('"[^"]*"|[^,"]+', 'x')
string = ' Chemical:,"1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'
matched = ReParse(fpat, string, 'vt', 'FIELDS')
  do i = 1 to fields.0
    say "#"i fields.i
    end

Output: As before, the blank fields are missing.  Note the RE is different from
the ORexx and Awk ones, and maybe this is not the best choice for Regina.
#1  Chemical:
#2 "1,2-Dichlorobenzene Value"
#3 "1,2-Dichlorobenzene QA Flag"
#4 1.2345



Wed, 19 Oct 2005 02:02:04 GMT  

% I started another thread on this topic so it's not buried so deep.  I'm
% excited by your contribution on RE's and
% Rexx.  This will be a great contribution to the language.

Thanks. I'd like to think so, too, but this little problem does
highlight the limitations that need to be overcome.

% string = ' Chemical:, "1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'

Given this string and fpat.1, plus my initial fix to the looping problem,
I get
  1  Chemical:
  2  <zero-length>
  3  <one space>
  4  "1,2-Dichlorobenzene Value"
  5  <zero-length>
  6  "1,2-Dichlorobenzene QA Flag"
  7  <zero-length>
  8  <zero-length>
  9  <zero-length>
  10 1.2345
  11 <zero-length>

What makes this slightly worse is that 11 is not matching the zero-length
field at the end. It's a zero-length match just before the comma at the end.
Similarly, 2, 5 and 7 are mis-matches. 3 is a valid match based on the
specified pattern, but not desirable.

Question: with tawk, do you just have to set FPAT, or do you have to
set FPAT and FS? If the latter, the solution would be to pass two REs
to the parse operation, one to match fields, and one to match delimiters
(currently, you have a choice of whether your one RE is supposed to match
fields (which I called v for values) or delimiters).
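
The two-RE idea can be tried out with a rough Python sketch. This is not RexxRE code and says nothing about how tawk treats FPAT and FS; `split_alternating` is a hypothetical helper that matches one field, then one delimiter, and repeats, using the field and delimiter patterns that appear in this thread.

```python
import re

FIELD = re.compile(r'"[^"]*"|[^,"]*')   # a field: quoted text, or a run without , or "
DELIM = re.compile(r' *, *')            # a delimiter: a comma with optional spaces

def split_alternating(text):
    """Alternate between the field RE and the delimiter RE.

    FIELD can match the empty string, so blank fields show up as ''.
    Progress is guaranteed because every delimiter match consumes a comma.
    """
    fields = []
    pos = 0
    while True:
        f = FIELD.match(text, pos)      # always succeeds, possibly with ''
        fields.append(f.group(0))
        d = DELIM.match(text, f.end())
        if d is None:                   # no delimiter follows: stop here
            break
        pos = d.end()
    return fields

line = ' Chemical:, "1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'
for number, value in enumerate(split_alternating(line), start=1):
    print(number, repr(value))
```

On the sample line this yields seven fields, including the blank ones. Note that the sketch stops quietly at the first place where no delimiter follows a field (for example an unbalanced quote), so anything after that point is dropped.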

Another solution which is more verbose, and closely tied to the number
of fields, but more reliable and probably faster is:

  fieldre = '("[^"]*"|[^,"]*)'
  delimre = ' *, *'
  re = fieldre
  /* 6 is one less than the number of fields expected */
  do 6
    re = re || delimre || fieldre
    end

  /* reComp that if it will be used repeatedly ... */
  if reParse(re, string, 'stx', 'fields') then
    do i = 1 to fields.0
       say i fields.i
       end

which gives

 1  Chemical:
 2 "1,2-Dichlorobenzene Value"
 3 "1,2-Dichlorobenzene QA Flag"
 4
 5
 6 1.2345
 7

which is what you want. That's got to have all those people who love
rexx because it's so readable watering at the mouth. Quick explanation
for people who are trying to follow all this:

  fieldre = '("[^"]*"|[^,"]*)'

This is an `extended' regular expression which matches either
 1 a quote-mark followed by zero or more characters which are
   not quote marks, followed by another quote mark, or
 2 zero or more characters which are neither commas nor quote marks.

The parentheses cause this to be marked as a sub-expression. Among
other fabulous features, the part of the input matching a
sub-expression can be returned from a reParse or reExec call,
or the position of the match can be returned from an reExec call.
Incidentally, the parentheses are what makes this an extended
RE. The same thing could be expressed as a `basic' RE with

  fieldre = '\("[^"]*"|[^,"]*\)'

If you're used to awk, you'll be using extended regular expressions.

  delimre = ' *, *'

This is either a basic or extended RE which matches zero or
more spaces, a comma, and zero or more spaces.

  re = fieldre
  /* 6 is one less than the number of fields expected */
  do 6
    re = re || delimre || fieldre
    end

This builds up a string that has seven sub-expressions, each of which
matches a field value, separated by six delimiters.

  /* reComp that if it will be used repeatedly ... */
  if reParse(re, string, 'stx', 'fields') then

This evaluates the RE against string and assigns the sub-expression matches
to the contents of a stem called fields. The third argument is a set of
flags: s says to match sub-expressions to variables, t says the variable
name should be treated as a sTem rather than a normal variable, and x says
the regular expression is an extended RE.
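
For comparison, the same construction can be sketched in Python. This is only an illustration of building one RE with a sub-expression per field; `parse_fixed` is a hypothetical name, and using `re.fullmatch` to require that the whole line is consumed is a choice made for this sketch, not something the reParse call above is known to do.

```python
import re

FIELD_RE = r'("[^"]*"|[^,"]*)'   # one parenthesised sub-expression per field
DELIM_RE = r' *, *'              # comma with optional surrounding spaces

def parse_fixed(text, nfields):
    """Build field(delim field)* with exactly nfields groups and apply it.

    Returns the list of sub-expression matches, or None when the line
    does not have exactly nfields fields.
    """
    pattern = FIELD_RE + (DELIM_RE + FIELD_RE) * (nfields - 1)
    m = re.fullmatch(pattern, text)
    return list(m.groups()) if m else None

line = ' Chemical:, "1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'
fields = parse_fixed(line, 7)
for number, value in enumerate(fields, start=1):
    print(number, value)
```

As in the Rexx version, the pattern is tied to the number of fields: asking for a different count on the same line gives no match at all rather than a partial result.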
--

Patrick TJ McPhee
East York  Canada



Wed, 19 Oct 2005 02:51:38 GMT  

% This is my latest effort in comparing RE in ORexx, Awk and Regina.  There

One point -- the awk code is not strictly awk. splitp() is available
only in one interpreter and is not part of the standard (which is not
to say it's not useful).

% /*ORexx*/

You could do something like this with RexxRE as well:

  /* fpat = .RegularExpression~new('"[^"]*"|[^,]*') */
  fpat = ReComp('"[^"]*"|[^,"]*', 'x')
  fdel = ReComp(' *, *')

  str = ' Chemical:, "1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'

  i = 0
  do while length(str) > 0
    if ReParse(fpat, str, 'v', 'var', 'str') then do
      i = i + 1
      csv.i = var
      /* strip the comma & spaces */
      call ReParse fdel, str, 'v', 'del', 'str'
      end
    else
      leave
    end

   csv.0 = i

    do i = 1 to csv.0  
      say i '?'csv.i'?'
      end

% Output:  Note that # 7 at end is missing (mostly not a critical issue)...
% code seems labored.
% #1  Chemical:
% #2 "1,2-Dichlorobenzene Value"
% #3 "1,2-Dichlorobenzene QA Flag"
% #4
% #5
% #6 1.2345

The same problems exist, but could be coded around. You can also do
something much more similar to the orexx code

  fpat = ReComp('"[^"]*"|[^,"]*', 'x')

  str = ' Chemical:,"1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'

  i = 0
  do while length(str) > 0
    if ReExec(fpat, str, 'vars', 'p') then do
      i = i + 1
      parse var vars.!match start len
      csv.i = substr(str, start, len)
      /* skip over that field */
      str = substr(str, start+len+1)
      end
    else
      leave
    end

  csv.0 = i

    do i = 1 to csv.0  
      say i '?'csv.i'?'
      end

Although to get the same results, I needed to delete the space
following Chemical:,
--

Patrick TJ McPhee
East York  Canada



Wed, 19 Oct 2005 03:21:49 GMT  
Patrick
You asked:

Quote:
> Question: with tawk, do you just have to set FPAT, or do you have to
> set FPAT and FS? If the latter, the solution would be to pass two REs
> to the parse operation, one to match fields, and one to match delimiters
> (currently, you have a choice of whether your one RE is supposed to match
> fields (which I called v for values) or delimiters).

In Tawk, the manual recommends setting FS = 0 when FPAT is used.  This
probably implies that both can be active at the same time (?).  This is
something that I haven't considered.
The normally active FS is what the manual calls "whitespace", i.e. any
number of spaces or tabs between fields is considered a single delimiter,
much the same as Rexx's Parse command.  So far I haven't been setting FS
to 0 when I have run my little Awk programs, so I can't say for sure if FS
is active or not. I'll have to experiment!

I'll have to think about what you are saying next...but the results look
very promising.  Now will the two types of RE be hierarchical, in which one
is interpreted first?  I am thinking that if the first RE part is satisfied,
then the RE moves on down the string to continue parsing. If not, then the
next part of the RE is attempted.  I can see extreme power in such a
"Select-When" type of scheme.

If I'm not mistaken, Rexx's Parse syntax does not allow for an FPAT type of
selection, just on what the FS or delimiter is. Or does it?  Rexx's Parse is
so powerful and adaptable, I hesitate to make any strong statement of what
it can or can't do.

Quote:
> Another solution which is more verbose, and closely tied to the number
> of fields, but more reliable and probably faster is:

>   fieldre = '("[^"]*"|[^,"]*)'
>   delimre = ' *, *'
>   re = fieldre
>   /* 6 is one less than the number of fields expected */
>   do 6
>     re = re || delimre || fieldre
>     end

>   /* reComp that if it will be used repeatedly ... */
>   if reParse(re, string, 'stx', 'fields') then
>     do i = 1 to fields.0
>        say i fields.i
>        end

> which gives

>  1  Chemical:
>  2 "1,2-Dichlorobenzene Value"
>  3 "1,2-Dichlorobenzene QA Flag"
>  4
>  5
>  6 1.2345
>  7

Now I have to go keep 2 thirteen-year-olds (no school today) from getting
into trouble, which would then get me into trouble with the Boss..

REX



Wed, 19 Oct 2005 03:49:06 GMT  

% In Tawk, the manual recommends that when FPAT is used to set FS = 0.  This
% probably
% implies that both can be active at the same time (?).  This is something
% that I haven't considered.

I think I've come up with my solution. The problem was this: my parse
routine can match fields three ways -- the RE can match the field delimiter,
in which case it does something like this:
  match the first delimiter
  assign everything to the left to field 1
  delete up to the end of the match from the search string
  match the second delimiter
  etc

or it can match the field itself:

  match the first field
  assign the match to field 1
  delete up to the end of the match from the search string
  match the second field
  etc

both of these fall apart when the RE can match a zero-length string,
since RE matching works on the basis of `first, longest match'. That is,
when you start matching from the end of the field, it will either match
a new field starting right after the first field, or it will match a
zero-length string starting right after the first field. What I need to
do is ensure that the end-of-match pointer moves between two matches,
and ignore cases where that doesn't happen.

(The third way of matching is to map parenthesised sub-expressions to
fields, where this problem can't occur).
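
One way to read the "end-of-match pointer must move" rule is sketched below in Python. It is an interpretation, not the RexxRE implementation: `split_fields` and `last_end` are hypothetical names, and the sketch covers only the field-matching mode, ignoring any match whose end equals the end of the previous accepted match.

```python
import re

FIELD = re.compile(r'"[^"]*"|[^,"]*')

def split_fields(text, pattern=FIELD):
    """Collect field matches, ignoring matches that make no progress."""
    fields = []
    pos = 0
    last_end = -1                 # end offset of the last accepted match
    while pos <= len(text):
        m = pattern.search(text, pos)
        if m is None:
            break
        if m.end() == last_end:
            # Zero-length match sitting right after the previous match:
            # the pointer has not moved, so ignore it and step one character.
            pos += 1
            continue
        fields.append(m.group(0))
        last_end = m.end()
        pos = m.end()
    return fields

line = ' Chemical:,"1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'
for number, value in enumerate(split_fields(line), start=1):
    print(number, repr(value))
```

With the sample line (no space after the first comma) this gives seven fields, blanks included. A space after a comma would still come out as a separate one-space field, as in the earlier listing, because the pattern itself allows it.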

% I'll have to think about what your are saying next...but the results look
% very promising.  Now will
% the two types of RE be hierarchical in which one is interpreted first?

I don't think I understand the question. If you refer to `basic' vs `extended'
regular expressions, they're slightly different syntaxes specified by posix,
and you have to tell the library which one you're typing. The `x' flag
in ReComp, ReExec, and ReParse does this.

% I am
% thinking that if the first RE part
% is satisfied, then the RE moves on down the string to continue parsing. If
% not, then the next part of the
% RE is attempted.  I can see extreme power in such an "Select-When" type of
% scheme.

I really don't understand the question. This seems to be about parenthesised
sub-expressions. Suppose I have this RE:

  ' *([a-z]+) +([a-z]+)'

That will match zero or more spaces, one or more letters, one or more
spaces, and one or more letters. If the input string doesn't have a
substring of this nature, the RE won't match. If it does, the first
group of letters is field 1, and the second group is field 2 (in the
case of ReParse).
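
The mapping from sub-expressions to fields looks like this in a short Python sketch (an illustration using Python's `re`, not ReParse; `two_words` is a hypothetical helper):

```python
import re

def two_words(text):
    """Return the two letter groups as (field1, field2), or None if no match."""
    m = re.search(r' *([a-z]+) +([a-z]+)', text)
    return m.groups() if m else None

print(two_words('42 hello   world!'))   # the two groups of letters
print(two_words('single'))              # no second group of letters: no match
```

The first call finds the substring ' hello   world' and returns its two parenthesised parts; the second has no substring of the required shape, so nothing is assigned.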

% If I'm not mistaken, Rexx's Parse syntax does not allow for an FPAT type of
% selection, just on what
% the FS or delimiter is. Or does it?

[talking about the real rexx parse instruction]
You can specify only delimiters and positions within the string being parsed.

--

Patrick TJ McPhee
East York  Canada



Thu, 20 Oct 2005 04:19:31 GMT  
 