Regular Expressions in Rexx to crack a comma delimited file
T-Re #1 / 6
Patrick,

I started another thread on this topic so it's not buried so deep. I'm excited by your contribution on REs and Rexx. This will be a great contribution to the language. I hope you take my comments as constructive, not as criticism.

I currently use comma-delimited (.csv) files to communicate back and forth between my programs and Excel. Hence I don't see every possible permutation of a comma-delimited file. There will generally be:

1. Text strings, potentially with spaces and commas. ex: "1,2-Dichlorobenzene QA Flag"
2. Numbers separated by commas. ex: 1.0235,22.3,4.5
3. Blank fields specified by just commas (even at the end). ex: 1.0235,,4.5,

While there are other possibilities, I don't generally deal with them. Such as:

4. A quoted field within quotes. ex: "this "code " is silly"
5. Rexx's ability to use single and double quotes. ex: 'this "code" isn'"'"'t silly'

And finally, I'm not deeply into the Unix world, hence I don't have to mess with

6. Dealing with the escape "\" symbol. ex: "this is \" a quote symbol"

So, getting down to business. Can one use an RE in your new code to crack the string with seven fields (note the last one is blank):

  string = ' Chemical:, "1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'

I've tried:

1. fpat.1 is the RE proposed by an AWK book I have; it does not work in the code below.
2. fpat.2 is the RE proposed by the same AWK book for embedded quotes; it doesn't work either.
3. fpat.3, which is what you proposed in the last thread.

Now fpat.3 works great on the hard stuff but fails on the seemingly easy part. It fails to send out blank fields. From the defined string it finds:

  #1 Chemical:
  #2 1,2-Dichlorobenzene Value
  #3 1,2-Dichlorobenzene QA Flag
  #4 1.2345

It misses three fields: the two blank fields before the 1.2345 and the one at the end. Blank fields are an important issue for me because my Excel sheets are rarely fully populated.
--- regina code:

  trace '?r'
  /* load the RexxRE library */
  rcc = rxfuncadd('reloadfuncs', 'rexxre', 'reloadfuncs')
  if rcc then fail('load' rxfuncerrmsg())
  call reloadfuncs
  /* the three candidate field patterns */
  fpat.1 = ReComp('"[^"]*"|[^,"]*', 'x')
  fpat.2 = ReComp('"([^"]|"")*"|[^,"]*', 'x')
  fpat.3 = ReComp('"[^"]*"|[^,"]+', 'x')
  string = ' Chemical:,"1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'
  /* parse the string defined above into the FIELDS stem */
  matched = ReParse(fpat.3, string, 'vt', 'FIELDS')
  say fields.0 'fields'
  do i = 1 to fields.0
    say "#"i fields.i
  end

REX
% I downloaded Regina and then your rexxre.dll.

Nothing to do with your problem, but since you're using object rexx, I'd appreciate it if you'd download rexx/trans and the rexx/trans version of the library, since I have no idea whether it really works with object rexx....

% trace '?'r
% rcc = rxfuncadd('reloadfuncs', 'rexxre', 'reloadfuncs')
% if rcc then fail('load' rxfuncerrmsg())
% call reloadfuncs
% fpat = ReComp('"[^"]*"|[^,"]*', 'x')
% somestring = 'alpha,"bravo, which has a comma and ""quotes""", charlie'
% matched = ReParse(fpat, somestring, 'vt', 'FIELDS')
% say fields.0 'fields'
% do i = 1 to fields.0
% say fields.i
% end
%
% when I run the program it just freezes .. the trace shows that the code gets
% to

It doesn't `just freeze' -- it's burning up as much CPU as it possibly can.

[...]

% Any ideas where I'm going wrong.

You followed my advice and tried it? The problem is that the regular expression can match a zero-length string, and ReParse will not advance in that case. It does something like this:

  match against 'alpha,"bravo, which has a comma and ""quotes""", charlie'
  fields.1 <- 'alpha'
  match against ',"bravo, which has a comma and ""quotes""", charlie'
  fields.2 <- ''
  match against ',"bravo, which has a comma and ""quotes""", charlie'
  fields.3 <- ''
  ...

If you change the second * to +, you get

  match against 'alpha,"bravo, which has a comma and ""quotes""", charlie'
  fields.1 <- 'alpha'
  match against ',"bravo, which has a comma and ""quotes""", charlie'
  fields.2 <- '"bravo, which has a comma and"'
  match against '"quotes""", charlie'
  fields.3 <- '"quotes"'
  match against '"", charlie'
  fields.4 <- '""'
  match against ', charlie'
  fields.5 <- ' charlie'

I'm not sure exactly what to do in this case.
Obviously, looping continuously is not acceptable, but on the other hand, I'd like to be able to say

  ref = reComp(re_for_field)
  red = reComp(re_for_delimiter)
  reParse(ref, somestring, 'v', 'fieldvar1', 'somestring')
  reParse(red, somestring, 'v', 'delimvar1', 'somestring')
  reParse(ref, somestring, 'v', 'fieldvar2', 'somestring')
  reParse(red, somestring, 'v', 'delimvar2', 'somestring')
  ...

so I can't always blindly advance over the delimiter (whatever it is) when matching against fields. I probably have to in the case when I'm populating stems, though.

I'm going to think about this a bit. It's not a test case I would have come up with on my own, but it's important.

--
Patrick TJ McPhee
East York Canada
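The stall described above can be reproduced outside RexxRE. What follows is a minimal Python sketch, not the RexxRE implementation: Python's `re` module stands in for the library, and `naive_fields` and its `max_steps` cap are hypothetical names, added only so the demonstration terminates instead of burning CPU.

```python
import re

# Field pattern from the post: a quoted run, or zero or more characters
# that are neither commas nor quote marks.
FIELD = re.compile(r'"[^"]*"|[^,"]*')

def naive_fields(text, max_steps=6):
    """Re-match at the end of the previous match, never skipping a delimiter.

    max_steps is a hypothetical safety cap; without it the loop would not end.
    """
    fields = []
    pos = 0
    for _ in range(max_steps):
        m = FIELD.match(text, pos)   # always succeeds: the pattern can match ''
        fields.append(m.group())
        pos = m.end()                # a zero-length match leaves pos unchanged
    return fields, pos

fields, pos = naive_fields('alpha,"bravo, which has a comma and ""quotes""", charlie')
print(fields)  # ['alpha', '', '', '', '', '']
print(pos)     # 5, i.e. still sitting on the first comma
```

After `alpha`, every further match is the empty string in front of the first comma, so the position never changes. That is the same pattern as the `fields.2 <- ''`, `fields.3 <- ''` trace.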
Wed, 19 Oct 2005 01:02:34 GMT
 |
T-Re #2 / 6
This is my latest effort in comparing REs in ORexx, Awk and Regina. There are most likely better REs, and obviously better coding can be done. The Awk string, when declared within the program, needs the backslash escape symbol to allow double quotes to be expressed as part of the string.

REX

Code and output follow:
------------------------------------------
#Awk
BEGIN{
  string = " Chemical:,\"1,2-Dichlorobenzene Value\",\"1,2-Dichlorobenzene QA Flag\",,,1.2345,"
  fpat = /"[^"]*"|[^,"]*/
  splitp(string,x,fpat)
  for (i in x) print "#" i , x[i]
}
Awk Output: Gold standard for interpreting RE (?)...all fields accounted for.

#1 Chemical:
#2 "1,2-Dichlorobenzene Value"
#3 "1,2-Dichlorobenzene QA Flag"
#4
#5
#6 1.2345
#7
----------------------------------------
/*ORexx*/
csv = .array~of()
fpat = .RegularExpression~new('"[^"]*"|[^,]*')
str=' Chemical:,"1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'
i = 0
do while str~length() \= 0
  start = fpat~pos(str)
  end = fpat~position
  i=i+1
  csv[i] = str~substr(start,end)
  say "#"i csv[i]
  str = str~substr(end+2)
end
::requires "rxregexp.cls"

Output: Note that #7 at the end is missing (mostly not a critical issue)... code seems labored.

#1 Chemical:
#2 "1,2-Dichlorobenzene Value"
#3 "1,2-Dichlorobenzene QA Flag"
#4
#5
#6 1.2345
--------------------------------------------
/*Regina*/
rcc = rxfuncadd('reloadfuncs', 'rexxre', 'reloadfuncs')
if rcc then fail('load' rxfuncerrmsg())
call reloadfuncs
fpat = ReComp('"[^"]*"|[^,"]+', 'x')
string = ' Chemical:,"1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'
matched = ReParse(fpat, string, 'vt', 'FIELDS')
do i = 1 to fields.0
  say "#"i fields.i
end

Output: As before, the blank fields are missing. Note the RE is different from the one used for ORexx and Awk, and maybe this is not the best choice for Regina.

#1 Chemical:
#2 "1,2-Dichlorobenzene Value"
#3 "1,2-Dichlorobenzene QA Flag"
#4 1.2345
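For comparison with one more engine, the two field patterns used above can be run through Python's `re.findall`. This is only an illustration, on the assumption that Python's `re` is an acceptable stand-in. It is neither Awk's splitp() nor RexxRE's ReParse, and the variable names are made up for the example.

```python
import re

line = ' Chemical:,"1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'

# Pattern used in the Awk run: the second alternative may match nothing.
star = re.findall(r'"[^"]*"|[^,"]*', line)

# Pattern used in the Regina run: the second alternative needs at least one character.
plus = re.findall(r'"[^"]*"|[^,"]+', line)

print(len(plus), plus)   # 4 non-empty fields
print(len(star), star)   # 11 matches, 7 of them zero-length
```

The `+` pattern gives the same four fields as the Regina run. The `*` pattern returns those four fields plus seven zero-length matches (one at each of the six commas and one at the end of the string), so in this engine neither pattern alone yields exactly seven fields.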
Wed, 19 Oct 2005 02:02:04 GMT
 |
Patrick TJ McPh #3 / 6
% I started another thread on this topic so it's not buried so deep. I'm
% excited by your contribution on RE's and
% Rexx. This will be a great contribution to the language.

Thanks. I'd like to think so, too, but this little problem does highlight the limitations that need to be overcome.

% string = ' Chemical:, "1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA
% Flag",,,1.2345,'

Given this string and fpat.1, plus my initial fix to the looping problem, I get

   1 Chemical:
   2 <zero-length>
   3 <one space>
   4 "1,2-Dichlorobenzene Value"
   5 <zero-length>
   6 "1,2-Dichlorobenzene QA Flag"
   7 <zero-length>
   8 <zero-length>
   9 <zero-length>
  10 1.2345
  11 <zero-length>

What makes this slightly worse, 11 is not matching the zero-length field at the end. It's a zero-length match just before the comma at the end. Similarly, 2, 5 and 7 are mis-matches. 3 is a valid match based on the specified pattern, but not desirable.

Question: with tawk, do you just have to set FPAT, or do you have to set FPAT and FS? If the latter, the solution would be to pass two REs to the parse operation, one to match fields, and one to match delimiters (currently, you have a choice of whether your one RE is supposed to match fields (which I called v for values) or delimiters).

Another solution, which is more verbose and closely tied to the number of fields, but more reliable and probably faster, is:

  fieldre = '("[^"]*"|[^,"]*)'
  delimre = ' *, *'
  re = fieldre
  /* 6 is one less than the number of fields expected */
  do 6
    re = re || delimre || fieldre
  end
  /* reComp that if it will be used repeatedly ... */
  if reParse(re, string, 'stx', 'fields') then
    do i = 1 to fields.0
      say i fields.i
    end

which gives

  1 Chemical:
  2 "1,2-Dichlorobenzene Value"
  3 "1,2-Dichlorobenzene QA Flag"
  4
  5
  6 1.2345
  7

which is what you want. That's got to have all those people who love rexx because it's so readable watering at the mouth.
Quick explanation for people who are trying to follow all this:

  fieldre = '("[^"]*"|[^,"]*)'

This is an `extended' regular expression which matches either
  1 a quote mark followed by zero or more characters which are not quote marks, followed by another quote mark, or
  2 zero or more characters which are neither commas nor quote marks.

The parentheses cause this to be marked as a sub-expression. Among other fabulous features, the part of the input matching a sub-expression can be returned from a reParse or reExec call, or the position of the match can be returned from an reExec call.

Incidentally, the unescaped parentheses and the | alternation are what make this an extended RE. A `basic' RE writes the grouping as \( and \), and posix basic REs have no alternation operator (some implementations accept \| as an extension), so this pattern is best kept as an extended RE. If you're used to awk, you'll be using extended regular expressions.

  delimre = ' *, *'

This is either a basic or extended RE which matches zero or more spaces, a comma, and zero or more spaces.

  re = fieldre
  /* 6 is one less than the number of fields expected */
  do 6
    re = re || delimre || fieldre
  end

This builds up a string that has seven sub-expressions, each of which matches a field value, separated by six delimiters.

  /* reComp that if it will be used repeatedly ... */
  if reParse(re, string, 'stx', 'fields') then

This evaluates the RE against string and assigns the sub-expression matches to the contents of a stem called fields. The third argument is a set of flags -- s says to match sub-expressions to variables. t says the variable name should be treated as a sTem rather than a normal variable. x says the regular expression is an extended re.

--
Patrick TJ McPhee
East York Canada
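The fixed-field-count construction explained above can be tried in another engine as well. The sketch below is a Python rendering under stated assumptions: Python's `re` stands in for RexxRE, `fixed_field_pattern` is a hypothetical helper name, and `fullmatch` is used so that the whole line has to fit the seven-field shape.

```python
import re

FIELD = r'("[^"]*"|[^,"]*)'   # one parenthesised sub-expression per field
DELIM = r' *, *'              # optional spaces, a comma, optional spaces

def fixed_field_pattern(n_fields):
    """Build FIELD (DELIM FIELD){n_fields - 1} as one compiled pattern."""
    return re.compile(FIELD + (DELIM + FIELD) * (n_fields - 1))

line = ' Chemical:, "1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'
m = fixed_field_pattern(7).fullmatch(line)
fields = list(m.groups()) if m else []

for i, field in enumerate(fields, start=1):
    print(i, field)
```

Because every field is pinned between explicit delimiters, a zero-length sub-expression can only be a genuinely blank field. The price is the one already noted: the field count is built into the pattern, so a line with a different number of fields simply fails to match.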
Wed, 19 Oct 2005 02:51:38 GMT
 |
Patrick TJ McPh #4 / 6
% This is my latest effort in comparing RE in ORexx, Awk and Regina. There

One point -- the awk code is not strictly awk. splitp() is available only in one interpreter and is not part of the standard (which is not to say it's not useful).

% /*ORexx*/

You could do something like this with RexxRE as well:

  /* fpat = .RegularExpression~new('"[^"]*"|[^,]*') */
  fpat = ReComp('"[^"]*"|[^,"]*', 'x')
  fdel = ReComp(' *, *')
  str = ' Chemical:, "1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'
  i = 0
  do while length(str) > 0
    if ReParse(fpat, str, 'v', 'var', 'str') then do
      i = i + 1
      csv.i = var
      /* strip the comma & spaces */
      call ReParse fdel, str, 'v', 'del', 'str'
    end
    else leave
  end
  csv.0 = i
  do i = 1 to csv.0
    say i '?'csv.i'?'
  end

% Output: Note that # 7 at end is missing (mostly not a critical issue)...
% code seems labored.
% #1 Chemical:
% #2 "1,2-Dichlorobenzene Value"
% #3 "1,2-Dichlorobenzene QA Flag"
% #4
% #5
% #6 1.2345

The same problems exist, but could be coded around. You can also do something much more similar to the orexx code:

  fpat = ReComp('"[^"]*"|[^,"]*', 'x')
  str = ' Chemical:,"1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'
  i = 0
  do while length(str) > 0
    if ReExec(fpat, str, 'vars', 'p') then do
      i = i + 1
      parse var vars.!match start len
      /* the third argument of substr is a length, not an end position */
      csv.i = substr(str, start, len)
      /* skip over that field */
      str = substr(str, start+len+1)
    end
    else leave
  end
  csv.0 = i
  do i = 1 to csv.0
    say i '?'csv.i'?'
  end

Although to get the same results, I needed to delete the space following Chemical:,

--
Patrick TJ McPhee
East York Canada
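The field-then-delimiter alternation in the first loop above can be sketched in Python as follows. This is an illustration rather than the RexxRE code: `split_fields` is a hypothetical name, Python's `re` stands in for ReComp/ReParse, and the loop is driven by whether a delimiter was found instead of by the length of the remaining string. That last choice is a deliberate departure from the original, made so that a trailing blank field is kept.

```python
import re

FIELD = re.compile(r'"[^"]*"|[^,"]*')   # may match the empty string
DELIM = re.compile(r' *, *')            # always consumes at least the comma

def split_fields(line):
    """Alternate field and delimiter matches, anchored at the current position."""
    fields = []
    pos = 0
    while True:
        m = FIELD.match(line, pos)      # never None: the pattern can match ''
        fields.append(m.group())
        pos = m.end()
        d = DELIM.match(line, pos)
        if d is None:                   # no delimiter here: stop (end of line or junk)
            break
        pos = d.end()                   # progress is guaranteed by the comma
    return fields

line = ' Chemical:, "1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'
for i, field in enumerate(split_fields(line), start=1):
    print(i, field)
```

Since every pass either stops or consumes a comma, a zero-length field match cannot stall the loop, and the blank field after the final comma is reported as field 7. Anything left over after a failed delimiter match (for example a stray quote) is silently dropped in this sketch.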
Wed, 19 Oct 2005 03:21:49 GMT
 |
T-Re #5 / 6
Patrick,

You asked:

> Question: with tawk, do you just have to set FPAT, or do you have to
> set FPAT and FS? If the latter, the solution would be to pass two REs
> to the parse operation, one to match fields, and one to match delimiters
> (currently, you have a choice of whether your one RE is supposed to match
> fields (which I called v for values) or delimiters).

In Tawk, the manual recommends setting FS = 0 when FPAT is used. This probably implies that both can be active at the same time (?). This is something that I haven't considered. The normally active FS is what the manual calls "whitespace", i.e. any number of spaces or tabs between fields is considered a single delimiter. Much the same as Rexx's Parse command. So far I haven't been setting FS to 0 when I have run my little Awk programs, so I can't say for sure if FS is active or not. I'll have to experiment!

I'll have to think about what you are saying next...but the results look very promising. Now will the two types of RE be hierarchical, in which one is interpreted first? I am thinking that if the first RE part is satisfied, then the RE moves on down the string to continue parsing. If not, then the next part of the RE is attempted. I can see extreme power in such a "Select-When" type of scheme.

If I'm not mistaken, Rexx's Parse syntax does not allow for an FPAT type of selection, just on what the FS or delimiter is. Or does it? Rexx's Parse is so powerful and adaptable, I hesitate to make any strong statement of what it can or can't do.

> Another solution which is more verbose, and closely tied to the number
> of fields, but more reliable and probably faster is:
>   fieldre = '("[^"]*"|[^,"]*)'
>   delimre = ' *, *'
>   re = fieldre
>   /* 6 is one less than the number of fields expected */
>   do 6
>     re = re || delimre || fieldre
>   end
>   /* reComp that if it will be used repeatedly ... */
>   if reParse(re, string, 'stx', 'fields') then
>     do i = 1 to fields.0
>       say i fields.i
>     end
> which gives
>   1 Chemical:
>   2 "1,2-Dichlorobenzene Value"
>   3 "1,2-Dichlorobenzene QA Flag"
>   4
>   5
>   6 1.2345
>   7

Now I have to go keep 2 thirteen-year-olds (no school today) from getting into trouble, which would then get me into trouble with the Boss..

REX
Wed, 19 Oct 2005 03:49:06 GMT
 |
Patrick TJ McPh #6 / 6
% In Tawk, the manual recommends that when FPAT is used to set FS = 0. This
% probably
% implies that both can be active at the same time (?). This is something
% that I haven't considered.

I think I've come up with my solution. The problem was this: my parse routine can match fields three ways -- the RE can match the field delimiter, in which case it does something like this:

  match the first delimiter
  assign everything to the left to field 1
  delete up to the end of the match from the search string
  match the second delimiter
  etc

or it can match the field itself:

  match the first field
  assign the match to field 1
  delete up to the end of the match from the search string
  match the second field
  etc

Both of these fall apart when the RE can match a zero-length string, since RE matching works on the basis of `first, longest match'. That is, when you start matching from the end of the field, it will either match a new field starting right after the first field, or it will match a zero-length string starting right after the first field. What I need to do is ensure that the end-of-match pointer moves between two matches, and ignore cases where that doesn't happen.

(The third way of matching is to map parenthesised sub-expressions to fields, where this problem can't occur).

% I'll have to think about what you are saying next...but the results look
% very promising. Now will
% the two types of RE be hierarchical in which one is interpreted first?

I don't think I understand the question. If you refer to `basic' vs `extended' regular expressions, they're slightly different syntaxes specified by posix, and you have to tell the library which one you're typing. The `x' flag in ReComp, ReExec, and ReParse does this.

% I am
% thinking that if the first RE part
% is satisfied, then the RE moves on down the string to continue parsing. If
% not, then the next part of the
% RE is attempted. I can see extreme power in such an "Select-When" type of
% scheme.

I really don't understand the question. This seems to be about parenthesised sub-expressions. Suppose I have this RE:

  ' *([a-z]+) +([a-z]+)'

That will match zero or more spaces, one or more letters, one or more spaces, and one or more letters. If the input string doesn't have a substring of this nature, the RE won't match. If it does, the first group of letters is field 1, and the second group is field 2 (in the case of ReParse).

% If I'm not mistaken, Rexx's Parse syntax does not allow for an FPAT type of
% selection, just on what
% the FS or delimiter is. Or does it?

[talking about the real rexx parse instruction]

You can specify only delimiters and positions within the string being parsed.

--
Patrick TJ McPhee
East York Canada
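One possible reading of the rule stated earlier in this message ("ensure that the end-of-match pointer moves between two matches, and ignore cases where that doesn't happen") is sketched below in Python. It is an assumption about how such a rule could work, not the actual RexxRE change: `parse_fields` and `last_end` are hypothetical names, and Python's `re` stands in for the library.

```python
import re

def parse_fields(pattern, text):
    """Collect field matches, ignoring any match whose end did not move."""
    fields = []
    pos = 0
    last_end = -1                       # end of the last accepted match; -1 = none yet
    while pos <= len(text):
        m = pattern.search(text, pos)
        if m is None:
            break
        if m.end() == last_end:
            # Zero-length match sitting where the previous match ended:
            # ignore it and step over one character (normally the delimiter).
            pos = m.end() + 1
            continue
        fields.append(m.group())
        last_end = m.end()
        pos = m.end()
    return fields

line = ' Chemical:,"1,2-Dichlorobenzene Value","1,2-Dichlorobenzene QA Flag",,,1.2345,'
star = re.compile(r'"[^"]*"|[^,"]*')
for i, field in enumerate(parse_fields(star, line), start=1):
    print(i, field)
```

Under this reading, the zero-length match directly after a field is discarded, while a zero-length match that sits one delimiter further along counts as a blank field. For the line above that gives seven fields, including the blank one at the end. A line with spaces after a comma would still produce a stray field holding just the space, the same "valid but not desirable" case noted in message #3.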
Thu, 20 Oct 2005 04:19:31 GMT