Regular Expression mystery REGEXPR
I'm working in PB/CC 2.0
PBCC .EXE A 206414 06-02-99 02:00a
which is the latest version, I believe. The 06-01-99 timestamp
version had a bug with regular expressions.
Maybe I'm not a regular guy, but I'm having trouble with this.
The input file is email messages. The lines are delimited with
linefeeds [hex 0a, chr$ (10), lf] not with the DOS standard
CR-LF.
I have isolated the header section and the body of each
message, each in its own string.
I want to extract certain of the headers (To: From: Date:
Subject: etc) so that I can write them to another file. The
slight complication is that email headers can have continuation
lines which start with whitespace (spaces and horizontal tabs
for sure). I want to extract the continuation lines together
with the original header line.
Doing this with traditional BASIC methods (INSTR and MID$, mostly)
would not be too difficult, but I have ventured to baptize
myself in regular expressions and the REGEXPR statement.
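For comparison, the INSTR/MID$ fallback I have in mind would look roughly like the sketch below. GetHeader and its parameter names are made up for illustration. It assumes the header section uses bare linefeeds as described above, and it compares the header name case-insensitively via UCASE$. Treat it as a sketch, not tested code.

```powerbasic
' Sketch only: returns the named header plus any continuation lines,
' or "" if the header is absent. Line ends are assumed to be bare LF.
FUNCTION GetHeader (BYVAL hdr AS STRING, BYVAL wanted AS STRING) AS STRING
  LOCAL p&, q&, c$
  hdr = CHR$(10) + hdr                  ' so a header on the first line is found too
  p& = INSTR(UCASE$(hdr), CHR$(10) + UCASE$(wanted) + ":")
  IF p& = 0 THEN EXIT FUNCTION          ' header not present: result stays ""
  p& = p& + 1                           ' step past the leading linefeed
  q& = p&
  DO
    q& = INSTR(q&, hdr, CHR$(10))       ' end of the current line
    IF q& = 0 THEN                      ' no final linefeed: take the rest
      q& = LEN(hdr) + 1
      EXIT DO
    END IF
    c$ = MID$(hdr, q& + 1, 1)           ' first character of the next line
    IF c$ <> " " AND c$ <> CHR$(9) THEN EXIT DO
    q& = q& + 1                         ' continuation line: keep scanning
  LOOP
  FUNCTION = MID$(hdr, p&, q& - p&)     ' header text without the final linefeed
END FUNCTION
```

It would be called along the lines of STDOUT GetHeader(header$, "To") & "---".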
The online help gives the following Syntax:
REGEXPR mask$ IN main$ [AT start&] TO posvar&, lenvar&
Remarks: REGEXPR scans main$ for a matching expression in mask$.
If found, it returns the position of the match in the posvar&
variable and the length of the matching expression in lenvar&.
Optionally, you can specify a starting position for the search
in start&. If no matching expression is found, both posvar&
and lenvar& are set to zero. By default, all search
expressions are assumed to be case-insensitive, which means
that "a" is equal to "A".
and so on.
In the array strip$ () I have set up the headers that I want to
extract.
To make a long story short, I have tried a lot of things that
didn't work, and I don't know why. But here is the crux of
the mystery:
REGEXPR mask$ IN header$ TO startmatch&, lenmatch&
target$ = MID$ (header$, startmatch&, lenmatch&)
STDOUT target$ & "---"
The following mask$ settings all give identical results:
mask$ = "\n" + strip$ (i&) + ".*\n[^\x20\x09]"
mask$ = "\n" + strip$ (i&) + ".*\n[^\x20^\x09]"
mask$ = "\n" + strip$ (i&) + ".*\n[\x20\x09]"
Map:
  \n           is the token for linefeed (*not* for CRLF)
  strip$ (i&)  is a wanted header, for example To, Subject etc.
  .            matches any character except end-of-line
               (I have tested this, and end-of-line here means CRLF,
               which is consistent with the documentation)
  *            means repeat the preceding . any number of times
               (non-greedily, it seems)
  []           identifies a class of characters
  ^            inside the [] identifies the complement of the set of
               characters, i.e. take any character NOT within the []
  \x##         identifies a hex code, so \x20 is <space> or chr$ (32),
               while \x09 is <tab> or chr$ (9)
OK, so the first mask should find a header at the beginning of a
line, then scoop up characters until (and including) it finds a
linefeed followed by any character except a space or a tab.
What actually happens is as follows:
-- If the header does not exist in the header section, a null
string is returned. This is as expected!
-- If the header does exist in the header section, but has no
continuation line, a null string is returned.
-- If the header does exist, with a continuation line or lines,
the header line plus the white space of the continuation line
is returned.
What puzzles me most is that the first and the last mask$ give the
same result. It is as if the ^ complement operator were being
ignored. The middle mask$ maybe should not have any defined
meaning; I tried it in desperation. Same result.
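To make this reproducible, here is a minimal stand-alone test along the lines of what I am running. The sample header text and the addresses are invented, and "To" stands in for strip$ (i&). The REGEXPR call follows the syntax quoted from the online help, and the only addition is that the position and length are printed in front of the matched text.

```powerbasic
FUNCTION PBMAIN () AS LONG
  LOCAL header$, target$, startmatch&, lenmatch&, i&
  DIM mask$(1 TO 3)

  ' Invented sample: LF-only line ends, To: has one continuation line (tab).
  header$ = CHR$(10) + "From: a@example.com" + CHR$(10)
  header$ = header$ + "To: b@example.com," + CHR$(10)
  header$ = header$ + CHR$(9) + "c@example.com" + CHR$(10)
  header$ = header$ + "Subject: test" + CHR$(10)

  ' The three masks from above, with "To" in place of strip$ (i&).
  mask$(1) = "\n" + "To" + ".*\n[^\x20\x09]"
  mask$(2) = "\n" + "To" + ".*\n[^\x20^\x09]"
  mask$(3) = "\n" + "To" + ".*\n[\x20\x09]"

  FOR i& = 1 TO 3
    REGEXPR mask$(i&) IN header$ TO startmatch&, lenmatch&
    target$ = MID$(header$, startmatch&, lenmatch&)
    ' Show position and length, then the matched text.
    STDOUT STR$(startmatch&) + STR$(lenmatch&) + " " + target$ + "---"
  NEXT i&
END FUNCTION
```

Removing the continuation line from the sample should exercise the second case listed above.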
What am I doing wrong?
I searched for regular expression reference info on the
internet. The most concise I found was in Microsoft's Language
Reference, Visual Basic Scripting Edition, Regular Expression
Syntax. Unfortunately, M$'s implementation of regular
expressions seems different from PowerBASIC's. There seem to
be a lot of implementations of regular expressions that differ
substantially in detail. "Regular" does not mean "standard"!
I was not able to find much on regular expressions relating
specifically to PowerBASIC, either in the PB forums, or in the
samples provided with PB/CC, or on the www at large. I'd like
to see a lot of examples.
Please excuse the length of this message!
--
cheers
Jonathan Berry
http://www.*-*-*.com/ ~jberry/ to know more than you want