regular expression discussion 
Author Message
 regular expression discussion

As classically formulated, a regular expression represents a boolean
valued function which takes a sequence of characters as an argument.
Which means that, considered as a black box, they're not all that
efficient or useful.  So, either you get into the mechanics of regular
expression handling, or you build a meta-function [which takes a set
of regular expression/function pairs, and uses the regular expressions
to control what the arguments are for the functions].

Here, I'm going to get into the mechanics of regular expression
handling.

Let's take a regular expression for a floating point number.  For
example:  [-+]?[0-9]+([.][0-9]*)?([eE][-+]?[0-9]([0-9][0-9]?)?)?

Notation:  [] denotes a set of characters, characters are listed
               explicitly, or a range is indicated using a hypen
               surrounded by the ends of the range.
            ? denotes 0 or 1 occurances of the previous mini-regexp
            * denotes 0 or more occurances of the previous mini-regexp
            + denotes 1 or more occurances of the previous mini-regexp
           () groups a sequence so that it may be treated as a mini-regexp

Getting rid of all the optional stuff, the above regular expression
would be: [0-9], which is equivalent to [0123456789].  Examination
reveals that there are four significant character sets in the above
expression:

sign     [-+]
digit    [0123456789]
decimal  [.]
exponent [eE]

For completeness, we may also wish to consider the character set
'other' which is every character not in one of the above sets.  Note
that in the general case, we come up with a disjoint partitioning of
the underlying character set.  We might also wish to consider the 'end
of the sequence' as a member of a character set.  I'm going to ignore
these issues, here.

Next, we break the regexp down in terms of states:

0        {0}  empty
1        {1}  looking for the sign
2        {2}  looking for the required digit
3        {3}  looking for the optional digits
4.1      {4}  looking for the optional fraction
4.2      {5}  looking for the optional exponent
4.1.1    {6}  looking for the decimal of the optional fraction
4.1.2    {7}  looking for the optional trailing digits of the optional
                fraction
4.2.1    {8}  looking for the 'E' of the optional exponent
4.2.2    {9}  looking for the optional sign of the optional exponent
4.2.3    {10} looking for the first digit of the optional exponent
4.2.4    {11} looking for the optional second and third digit of the
                optional exponent
4.2.4.1  {12} looking for the second digit of the optional exponent
                digits
4.2.4.2  {13} looking for the third digit of the optional exponent
                digits
         {14} not looking for anything
         {15} done

Here's the regexp again, annotated with state numbers (left side of
numbers align with left edge of relevant mini-regexp).
    [-+]?[0-9]+([.][0-9]*)?([eE][-+]?[0-9]([0-9][0-9]?)?)?
0        2     4   7       5    9    10   11              14
    1    3      6            8             12   13

If you've been following this (and I don't really blame anyone who
hasn't), you'll have noticed that state 4 is really just an alias for
step 6, because no characters are ever processed, to make the transition
from state 4 to 6.

Aliases are:
4=6
5=8
11=12

Of course, to build a state machine, you need to list state
transitions.  There's two kind of transitions that can occur in this
sort of state machine.  One is an automatic transition -- these occur
"between the times" when a character is processed.  The other is a
character transition -- these occur when a character is received.

Allowable transitions are:
0 -> 1              
0 -> 2              
1 => 2              [+-]
2 => 3              [0-9]
3 => 3              [0-9]
3 -> 4 -> 6
3 -> 5 -> 8
6 => 7              [.]
6 -> 5 -> 8
7 => 7              [0-9]
7 -> 5 -> 8
8 => 9              [eE]
9 -> 10
9 => 10             [-+]
10 => 11            [0-9]
11 -> 12
12 => 13            [0-9]
13 => 14            [0-9]

Completion transitions are:
3 -> 15
7 -> 15
11 -> 15
13 -> 15
14 -> 15

I might represent state as a fif{*filter*} element boolean vector [ignoring
state 0, since this is not a mini-regexp].  The initial state would be
 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0

Or, maybe I want to get rid of the redundancy introduced by my aliased
states -- in which case that would become a 12 element boolean vector.

Receiving a character would be handled by choosing a boolean matrix
(15 by 15) and 'and.or'ing the boolean matrix with my previous state.
Specifying these matrices from a base of a matrix with all 0s, the
transitions which would be set to one are:

[+-]
1 2
9 10

[0-9]
2 3; (2 4; 2 6; 2 5; 2 8; 2 15)
3 3; (3 5; 3 6; 3 5 3 8; 3 15)
7 7; (7 5; 7 8; 7 15)
10 11; (10 15; 11 12)
12 13; (12 15)
13 14; (13 15)

[.]
6 7; (6 5; 6 8; 7 15)

[eE]
8 9; (8 10)

If, after processing a sequence of characters, there is a 1 in the
position for state 15, then the sequence of characters is an instance
of the described regular expression.  [Exercise for reader: write the
code to implement this boolean valued function.  Try it out on a few
representations of floating point numbers.]

Raul D. Miller



Sun, 06 Oct 1996 01:15:06 GMT  
 regular expression discussion
.  Receiving a character would be handled by choosing a boolean matrix
.  (15 by 15) and 'and.or'ing the boolean matrix with my previous state.
.  Specifying these matrices from a base of a matrix with all 0s, the
.  transitions which would be set to one are:

Make that 'or.and'ing.

oops...

Raul D. Miller



Sun, 06 Oct 1996 03:02:23 GMT  
 
 [ 2 post ] 

 Relevant Pages 

1. Regular expression

2. A regular expression problem

3. Newbie awk (sed??) question, regular expressions

4. regular expression to describe 16 degrees of wind direction

5. regular expressions

6. Variables in Regular Expressions

7. FPAT and regular expression subtleties

8. Variable as regular expression in pattern?

9. awk pattern as a variable with regular expression

10. regular expressions

11. regular expression support

 

 
Powered by phpBB® Forum Software