regular expression discussion

As classically formulated, a regular expression represents a boolean

valued function which takes a sequence of characters as an argument.

Which means that, considered as a black box, they're not all that

efficient or useful. So, either you get into the mechanics of regular

expression handling, or you build a meta-function [which takes a set

of regular expression/function pairs, and uses the regular expressions

to control what the arguments are for the functions].

Here, I'm going to get into the mechanics of regular expression

handling.

Let's take a regular expression for a floating point number. For

example: [-+]?[0-9]+([.][0-9]*)?([eE][-+]?[0-9]([0-9][0-9]?)?)?

Notation: [] denotes a set of characters, characters are listed

explicitly, or a range is indicated using a hypen

surrounded by the ends of the range.

? denotes 0 or 1 occurances of the previous mini-regexp

* denotes 0 or more occurances of the previous mini-regexp

+ denotes 1 or more occurances of the previous mini-regexp

() groups a sequence so that it may be treated as a mini-regexp

Getting rid of all the optional stuff, the above regular expression

would be: [0-9], which is equivalent to [0123456789]. Examination

reveals that there are four significant character sets in the above

expression:

sign [-+]

digit [0123456789]

decimal [.]

exponent [eE]

For completeness, we may also wish to consider the character set

'other' which is every character not in one of the above sets. Note

that in the general case, we come up with a disjoint partitioning of

the underlying character set. We might also wish to consider the 'end

of the sequence' as a member of a character set. I'm going to ignore

these issues, here.

Next, we break the regexp down in terms of states:

0 {0} empty

1 {1} looking for the sign

2 {2} looking for the required digit

3 {3} looking for the optional digits

4.1 {4} looking for the optional fraction

4.2 {5} looking for the optional exponent

4.1.1 {6} looking for the decimal of the optional fraction

4.1.2 {7} looking for the optional trailing digits of the optional

fraction

4.2.1 {8} looking for the 'E' of the optional exponent

4.2.2 {9} looking for the optional sign of the optional exponent

4.2.3 {10} looking for the first digit of the optional exponent

4.2.4 {11} looking for the optional second and third digit of the

optional exponent

4.2.4.1 {12} looking for the second digit of the optional exponent

digits

4.2.4.2 {13} looking for the third digit of the optional exponent

digits

{14} not looking for anything

{15} done

Here's the regexp again, annotated with state numbers (left side of

numbers align with left edge of relevant mini-regexp).

[-+]?[0-9]+([.][0-9]*)?([eE][-+]?[0-9]([0-9][0-9]?)?)?

0 2 4 7 5 9 10 11 14

1 3 6 8 12 13

If you've been following this (and I don't really blame anyone who

hasn't), you'll have noticed that state 4 is really just an alias for

step 6, because no characters are ever processed, to make the transition

from state 4 to 6.

Aliases are:

4=6

5=8

11=12

Of course, to build a state machine, you need to list state

transitions. There's two kind of transitions that can occur in this

sort of state machine. One is an automatic transition -- these occur

"between the times" when a character is processed. The other is a

character transition -- these occur when a character is received.

Allowable transitions are:

0 -> 1

0 -> 2

1 => 2 [+-]

2 => 3 [0-9]

3 => 3 [0-9]

3 -> 4 -> 6

3 -> 5 -> 8

6 => 7 [.]

6 -> 5 -> 8

7 => 7 [0-9]

7 -> 5 -> 8

8 => 9 [eE]

9 -> 10

9 => 10 [-+]

10 => 11 [0-9]

11 -> 12

12 => 13 [0-9]

13 => 14 [0-9]

Completion transitions are:

3 -> 15

7 -> 15

11 -> 15

13 -> 15

14 -> 15

I might represent state as a fif{*filter*} element boolean vector [ignoring

state 0, since this is not a mini-regexp]. The initial state would be

1 1 0 0 0 0 0 0 0 0 0 0 0 0 0

Or, maybe I want to get rid of the redundancy introduced by my aliased

states -- in which case that would become a 12 element boolean vector.

Receiving a character would be handled by choosing a boolean matrix

(15 by 15) and 'and.or'ing the boolean matrix with my previous state.

Specifying these matrices from a base of a matrix with all 0s, the

transitions which would be set to one are:

[+-]

1 2

9 10

[0-9]

2 3; (2 4; 2 6; 2 5; 2 8; 2 15)

3 3; (3 5; 3 6; 3 5 3 8; 3 15)

7 7; (7 5; 7 8; 7 15)

10 11; (10 15; 11 12)

12 13; (12 15)

13 14; (13 15)

[.]

6 7; (6 5; 6 8; 7 15)

[eE]

8 9; (8 10)

If, after processing a sequence of characters, there is a 1 in the

position for state 15, then the sequence of characters is an instance

of the described regular expression. [Exercise for reader: write the

code to implement this boolean valued function. Try it out on a few

representations of floating point numbers.]

Raul D. Miller