regular expression matching in J ? (or APL) 
Author Message
 regular expression matching in J ? (or APL)

  Does anybody know how to do regular expression matching in J or APL?

  Ed Cherlin had mentioned that A.J. Perlis had written in some book how to
  do them in APL.  Ed couldn't remember the name of the book though.
  Does anybody else know this reference?

  How would you approach this?

Thanks, Mark

--
 Mark Keil               HP/Apollo Computer,  Chelmsford MA 01824



Mon, 01 Aug 1994 05:38:21 GMT  
 regular expression matching in J ? (or APL)
Mark Keil:
     Does anybody know how to do regular expression matching in J or APL?

Um.. well, sorta...

     Ed Cherlin had mentioned that A.J. Perlis had written in some
     book how to do them in APL.  Ed couldn't remember the name of the
     book though.  Does anybody else know this reference?

I've never seen or heard of this reference, but I'd be interested as
well if anyone knows of it.

     How would you approach this?

Well, that depends, are you talking about a rigorous theoretical
exposition of regular expression matching?  Or are you interested in
specific forms of token recognition?

Implementing a regular expression "set of characters" in j (or APL) is
easy.  In j, to recognize lower case letters, you could do
        Pa =. e. & 'abcdefghijklmnopqrstuvwxyz'

Or, for numeric characters
        P0 =. e. & '0123456789'

To get a regular expression which matches those strings which consist
of exactly that character, you could do:

To get the corresponding klein star, you could do

To get the klein star of an arbitrary regular expression, you could do

Finally, to catenate two regular expressions, you could do

The results of one of these "regular expression verbs" is a bit
stream, with 1s marking how much of the string the regular expression
matches.  And, this is all fine and dandy, except for one thing:  a
literal implementation of these algorithms becomes slow rather
quickly.

Now, one solution to this problem would be to write a program which,
given a set of regular expressions, comes up with a state machine
which will provide the same results when fed the string a character at
a time.  And, indeed, it is for precisely this solution that regular
expressions were invented -- traditional languages implement state
machines rather well.

 Aside: a state machine is an implementation of an array, where
 element A[n] depends on some finite sequence of elements A[n-k]
 through A[n-1].  Thus, the presence of adverbs like \ in a program
 should be an indication that that code might benefit from
 serialization.  [Serialization is a word I've made up to describe
 optimizing code for a "serial computer" -- and "serial computer" is a
 word which I prefer over "scalar computer" as it contrasts with
 "parallel computer".]

However, a much easier solution would be to not throw away all the
information we have available, and to rely on some of the implicit
parallelism available in j.  If nothing else, this avoids having to
rely on phantom "optimizing compilers" which are not yet available.

We'll still need some sort of "serial code", because we're looking for
_sequences_ of characters.  But, there are plenty operations in j with
some sort of "serial characteristics".  And, if we look for the two
character sequences which mark the edge of a token, we're probably way
ahead.

As a trivial example, here's a bit of code which recognizes
"alphanumeric words", and splits everything else up into unique
tokens:
    alph=. '01234546789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    shift =. |.!.0

Or, if you wanted to ignore all non-alphanumeric characters, you could
pull a stunt like

But, in general, you want to come up with a combination of predicates,
shifts and boolean logic operations to mark "token beginings".

Unless you can get by with j's built-in parser.  Then you don't have
to bother with any of this.  Even if the parser doesn't do exactly
what you want, sometimes it's easier to pre-process the original text
than it is to build yet another parser "from scratch".

--

The U.S. government went another thousand million dollars into debt today.



Mon, 01 Aug 1994 12:35:49 GMT  
 
 [ 2 post ] 

 Relevant Pages 

1. iss-matching - the free Regular Expression / Pattern Matching cluster

2. Regular Expression to match HTML elements

3. regular expression: matching ( )

4. Regular expression matching with Halstenbach's REGEXP

5. Regular Expression for Match Pattern (string) Function

6. Bug in regular expression pattern matching?

7. Binding style and the universality of REs (was: Regular Expression Matching)

8. Regular expression string pattern matching: Embedding pop-11 procedures, and more

9. regular expression matching

10. Regular Expression matching...

11. Pattern-matching regular-expression algorithm?

12. Regular Expression Pattern Matching "State" Object

 

 
Powered by phpBB® Forum Software