Partial RegExp's 
 Partial RegExp's

Here's a question I've been pondering for a few weeks:

Inherent in the design of Perl seems to be the idea that you always have
enough space in memory for the entirety of any file you'd ever want to
work on. But what if you don't? How does one do wonderful things like
"split" and various pattern matches & such on files that are just too big?

Well, if there's already a standard, ok-I'll-spell-it-all-out-for-a-
neophyte-like-you-but-next-time-read-the-book-okay answer to this
question, I'd like to hear it. If not, an idea:

Of course, the way you deal with huge files is to read them in chunks.
The problem with that is that you miss a pattern match that starts on
one chunk and ends on another.
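The failure is easy to demonstrate in a modern Perl (the chunk boundary below is contrived, but any fixed-size read can land in exactly such a spot):

```perl
my $text   = "hello wonderful world";
my @chunks = ("hello wonderf", "ul world");   # an unlucky chunk boundary

my $per_chunk = grep { /wonderful/ } @chunks;   # matches seen chunk-by-chunk
my $whole     = () = $text =~ /wonderful/g;     # matches seen on the whole text
# $per_chunk is 0 -- the straddling match is lost; $whole is 1
```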

Currently, the result of an attempt at a pattern match gives one of two
responses: "Got a match" or "Didn't get a match". What if there were a
way to tell the pattern matcher that some patterns may extend off the
end of the currently available text, and we gave the matcher the ability
to give two other responses: "Got a partial match, and if you give me
more data, I may get a match" and "Got a match, which may turn into a
bigger match if you give me more data". The matcher would also return
the place at which the partial match began.

Now, from what I know of pattern matching, it seems to me that this would
be an easy modification to do. The question, then, is whether it would
be worthwhile. So, does anyone think so? Or is it just me?

                                Glenn Chappell  <><



Fri, 28 Apr 1995 23:11:20 GMT  
 Partial RegExp's

Maintaining search-state is also the desired approach for programs which
read in chunks because only chunks are available -- chat2 / expect,
for example.  This is also a big help for asynchronous programs which
want to match against several chunky streams simultaneously.
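That bookkeeping can be sketched roughly like this (feed and %buf are made-up names, not anything chat2 or expect actually provides): keep one buffer per stream, append whatever arrives, and on a match keep only the unconsumed tail for the next read.

```perl
my %buf;    # one accumulated buffer per stream name

# Append newly arrived data for a stream; return the matched text if
# the pattern is now satisfied, keeping only the tail after the match.
sub feed {
    my ($stream, $data, $pat) = @_;
    $buf{$stream} .= $data;
    if ($buf{$stream} =~ /$pat/) {
        $buf{$stream} = $';          # text after the match
        return $&;                   # the matched text
    }
    return undef;                    # partial so far; wait for more data
}
```

Each stream's buffer advances independently, so several chunky streams can be matched against at once.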




Sun, 30 Apr 1995 00:28:49 GMT  
 Partial RegExp's
: Here's a question I've been pondering for a few weeks:
:
: Inherent in the design of Perl seems to be the idea that you always have
: enough space in memory for the entirety of any file you'd ever want to
: work on. But what if you don't? How does one do wonderful things like
: "split" and various pattern matches & such on files that are just too big?
:
: Well, if there's already a standard, ok-I'll-spell-it-all-out-for-a-
: neophyte-like-you-but-next-time-read-the-book-okay answer to this
: question, I'd like to hear it. If not, an idea:
:
: Of course, the way you deal with huge files is to read them in chunks.
: The problem with that is that you miss a pattern match that starts on
: one chunk and ends on another.
:
: Currently, the result of an attempt at a pattern match gives one of two
: responses: "Got a match" or "Didn't get a match". What if there were a
: way to tell the pattern matcher that some patterns may extend off the
: end of the currently available text, and we gave the matcher the ability
: to give two other responses: "Got a partial match, and if you give me
: more data, I may get a match" and "Got a match, which may turn into a
: bigger match if you give me more data". The matcher would also return
: the place at which the partial match began.
:
: Now, from what I know of pattern matching, it seems to me that this would
: be an easy modification to do. The question, then, is whether it would
: be worthwhile. So, does anyone think so? Or is it just me?

The main problem is the semantics of *, + and {}.  Currently these want
to match the LONGEST possible string.  Some patterns would have to
slurp in the entire file anyway.

This is one of the reasons that Mark and I have been talking about
variants of *, + and {} that would match the shortest string possible.
Currently there are only two ways to do first-match rather than last-match:
1) have the thing you're looking for be the first thing in a pattern, or
2) carefully write the intermediate stuff to exclude the pattern you're
looking for.  Both of these approaches leave something to be desired.
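The contrast is visible on a single line; the shortest-match variant discussed here did eventually appear in Perl 5 as the non-greedy `.*?` form, shown last for comparison:

```perl
my $line = 'say "foo" and "bar"';

my ($greedy)     = $line =~ /"(.*)"/;      # longest match runs past the first quote
my ($workaround) = $line =~ /"([^"]*)"/;   # option 2: exclude the delimiter
my ($lazy)       = $line =~ /"(.*?)"/;     # the shortest-match variant, as later added
# $greedy is 'foo" and "bar'; $workaround and $lazy are both 'foo'
```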

A thing that might help with option 1 is if we attached the state of
m//g searches to the searched string rather than to the pattern
itself.  This would let you switch from one pattern to another in mid
search.  One could then do tokenizing without s///, for instance.  Of
course, that doesn't help the slurp-in-more-file-if-you-run-out problem.
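For what it's worth, this is essentially what Perl 5 later did: the m//g position is kept on the string (pos()), and with \G plus the /c flag one can tokenize by switching patterns mid-scan. A rough sketch in that later idiom:

```perl
my $src = 'x = 42;';
my @tokens;
# \G anchors each match where the previous one left off; the position
# belongs to $src itself, not to any one pattern, so patterns can be
# swapped mid-search without losing place.
while ($src =~ /\G\s*([A-Za-z_]\w*|\d+|[^\s\w])/gc) {
    push @tokens, $1;
}
# @tokens: 'x', '=', '42', ';'
```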

I don't see any theoretical problems with regular expression operators
that backtrack by attempting to match more rather than less.  There's
certainly no problem with oscillation, since any given operator always
progresses in the same direction.

As for the feature in question, it's just one of those things where you
try to figure out the most general (and practical) way to do it, then
think about whether you want to do it at all, and then maybe find time
to do it.  None of these is a trivial undertaking.  Language design is
not a haphazard activity, appearances notwithstanding.  :-)

Larry



Mon, 01 May 1995 08:30:38 GMT  
 Partial RegExp's


Quote:
>> Of course, the way you deal with huge files is to read them in chunks.
>> The problem with that is that you miss a pattern match that starts on
>> one chunk and ends on another.

My mg (multi-line grep) script is doing pattern match like
this:

********************
               *************************
                                   *************************

Of course, some matches can still be missed, but the method works
fine for my purposes.  It is also true that writing this kind of
program is quite tough; I had to spend a lot of time debugging it.
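The scheme in the diagram can be sketched as follows (windowed_matches is a made-up name, and the real mg surely differs in detail; matches longer than the carried-over tail can still be missed):

```perl
# Hedged sketch of the overlapping-window method: carry a tail of up to
# $overlap unmatched characters from one chunk into the next, so a match
# straddling a boundary can still be seen.
sub windowed_matches {
    my ($chunks, $pat, $overlap) = @_;
    my ($buf, @hits) = ('');
    for my $chunk (@$chunks) {
        $buf .= $chunk;
        my $tail_start = 0;
        while ($buf =~ /$pat/g) {
            push @hits, $&;
            $tail_start = pos($buf);   # never re-scan matched text
        }
        my $keep = length($buf) - $tail_start;
        $keep = $overlap if $keep > $overlap;
        $buf = substr($buf, length($buf) - $keep);
    }
    return @hits;
}

# a match straddling the first two chunks is still found:
my @hits = windowed_matches(
    ["hello wond", "erful worl", "d, wonderful"], qr/wonderful/, 9);
```

$overlap must be at least one less than the longest possible match, which is why patterns with unbounded quantifiers remain a problem.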

--utashiro



Mon, 01 May 1995 12:43:51 GMT  
 Partial RegExp's
Quote:
lwall writes:


  : Here's a question I've been pondering for a few weeks:
  :
  : Inherent in the design of Perl seems to be the idea that you always have
  : enough space in memory for the entirety of any file you'd ever want to
  : work on. But what if you don't? How does one do wonderful things like
  : "split" and various pattern matches & such on files that are just too big?

Yes, I know what you mean.  For me it is very tempting to read in a whole
file, split on whitespace, then shift off tokens.  Which is silly
unless you are looking to match disparate token sequences, such as across
line boundaries or deep in a parse.

  : Well, if there's already a standard, ok-I'll-spell-it-all-out-for-a-
  : neophyte-like-you-but-next-time-read-the-book-okay answer to this
  : question, I'd like to hear it. If not, an idea:

Not really, but there are the usual ways to break up input.

  : Of course, the way you deal with huge files is to read them in chunks.
  : The problem with that is that you miss a pattern match that starts on
  : one chunk and ends on another.
  :
  : Currently, the result of an attempt at a pattern match gives one of two
  : responses: "Got a match" or "Didn't get a match". What if there were a
  : way to tell the pattern matcher that some patterns may extend off the
  : end of the currently available text, and we gave the matcher the ability
  : to give two other responses: "Got a partial match, and if you give me
  : more data, I may get a match" and "Got a match, which may turn into a
  : bigger match if you give me more data". The matcher would also return
  : the place at which the partial match began.

A better solution would probably be to write an FSM (finite-state
machine) package, or more specifically a little parser package.  Which
reminds me,

Although I haven't used it, the author describes it as a "prototype of
a hacker's parser for perl."

  : Now, from what I know of pattern matching, it seems to me that this would
  : be an easy modification to do. The question, then, is whether it would
  : be worthwhile. So, does anyone think so? Or is it just me?

If that is your specific goal then you may want to do something
simpler, such as breaking your expression into two (if you know where
you can expect boundaries).  Let's say you'd like to replace "foo bar"
with "moo man" across whitespace boundaries:

#!/bin/perl
$reg1 = 'foo';
$reg2 = 'bar';

&prog while (<>);

sub prog {
    # Append the next line whenever the first half of the pattern
    # matches, so "foo\nbar" can be seen as one unit; stop at EOF so
    # a trailing unmatched "foo" can't recurse forever.
    if ($_ =~ /$reg1/ && !eof()) {
        $_ .= <>;
        s/$reg1([ \t\n]+)$reg2/moo\1man/g;
        &prog;
    } else {
        print;
    }
}

[damn, there I go recursing again.  If it's not an eval, it's recursion.]

  lw> This is one of the reasons that Mark and I have been talking about
  lw> variants of *, + and {} that would match the shortest string
  lw> possible.  Currently there are only two ways to do first-match
  lw> rather than last-match: 1) have the thing you're looking for be the
  lw> first thing in a pattern, or 2) carefully write the intermediate
  lw> stuff to exclude the pattern you're looking for.  Both of these
  lw> approaches leave something to be desired.

  lw> A thing that might help with option 1 is if we attached the state
  lw> of m//g searches to the searched string rather than to the pattern
  lw> itself.  This would let you switch from one pattern to another in
  lw> mid search.  One could then do tokenizing without s///, for
  lw> instance.  Of course, that doesn't help the
  lw> slurp-in-more-file-if-you-run-out problem.

One thing that might be a simple/useful change to perl would be
to let one assign a regex to $/ for doing on-the-fly parsing.
(to remain compatible with old scripts, this magic wouldn't turn
on unless you did a 'study $/')
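As it happens, $/ never did grow regex support (later Perls accept a plain string or a fixed-record-length reference there, but not a pattern), so the behaviour has to be emulated by hand. A hypothetical read_record along these lines, in a modern Perl:

```perl
# Hypothetical emulation of a regex record separator ($sep), since $/
# itself cannot hold a pattern.  $bufref carries leftover text between
# calls.  Note a separator split across reads can still be cut short --
# the very problem of this thread.
sub read_record {
    my ($fh, $sep, $bufref) = @_;
    until ($$bufref =~ /$sep/ || eof($fh)) {
        read($fh, $$bufref, 4096, length($$bufref));   # append a chunk
    }
    if ($$bufref =~ /$sep/) {
        my $rec = $`;          # text before the separator
        $$bufref = $';         # text after it, kept for the next call
        return $rec;
    }
    my $rec = $$bufref;        # EOF: whatever is left is the last record
    $$bufref = '';
    return length($rec) ? $rec : undef;
}

# split "aXXbYc" on /X+|Y/ into records 'a', 'b', 'c':
open my $fh, '<', \"aXXbYc" or die;
my $leftover = '';
my @recs;
while (defined(my $r = read_record($fh, qr/X+|Y/, \$leftover))) {
    push @recs, $r;
}
```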

  lw> I don't see any theoretical problems with regular expression
  lw> operators that backtrack by attempting to match more rather than
  lw> less.  There's certainly no problem with oscillation, since any
  lw> given operator always progresses in the same direction.

This is probably one of those cases where the two sets of operators can
be proven equivalent in matching power, making the extra operators
strictly redundant, but convenient nonetheless.

  lw> As for the feature in question, it's just one of those things where
  lw> you try to figure out the most general (and practical) way to do
  lw> it, then think about whether you want to do it at all, and then
  lw> maybe find time to do it.  None of these is a trivial undertaking.
  lw> Language design is not a haphazard activity, appearances
  lw> notwithstanding.  :-)

/charles
==============================================================================
Charles Thayer, Columbia University, Dept. CS, Tech-staff... my words only.



Sun, 07 May 1995 07:06:32 GMT  
 Partial RegExp's

Quote:

>One thing that might be a simple/useful change to perl would be to let one
>assign a regex to $/ for doing on-the-fly parsing.  (to remain compatible
>with old scripts, this magic wouldn't turn on unless you did a 'study $/')

That would be a good place to use a "version" directive:

        version 5.1;    # enable RE's in $/

Michael.



Sun, 07 May 1995 23:22:04 GMT  
 