regexp operators 

Hi folks,

In doing some work with parsers the other day, I ended up with a
situation where I wanted to embed and combine regexps easily,
preserving any flags that they may have on them (such as
case-sensitivity, etc).

So I whipped this out really quick. It lets you combine regexps by adding
them (+), ORing them (|) or by including them /blah#{some_regexp}blah/
inside each other, while preserving all flags. (At least, all flags
that can be preserved--I can't do anything about the encodings if they
mismatch... but I use all UTF-8 regexps, so I always use 'u'). This
could potentially be extended to support other operations--but I
haven't needed them yet on the project I'm tinkering with.

Maybe this is useful or interesting to somebody besides me. ;)

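class Regexp

  # Returns the i/m/x flags as a string, e.g. "mi".
  # Assumption: the body of this method is missing from the post as
  # archived; parsing #inspect is one way to get the output shown below.
  def options
    inspect =~ %r{/([imx]*)[nesu]?\z}
    $1
  end

  # Wraps the pattern in a (?flags:...) group so that its flags survive
  # interpolation into another regexp.
  def to_s
    # Assumption: this match is missing from the post as archived. It
    # splits #inspect into the pattern ($1) and its i/m/x flags ($2),
    # dropping any trailing encoding flag.
    inspect =~ %r{\A/(.*)/([imx]*)[nesu]?\z}m
    if $2.length > 0
      "(?#{$2}:#{$1})"
    else
      "(?:#{$1})"
    end
  end

  # Alternation: matches either regexp.
  def |(other)
    /#{self}|#{other}/u
  end

  # Concatenation: matches self directly followed by other.
  def +(other)
    /#{self}#{other}/u
  end
end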

Super simple, but has been endlessly handy for me:

$ irb
irb(main):001:0> require 'RegexpOps'
=> true
irb(main):002:0> /bar/ + /foo/
=> /(?:bar)(?:foo)/u
irb(main):003:0> /bar/i + /foo/
=> /(?i:bar)(?:foo)/u
irb(main):004:0> /bar/i + /foo/m
=> /(?i:bar)(?m:foo)/u
irb(main):005:0> /bar/ix + /foo/m
=> /(?ix:bar)(?m:foo)/u
irb(main):006:0> baz = /bar/i + /foo/
=> /(?i:bar)(?:foo)/u
irb(main):007:0> %r!foo#{/bar/i}*#{baz}?!
=> /foo(?i:bar)*(?:(?i:bar)(?:foo))?/
irb(main):008:0> /test/mi.options
=> "mi"

And stuff like that. =)

--

OpenPGP FP: C99E DF40 54F6 B625 FD48  B509 A3DE 8D79 541F F830



Wed, 16 Nov 2005 02:26:53 GMT  
Hi,

At Sat, 31 May 2003 03:26:53 +0900,

Quote:

> So I whipped this out really quick. It lets you combine regexp by adding
> them (+), ORing them (|) or by including them /blah#{some_regexp}blah/
> inside each other, while preserving all flags. (At least, all flags
> that can be preserved--I can't do anything about the encodings if they
> mismatch... but I use all UTF-8 regexps, so I always use 'u'). This
> could potentially be extended to support other operations--but I
> haven't needed them yet on the project I'm tinkering with.

Flags are preserved in 1.8.

$ ruby -v -e 'p(/#{/foo/m}/)'
ruby 1.8.0 (2003-05-31) [i686-linux]
/(?m-ix:foo)/
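
The output above comes from the built-in Regexp#to_s, which wraps the source in a group spelling out which of the m, i and x flags are on and off. A minimal sketch, assuming Ruby 1.8 or later (the variable names are only for illustration):

```ruby
# Built-in Regexp#to_s yields a self-contained group, so the flags of an
# embedded regexp survive interpolation into a regexp with other flags.
inner = /foo/m
outer = /bar#{inner}/i

inner_text   = inner.to_s     # "(?m-ix:foo)"
outer_source = outer.source   # "bar(?m-ix:foo)"

puts inner_text
puts outer_source
```

Because the group switches i off again, the embedded /foo/ stays case-sensitive even though the outer regexp is not.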

--
Nobu Nakada



Wed, 16 Nov 2005 07:41:03 GMT  

Quote:
> Hi,

> At Sat, 31 May 2003 03:26:53 +0900,


> > So I whipped this out really quick. It lets you combine regexp by
> > adding them (+), ORing them (|) or by including them
> > /blah#{some_regexp}blah/ inside each other, while preserving all
> > flags. (At least, all flags that can be preserved--I can't do
> > anything about the encodings if they mismatch... but I use all
> > UTF-8 regexps, so I always use 'u'). This could potentially be
> > extended to support other operations--but I haven't needed them yet
> > on the project I'm tinkering with.

> Flags are preserved in 1.8.

> $ ruby -v -e 'p(/#{/foo/m}/)'
> ruby 1.8.0 (2003-05-31) [i686-linux]
> /(?m-ix:foo)/

I hadn't tried this in 1.8 yet, so it's cool to see that it has a
built-in to_s similar to the one I posted. Nice! =)

Mine also does the + and | operators, as well, though; I'm not sure if
that's universally useful.

Looks like 1.8 still doesn't catch the encoding flag in this case; there
doesn't appear to be any '(?' prefix that changes encodings, though,
which would be a prerequisite. (Personally, I'm happy with UTF-8. ;)

$ ruby -v -e 'p(/#{/foo/mu}/)'
ruby 1.8.0 (2003-05-31) [i686-linux]
/(?m-ix:foo)/
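
As the output above shows, the group that to_s emits only carries m, i and x. In Ruby 1.9 and later (an assumption; this thread predates it) the encoding is still available on the regexp object itself through Regexp#encoding and Regexp#fixed_encoding?, which is what a combining operator would have to consult. A small sketch:

```ruby
re = /foo/mu

text     = re.to_s                  # "(?m-ix:foo)", no trace of the 'u' flag
encoding = re.encoding              # Encoding::UTF_8
pinned   = re.fixed_encoding?       # true, because 'u' pins the encoding
unpinned = /foo/m.fixed_encoding?   # false for a plain ASCII-only literal

puts text
puts encoding
```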

--

OpenPGP FP: C99E DF40 54F6 B625 FD48  B509 A3DE 8D79 541F F830



Wed, 16 Nov 2005 07:59:45 GMT  
Hi,

At Sat, 31 May 2003 08:59:45 +0900,

Quote:

> Mine also does the + and | operators, as well, though; I'm not sure if
> that's universally useful.

As for +, is it right to just concatenate them?  Regexp#| is
provided in lib/eregex.rb.  And you can see also
<http://member.nifty.ne.jp/nokada/archive/reop.rb>.
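
As an aside that is not from the thread: for plain alternation, current Ruby also has Regexp.union in core. It returns an ordinary Regexp and keeps each operand's own flags, so it may cover the | case without any extra library. A short sketch:

```ruby
# Regexp.union joins its arguments with "|" and wraps each one in the
# group produced by the built-in Regexp#to_s.
either = Regexp.union(/foo/i, /bar/)

puts either.source   # (?i-mx:foo)|(?-mix:bar)
puts either.class    # Regexp
```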

Quote:
> Looks like 1.8 still doesn't catch encoding flag in this case; there  
> doesn't appear to be any '(?' prefix that changes encodings, though,
> which would be a prerequisite. (Personally, I'm happy with UTF-8. ;)

Current regexp engine (and perhaps Oniguruma too) can not mix
encodings.  Well, would it be better to preserve it and raise
an exception when it doesn't match?

--
Nobu Nakada



Wed, 16 Nov 2005 15:28:45 GMT  

Quote:
> Hi,

> At Sat, 31 May 2003 08:59:45 +0900,


> > Mine also does the + and | operators, as well, though; I'm not sure
> > if that's universally useful.

> As for +, is it right to just concatenate them?  Regexp#| is
> provided in lib/eregex.rb.  And you can see also
> <http://member.nifty.ne.jp/nokada/archive/reop.rb>.

Well, + meaning concatenation makes sense to me. What else would it
mean? Notice that I do put regexps in (?:) groups so that you don't
have any ambiguity if you do something like:

/foo|bar/ + /.*/   # => /(?:foo|bar)(?:.*)/u  

(vs. getting /foo|bar.*/ which would be, I think, not what you expected,
especially if the regexps were extremely complex)
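
To make the difference concrete, here is a small sketch that builds both variants from plain pattern strings (the variable names are made up for the example) and runs them against the same input:

```ruby
# The same two patterns, joined with and without (?:...) groups.
left  = "foo|bar"
right = ".*"

ungrouped = Regexp.new(left + right)               # /foo|bar.*/
grouped   = Regexp.new("(?:#{left})(?:#{right})")  # /(?:foo|bar)(?:.*)/

ungrouped_match = "foo123"[ungrouped]   # "foo": the .* binds only to "bar"
grouped_match   = "foo123"[grouped]     # "foo123"

puts ungrouped_match
puts grouped_match
```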

I wasn't aware that there were several other regexp-operator
packages. Must be a good idea if several different people have also
thought of it. ;)

One thing that's missing from the packages you point at is that the
object you get back isn't completely usable as a regexp. They could be
extended to have the missing methods, of course, but they don't
currently support them. And if you've added or modified any methods in
regexp, these objects are of a different type (and aren't class
descendants) so won't have the changes applied to them (say, if I
redefine to_s or source or something like that)

i.e.:
irb(main):001:0> require 'eregex'
=> true
irb(main):002:0> x = /foo/ | /bar/

irb(main):003:0> /test/.methods - x.methods
=> ["casefold?", "|", "source", "&", "~", "match", "kcode"]

Anyway, looks like eregex & is pretty handy; and your reop.rb looks even
better, but for me, I think mine is a lot more useful in that it is
totally transparent: when you do an operation on regexps, you get a
regexp back. It doesn't create an object hierarchy as the other two you
cited do; I toyed with that idea, but I didn't like it because I got
objects back that behaved differently than regexps and couldn't be
easily redefined without having some intimate knowledge of the operator
package.

BTW, I never wrote '&' because I didn't really need it, but it could be
done with something like this:

In RegexpOps.rb:
# the other code I posted goes here
class Regexp
  def &(other)
    /(?=#{self})#{other}/u
  end
end

Then:
irb(main):001:0> require 'RegexpOps'
=> true
irb(main):002:0> /foo/ & /bar/
=> /(?=(?:foo))(?:bar)/u

Of course, that regexp will never match anything, but you get the idea.
;)

Quote:
> > Looks like 1.8 still doesn't catch encoding flag in this case;
> > there doesn't appear to be any '(?' prefix that changes encodings,
> > though, which would be a prerequisite. (Personally, I'm happy with
> > UTF-8. ;)

> Current regexp engine (and perhaps Oniguruma too) can not mix
> encodings.  Well, would it be better to preserve it and raise
> an exception when it doesn't match?

For me, the encodings are not a problem, as I only use UTF-8; I do a lot
of multilingual stuff, and UTF-8 is the only way I can support English,
French, Spanish, German, and Japanese (strange mix, but those are the
languages I work with!) simultaneously in Ruby.

In general, though, it seems like it would be a good idea to catch
attempts at mixing encodings and throw an exception if they are
incompatible. I might add that to mine.
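
One way such a check could look, as a rough sketch only: it assumes Ruby 1.9 or later for Regexp#encoding and Regexp#fixed_encoding?, and checked_concat is a hypothetical helper, not part of RegexpOps.

```ruby
# Hypothetical helper: concatenates two regexps, but refuses operands whose
# encodings are both pinned (by a flag such as 'u' or 'e') and different.
def checked_concat(a, b)
  if a.fixed_encoding? && b.fixed_encoding? && a.encoding != b.encoding
    raise ArgumentError,
          "cannot mix #{a.encoding} and #{b.encoding} regexps"
  end
  # The built-in to_s keeps the m/i/x flags of each operand.
  Regexp.new(a.to_s + b.to_s)
end

combined = checked_concat(/foo/u, /bar/i)   # fine: only one side is pinned

mismatch_message = nil
begin
  checked_concat(/foo/u, /bar/e)            # UTF-8 against EUC-JP
rescue ArgumentError => e
  mismatch_message = e.message
end

puts combined.source
puts mismatch_message
```

The sketch only rejects mismatches; it makes no attempt to carry the pinned encoding over to the result.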

--

OpenPGP FP: C99E DF40 54F6 B625 FD48  B509 A3DE 8D79 541F F830



Wed, 16 Nov 2005 23:10:15 GMT  
Hi,

At Sun, 1 Jun 2003 00:10:15 +0900,

Quote:

> Well, + meaning concatenation makes sense to me. What else would it
> mean? Notice that I do put regexps in (?:) groups so that you don't
> have any ambiguity if you do something like:

> /foo|bar/ + /.*/   # => /(?:foo|bar)(?:.*)/u  

One possibility:
=> /(?:foo|bar).*(?:.*)/u

Quote:
> One thing that's missing from the packages you point at is that the
> object you get back isn't completely usable as a regexp. They could be
> extended to have the missing methods, of course, but they don't
> currently support them. And if you've added or modified any methods in
> regexp, these objects are of a different type (and aren't class
> descendants) so won't have the changes applied to them (say, if I
> redefine to_s or source or something like that)

Yes, I know it's a problem.  Besides what you mentioned, some
String methods expect a Regexp instance.

Quote:
> BTW, I never wrote '&' because I didn't really need it, but it could be
> done with something like this:

> In RegexpOps.rb:
> # the other code I posted goes here
> class Regexp
>   def &(other)
>     /(?=#{self})#{other}/u
>   end
> end

Seems nice.

Quote:
> Of course, that regexp will never match anything, but you get the idea.
> ;)

Maybe, /(?=.*(?:foo)).*(?:bar)/?
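
A quick comparison of the two forms, as a sketch (the sample strings are arbitrary and Regexp#match? assumes Ruby 2.4 or later):

```ruby
# strict: the lookahead demands "foo" at the very spot where "bar" must start,
# so no string can satisfy both.
strict = /(?=(?:foo))(?:bar)/
# loose: both patterns only have to occur somewhere from the current position.
loose  = /(?=.*(?:foo)).*(?:bar)/

samples = ["foobar", "barfoo", "foo", "bar"]

strict_hits = samples.select { |s| strict.match?(s) }   # []
loose_hits  = samples.select { |s| loose.match?(s) }    # ["foobar", "barfoo"]

p strict_hits
p loose_hits
```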

--
Nobu Nakada



Thu, 17 Nov 2005 00:35:15 GMT  

Quote:
> Hi,

> At Sun, 1 Jun 2003 00:10:15 +0900,


> > Well, + meaning concatenation makes sense to me. What else would it
> > mean? Notice that I do put regexps in (?:) groups so that you don't
> > have any ambiguity if you do something like:

> > /foo|bar/ + /.*/   # => /(?:foo|bar)(?:.*)/u

> One possibility:
> => /(?:foo|bar).*(?:.*)/u

I tend to think of + for regexp to be like + for strings--just a simple
concatenation. However, I can see how the above would make sense, if
you think of "a + b" as meaning more like "match a, and then match b;
what's in between doesn't really matter".

If you did that, you'd need some way to do the "a directly followed by
b" semantics. I suppose << would work for that, though. =)

/foo|bar/ << /.*/
=> /(?:foo|bar)(?:.*)/
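
The two meanings can be sketched side by side. The method names below are hypothetical, and the sketch relies on the built-in Regexp#to_s rather than reopening Regexp:

```ruby
# "a directly followed by b" (the << idea above)
def followed_by(a, b)
  Regexp.new("#{a}#{b}")
end

# "match a, then match b; what's in between doesn't matter" (the other + reading)
def then_later(a, b)
  Regexp.new("#{a}.*#{b}")
end

tight = followed_by(/foo|bar/, /baz/)
loose = then_later(/foo|bar/, /baz/)

puts tight.source   # (?-mix:foo|bar)(?-mix:baz)
puts loose.source   # (?-mix:foo|bar).*(?-mix:baz)
```

With these, "foo--baz" is matched by loose but not by tight, while "barbaz" is matched by both.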

--

OpenPGP FP: C99E DF40 54F6 B625 FD48  B509 A3DE 8D79 541F F830



Thu, 17 Nov 2005 03:17:26 GMT  
 