Regexps and anchoring again 

There was a discussion a few weeks back about Ruby's handling of ^ and $ in
regexps, and I have realised what makes me so uncomfortable with it. I'm used
to matching strings on /^...$/ to mean "match exactly this", and it doesn't
work. In fact it could lead to very {*filter*} security holes. Consider this
example:

       str = cgi['unsafe_item']
       str.untaint if str =~ /^[a-z0-9]+$/

Looks perfectly safe, doesn't it? Errm, no.

       str = "rm -rf /*\nabcde\ndrop table master_db;"
       puts "oops!" if str =~ /^[a-z0-9]+$/   #>> "oops!"

For this to be safe, you actually have to write:

      str.untaint if str =~ /\A[a-z0-9]+\z/
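A minimal sketch of the difference, for anyone who wants to try it. The input string is only an illustrative stand-in for untrusted data, and the variable names are made up for this example:

```ruby
# Hypothetical input: only the middle line is "clean".
str = "rm -rf /*\nabcde\ndrop table master_db;"

line_anchored   = /^[a-z0-9]+$/     # ^ and $ match at every line boundary
string_anchored = /\A[a-z0-9]+\z/   # \A and \z match only at the string's ends

# =~ returns the match offset or nil; turn that into a boolean.
line_match   = !(str =~ line_anchored).nil?       # true: "abcde" is a whole line
string_match = !(str =~ string_anchored).nil?     # false: the string has more in it
whole_ok     = !("abcde" =~ string_anchored).nil? # true: nothing but [a-z0-9]

puts "line-anchored matches:   #{line_match}"
puts "string-anchored matches: #{string_match}"
puts "plain 'abcde' accepted:  #{whole_ok}"
```

The line-anchored pattern succeeds because one line in the middle of the string satisfies it, which is exactly the hole described above.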

The asymmetry between \A and \z is annoying (I have to keep looking it up to
remember which one is capital and which is lower-case), and it leaves
regular expressions looking a lot less readable.

I guess this is fixed in concrete now, but I thought it was worth pointing this
out as potentially a very important "gotcha".

Cheers,

Brian.



Mon, 07 Nov 2005 18:47:14 GMT  
Hi --

Quote:

> There was a discussion a few weeks back about Ruby's handling of ^
> and $ in regexps, and I have realised what makes me so uncomfortable
> with it. I'm used to matching strings on /^...$/ to mean "match
> exactly this", and it doesn't work. In fact it could lead to very
> {*filter*} security holes. Consider this example:

But... but... it's not like it's being kept a secret :-)  I guess
different regex systems do this differently.  sed, for example, treats
^...$ linewise, not stringwise:

  $ echo -e 'abc\ndef' | sed -e 's/^def$/ghi/'
  abc
  ghi

whereas Perl requires the /m modifier.  So there isn't already one
universal syntax outside of Ruby; there's always the need to adjust to
each language's view of things.  I refuse to cast Ruby as the villain
of the piece :-)

Quote:
> [...]
>       str.untaint if str =~ /\A[a-z0-9]+\z/

> The asymmetry between \A and \z is annoying (I have to keep looking
> it up to remember which one is capital and which is lower-case), and
> it leaves regular expressions looking a lot less readable.

You can probably use \Z in most cases; the only difference between \z
and \Z is that \Z anchors before a trailing newline, if there is one.
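A small sketch of that difference, using made-up sample strings:

```ruby
with_newline    = "abc\n"
without_newline = "abc"

# \Z matches at the end of the string, or just before one final newline.
upper_with    = !(with_newline    =~ /\Aabc\Z/).nil?   # true
upper_without = !(without_newline =~ /\Aabc\Z/).nil?   # true

# \z matches only at the very end of the string.
lower_with    = !(with_newline    =~ /\Aabc\z/).nil?   # false
lower_without = !(without_newline =~ /\Aabc\z/).nil?   # true

# Only a single trailing newline is tolerated by \Z.
upper_double  = !("abc\n\n" =~ /\Aabc\Z/).nil?         # false

puts [upper_with, upper_without, lower_with, lower_without, upper_double].inspect
```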

David

--
David Alan Black


Web:   http://www.*-*-*.com/ ~blackdav



Mon, 07 Nov 2005 19:09:04 GMT  



Quote:
> There was a discussion a few weeks back about Ruby's handling of ^ and $
> in regexps, and I have realised what makes me so uncomfortable with it.
> I'm used to matching strings on /^...$/ to mean "match exactly this", and
> it doesn't work. In fact it could lead to very {*filter*} security holes.
> Consider this example:

>        str = cgi['unsafe_item']
>        str.untaint if str =~ /^[a-z0-9]+$/

> Looks perfectly safe, doesn't it? Errm, no.

>        str = "rm -rf /*\nabcde\ndrop table master_db;"
>        puts "oops!" if str =~ /^[a-z0-9]+$/   #>> "oops!"

> For this to be safe, you actually have to write:

>       str.untaint if str =~ /\A[a-z0-9]+\z/

> The asymmetry between \A and \z is annoying (I have to keep looking it up
> to remember which one is capital and which is lower-case), and it leaves
> regular expressions looking a lot less readable.

I always use the uppercase forms, which is a reasonable choice if you
process lines from a file, as in:

while (line = gets)
  case line
  when /\Abegin\Z/   # line still ends in "\n"; \Z anchors before it
    # ...
  end
end

\A and \Z might be even more mnemonic than ^ and $ if you think a moment
about it - but then, we're used to cryptic symbols. :-)
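A runnable sketch of that line-processing case. The StringIO input is a hypothetical stand-in for a real file, and the sample lines are invented for the example:

```ruby
require "stringio"

# Hypothetical input standing in for a file or $stdin.
input = StringIO.new("begin\nbeginning\nend\n")

matched = []
while (line = input.gets)
  # Each line still carries its trailing "\n" here, which \Z tolerates.
  matched << line if line =~ /\Abegin\Z/
end

# With \z the newline has to be removed first.
strict  = "begin\n" =~ /\Abegin\z/         # nil
chomped = "begin\n".chomp =~ /\Abegin\z/   # 0

puts matched.inspect
puts strict.inspect
puts chomped.inspect
```

So \Z saves a chomp when reading lines, at the cost of not telling you whether the newline was there.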

Quote:
> I guess this is fixed in concrete now, but I thought it was worth pointing
> this out as potentially a very important "gotcha"

Yes, it really is.  But I would not blame regexp syntax.  Applications
that do potentially dangerous things with input from the outside world
should be crafted carefully anyway.

Regards

    robert



Mon, 07 Nov 2005 20:04:45 GMT  

Quote:

> > There was a discussion a few weeks back about Ruby's handling of ^
> > and $ in regexps, and I have realised what makes me so uncomfortable
> > with it. I'm used to matching strings on /^...$/ to mean "match
> > exactly this", and it doesn't work. In fact it could lead to very
> > {*filter*} security holes. Consider this example:

> But... but... it's not like it's being kept a secret :-)

Well no, if you read the documentation in its entirety, and forget
everything you knew about regexps and Perl previously. But regexp handling
in Ruby cries out "Yes I'm like Perl! I have /regexp/ and =~ and $1,$2..."
and you have to read the small print - or in my case write broken programs -
to discover that something as fundamental as start and end anchoring doesn't
work in the way that you expect.

"Way that I expect" comes from not only Perl, but also things like Exim
(which embeds PCRE, Perl-compatible Regular Expressions).

Quote:
> >       str.untaint if str =~ /\A[a-z0-9]+\z/

> > The asymmetry between \A and \z is annoying (I have to keep looking
> > it up to remember which one is capital and which is lower-case), and
> > it leaves regular expressions looking a lot less readable.

> You can probably use \Z in most cases; the only difference between \z
> and \Z is that \Z anchors before a trailing newline, if there is one.

I want to say unambiguously "start of string" and "end of string", with no
messing around. If I am validating a string which is going to be inserted
into another string later on, it's important to me whether the provided
value has or does not have a trailing newline.
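To make that concrete, here is a rough sketch. Both helper names and the surrounding string format are hypothetical, invented only for illustration:

```ruby
# Hypothetical helper: accept a non-empty run of [a-z0-9] and nothing else.
def strict_token?(value)
  !(value =~ /\A[a-z0-9]+\z/).nil?
end

# The same check written with \Z, shown for comparison.
def lenient_token?(value)
  !(value =~ /\A[a-z0-9]+\Z/).nil?
end

value = "abcde\n"

strict_result  = strict_token?(value)    # false: the newline is rejected
lenient_result = lenient_token?(value)   # true: the newline slips through

# If the lenient check were trusted, the newline ends up inside the result.
embedded = "user=#{value};role=guest"

puts strict_result
puts lenient_result
puts embedded.inspect
```

Whether a stray newline in the middle of the combined string matters depends on what consumes it, but with \z the question never arises.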

Cheers,

Brian.



Mon, 07 Nov 2005 20:48:51 GMT  
 