Checking strings for "bad" characters 

I've got some very long Unicode strings which I wish to test for the presence of ASCII characters 0-8 and 14-31. My first thought was to use regular expressions, e.g.:

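import re
# Character class matching code points 0-8 and 14-31
r = re.compile(u'[%s%s]' % (''.join([unichr(x) for x in range(0, 9)]) , ''.join([unichr(x) for x in range(14, 32)])))
# Search the string under test (not the pattern object itself)
amatch = r.search(long_unicode_string)
if amatch:
    print "Bad characters"
else:
    print "OK"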

but is there a better or faster method?

TIA

Harvey

_____________________________________________________________________
This message has been checked for all known viruses by the MessageLabs Virus Scanning Service.



Sat, 12 Feb 2005 21:52:02 GMT  
Your regular expression should be reasonably efficient.  For instance, it
should beat an approach like
        for c in range(0, 9) + range(14, 32):
                if unichr(c) in long_unicode_string:
                        print "Bad characters"
                        break
since the loop scans the string once per bad character, whereas the RE
scans it only once.
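
To make the comparison concrete, here is a minimal sketch of both approaches.
It is written for Python 3 (where str is already Unicode and chr replaces
unichr), which is an assumption, since the thread itself uses Python 2. The
names BAD_CODES, BAD_RE, has_bad_loop and has_bad_re are made up for
illustration.

```python
import re

# Code points 0-8 and 14-31, as in the original question.
BAD_CODES = list(range(0, 9)) + list(range(14, 32))

# The same set of characters, written as two ranges in a character class.
BAD_RE = re.compile(r'[\x00-\x08\x0e-\x1f]')


def has_bad_loop(text):
    """One `in` test per bad character: up to 27 passes over the text."""
    return any(chr(code) in text for code in BAD_CODES)


def has_bad_re(text):
    """A single regular-expression search: one pass over the text."""
    return BAD_RE.search(text) is not None
```

Both functions should agree on every input; they differ only in how many
times the string may be scanned.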

If matching on unicode strings is much slower than matching on 8-bit
strings, you could try encoding first and searching the result:
        long_utf8_string = long_unicode_string.encode("utf-8")
        r.search(long_utf8_string)
The question is whether the possibly faster search makes up for the
possibly slow conversion.

My testing suggests that it does not.  See the program below.  Times are in
seconds for three searches of a one-million-character string:
        unicode     3.69167995453
        unicode re  3.7019649744
        utf-8       6.15347003937

(unicode vs unicode re shows that it makes no difference whether the RE
pattern is given as a unicode or an 8-bit string)

Jeff

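import time, re

# One million non-ASCII characters and no "bad" ones, so every search
# has to scan the whole string.  Note the u prefix: without it this
# would be an 8-bit string, not a unicode string.
us = u"\u0100" * 1000 * 1000
# The same character class compiled from an 8-bit and a unicode pattern
r = re.compile('[%s]' % ''.join([chr(x) for x in range(0, 9) + range(14, 32)]))
ru = re.compile(u'[%s]' % ''.join([chr(x) for x in range(0, 9) + range(14, 32)]))

# 8-bit pattern against the unicode string
t = time.time()
for i in range(3):
        r.search(us)
print "unicode", time.time() - t

# unicode pattern against the unicode string
t = time.time()
for i in range(3):
        ru.search(us)
print "unicode re", time.time() - t

# encode to UTF-8 first, then search the 8-bit result
t = time.time()
for i in range(3):
        s = us.encode("utf-8")
        r.search(s)
print "utf-8", time.time() - t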
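
For anyone repeating this measurement today, here is a rough Python 3
equivalent using timeit. It is only a sketch under some assumptions: Python 3
rather than the Python 2 used in this thread, a shorter test string so that it
runs quickly, and illustrative names (text, search_str, search_utf8). It makes
no claim about which variant wins on a given interpreter.

```python
import re
import timeit

# Shorter than the thread's one-million-character string so it runs quickly.
text = "\u0100" * 100_000

# Code points 0-8 and 14-31, as a str pattern and as a bytes pattern.
pattern = re.compile(r'[\x00-\x08\x0e-\x1f]')
bytes_pattern = re.compile(rb'[\x00-\x08\x0e-\x1f]')


def search_str():
    # Search the Unicode string directly.
    return pattern.search(text)


def search_utf8():
    # Encode first, then search the bytes; the encoding cost is timed too.
    return bytes_pattern.search(text.encode("utf-8"))


# Three searches per variant, mirroring the loop count in the program above.
timings = {
    "str": timeit.timeit(search_str, number=3),
    "utf-8": timeit.timeit(search_utf8, number=3),
}
for label, seconds in timings.items():
    print(label, seconds)
```

Searching the encoded bytes is valid here because UTF-8 encodes every
non-ASCII character using only bytes of 0x80 and above, so a byte in the
ranges 0-8 or 14-31 can only come from one of the bad characters.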



Sat, 12 Feb 2005 22:36:52 GMT  
If you could use string.maketrans and .translate() to convert all bad characters
that might be present into a single code (e.g. \x00), and then do a simple
.find() for that character, you might get the benefits of simplicity and extreme
speed.
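
A minimal sketch of this idea follows, with two caveats. In Python 2,
string.maketrans builds a table for 8-bit strings only, while
unicode.translate expects a mapping keyed by code point, so a dict is used
here. The sketch is also written for Python 3, which is an assumption (the
thread uses Python 2), and the names BAD_TO_MARKER and has_bad_translate are
hypothetical.

```python
# Map every bad code point (0-8 and 14-31) to a single marker, U+0000.
BAD_TO_MARKER = {code: 0 for code in list(range(0, 9)) + list(range(14, 32))}


def has_bad_translate(text):
    # translate() builds a complete new string, which find() then scans
    # for the marker character.
    return text.translate(BAD_TO_MARKER).find("\x00") != -1
```

Note that the whole string is copied before the search even starts, which
costs time on very long inputs.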

-Peter



Sun, 13 Feb 2005 11:15:14 GMT  
Thanks for the suggestion, Peter, but it's much slower than the RE. I guess that's because a new string has to be created and then scanned. I suppose the only thing faster than the RE would be a C extension, but I don't think that's worth the effort.

Harvey




Sun, 13 Feb 2005 16:41:31 GMT  
 