how to define alphanumeric (words containing non-ASCII chars) 

The term "alphanumeric" (regular expression \w) is defined as
[0-9_A-Za-z], i.e. the alphanumeric characters of the *English*
language.  For processing other languages using ISO-8859 code, one
would need [0-9_A-Za-z\300-\326\330-\366\370-\377].
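
As a minimal sketch of that class in use (assuming ISO-8859-1 input, and with the second range written as \330-\366 so that it runs through o-umlaut and skips only the division sign at \367), the following pulls whole words out of a byte string. The variable names and the sample text are illustrative assumptions, not part of the original post.

```perl
use strict;
use warnings;

# Latin-1 word characters, written as octal escapes so the source stays ASCII.
# \327 (multiplication sign) and \367 (division sign) are deliberately left out.
my $word_char = qr/[0-9_A-Za-z\300-\326\330-\366\370-\377]/;

# Sample text (an assumption): "Gruesse aus Koeln" spelled with
# u-umlaut (\374), sharp s (\337) and o-umlaut (\366) as ISO-8859-1 bytes.
my $text  = "Gr\374\337e aus K\366ln";

# Collect every maximal run of word characters.
my @words = $text =~ /($word_char+)/g;

print scalar(@words), " words found\n";
```

Replacing \w this way is a plain substitution of one class for another, which is why only \b needs extra work.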

For \w, this is no problem but emulating the effect of \b is rather
tedious.  Hence my question:

Is there a way in perl to define the set of characters that is
considered alphanumeric, regarding regular expressions \b and \w?

Thanks for any help.

Helmut Richter



Mon, 01 Jun 1998 03:00:00 GMT  

Quote:

>The term "alphanumeric" (regular expression \w) is defined as
>[0-9_A-Za-z], i.e. the alphanumeric characters of the *English*
>language.  For processing other languages using ISO-8859 code, one
>would need [0-9_A-Za-z\300-\326\330-\366\370-\377].
>For \w, this is no problem but emulating the effect of \b is rather
>tedious.  Hence my question:
>Is there a way in perl to define the set of characters that is
>considered alphanumeric, regarding regular expressions \b and \w?

Not easily. Perl ultimately depends on your C library's idea of isalpha
for its \w class.

You can simulate \b, using zero-width assertions. It's long-winded, but
it works. You normally know whether it's a word ending or beginning
you're aiming for. Here's a word ending for German:

        [\w?????]+(?=[^\w?????]|$)
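
A fuller sketch of the same idea, covering both word starts and word ends, might look like this. It assumes ISO-8859-1 bytes and assumes the extra characters are the seven German letters listed in the comment; the variable names are hypothetical. The negative lookahead (?![...]) stands in for the (?=[^...]|$) form, and the lookbehind needs a Perl that supports it.

```perl
use strict;
use warnings;

# Assumed German word characters in ISO-8859-1: ASCII \w plus
# A-umlaut \304, O-umlaut \326, U-umlaut \334, sharp s \337,
# a-umlaut \344, o-umlaut \366, u-umlaut \374.
# Single quotes keep the backslashes for the regex engine to interpret.
my $w = '\w\304\326\334\337\344\366\374';

# Hand-rolled replacements for \b: a word start has no word character
# before it, a word end has no word character after it.
my $word_start = qr/(?<![$w])(?=[$w])/;
my $word_end   = qr/(?<=[$w])(?![$w])/;

# Sample text (an assumption): "Ueber die Strasse" with U-umlaut and sharp s.
my $text  = "\334ber die Stra\337e";
my @words = $text =~ /$word_start([$w]+)$word_end/g;

print scalar(@words), " words found\n";
```

Keeping the class in one string means the word-character set is defined once and reused in both assertions.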

Ian



Tue, 02 Jun 1998 03:00:00 GMT  

: The term "alphanumeric" (regular expression \w) is defined as
: [0-9_A-Za-z], i.e. the alphanumeric characters of the *English*
: language.  For processing other languages using ISO-8859 code, one
: would need [0-9_A-Za-z\300-\326\330-\366\370-\377].
:
: For \w, this is no problem but emulating the effect of \b is rather
: tedious.  Hence my question:
:
: Is there a way in perl to define the set of characters that is
: considered alphanumeric, regarding regular expressions \b and \w?

Yes, use POSIX::setlocale().  In fact, 5.002beta1 will call setlocale()
for you.
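
A minimal sketch of the locale route follows. Two assumptions: the locale name 'de_DE.ISO8859-1' is only an example and differs between systems, and the "use locale" pragma is what later Perl releases require before \w and \b consult the locale at all.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use locale;   # in this scope, \w and \b follow the current LC_CTYPE locale

# Assumed locale name; setlocale returns undef if the system lacks it.
my $locale = setlocale(LC_CTYPE, 'de_DE.ISO8859-1');

# "Strasse" spelled with sharp s (\337) as an ISO-8859-1 byte.
my @words = "Stra\337e" =~ /(\w+)/g;

if (defined $locale) {
    print "locale $locale: ", scalar(@words), " word(s) matched\n";
} else {
    print "locale not available; \\w falls back to the startup locale\n";
}
```

Checking the return value matters because a failed setlocale() leaves the previous locale in force without raising an error.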

Larry



Sat, 06 Jun 1998 03:00:00 GMT  
 