Converting numerical character references to Unicode 

Say I'm given a hexadecimal numerical character reference:
"&#x00A9;". How can I turn this into the equivalent unicode
character: "Â©"? It's not in XML or HTML data so I can't
use a parser. Something like print "\x{00A9}"; only works
at compile time, it doesn't take a variable as an argument.
I couldn't figure out how to do it using chr() as the
documentation suggests. Any help would be appreciated.


Tue, 22 Nov 2005 23:31:10 GMT  

Quote:

> Say I'm given a hexadecimal numerical character reference:
> "&#x00A9;". How can I turn this into the equivalent unicode
> character: "Â©"? It's not in XML or HTML data so I can't
> use a parser. Something like print "\x{00A9}"; only works
> at compile time, it doesn't take a variable as an argument.
> I couldn't figure out how to do it using chr() as the
> documentation suggests. Any help would be appreciated.

One idea:

     my $string = '&#x00A9;';
     print chr oct $1 if $string =~ /&#(\w+);/;

/ Gunnar

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl



Wed, 23 Nov 2005 00:43:41 GMT  

Quote:


> > Say I'm given a hexadecimal numerical character reference:
> > "&#x00A9;". How can I turn this into the equivalent unicode
> > character: "Â©"?

> One idea:

>     my $string = '&#x00A9;';
>     print chr oct $1 if $string =~ /&#(\w+);/;

> / Gunnar

Nah, doesn't work. It just prints out © and not Â©.


Wed, 23 Nov 2005 17:44:52 GMT  

Quote:



>>> Say I'm given a hexadecimal numerical character reference:
>>> "&#x00A9;". How can I turn this into the equivalent unicode
>>> character: "Â©"?

>> One idea:

>>     my $string = '&#x00A9;';
>>     print chr oct $1 if $string =~ /&#(\w+);/;

> Nah, doesn't work. It just prints out © and not Â©.

That may be because the Unicode value 'x00A9' represents just the
character ©. Unicode for Â is 'x00C2'. I thought that the occurrence
of the Â character in your question was a typo. ;-)

/ Gunnar

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl



Wed, 23 Nov 2005 18:21:19 GMT  
On Fri, Jun 6, Arvin Portlock inscribed on the eternal scroll:

| Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Quote:
> Say I'm given a hexadecimal numerical character reference:
> "&#x00A9;". How can I turn this into the equivalent unicode
> character: "Â©"?

I'm afraid we already have some evidence here that you're confused.
That's OK: it's a common enough condition, but I'd suggest you need
a bit of reading-around the topic first.

What you posted there was a pair of octets (bytes) which, due to the
posting's MIME header, allegedly represented a pair of iso-8859-1
8-bit characters.

What I suspect you _intended_ to post was a single Unicode character
represented in utf-8 encoding (this particular character needs two
octets in utf-8 coding: utf-8 characters can require one, two or more
octets; the original design allowed up to six, though RFC 3629 now
limits valid sequences to four).
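
To make the two-octet point concrete, here is a minimal sketch using the core Encode module. The helper name octets_hex is made up for this illustration and is not part of any library. It shows the octets that U+00A9 becomes under each encoding, and what happens when the utf-8 octets are read back as iso-8859-1.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: the octets of a character string in the given
# encoding, as space-separated hex pairs.
sub octets_hex {
    my ($encoding, $chars) = @_;
    return join ' ', map { sprintf '%02X', $_ }
                     unpack 'C*', encode($encoding, $chars);
}

my $copyright = chr(0xA9);    # one character, U+00A9

print octets_hex('ISO-8859-1', $copyright), "\n";    # A9
print octets_hex('UTF-8',      $copyright), "\n";    # C2 A9

# Reading the two utf-8 octets as iso-8859-1 yields two characters,
# U+00C2 followed by U+00A9, which is the pair seen in the question.
my $misread = decode('ISO-8859-1', encode('UTF-8', $copyright));
printf "%d characters: %s\n", length($misread),
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $misread;
```

Printing hex pairs rather than the characters themselves keeps the result independent of how the terminal or newsreader interprets the octets.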

You need to know that a "unicode character" is a theoretical concept,
that can exist in various embodiments.

Current versions of Perl work internally with unicode characters, when
necessary.  When you want to express such a character externally, you
need to make it clear just how you want it coded.

I'd recommend starting with perluniintro,
http://www.perldoc.com/perl5.8.0/pod/perluniintro.html
and if the answer to your question isn't then obvious,
you might want to dip into
http://www.perldoc.com/perl5.8.0/pod/perlunicode.html

Even if you don't want to struggle that far - at least you'd be in a
better position to describe what you want to achieve (I was only
guessing when I surmised that you wanted utf-8-coded output).

Quote:
> Something like print "\x{00A9}"; only works
> at compile time, it doesn't take a variable as an argument.
> I couldn't figure out how to do it using chr() as the
> documentation suggests.

Now, here you have me confused.  I think you owe us a short test
program which runs without errors or warnings, and demonstrates what
you were asserting.

Both of them should produce the same internal representation.  If you
output them in a situation where the Perl system thinks you want
iso-8859-1 coding as output, then you should get a single octet of
value hex A9, but this, you say, is not what you want.  If you tell
the Perl system that you want utf-8 coded output, then you should get
utf-8-coded output.

Code snippet:

my $foo = chr(169);
print "|\x{00A9}|$foo|\n";

binmode STDOUT, ":utf8";
print "|\x{00A9}|$foo|\n";

Output (here I'm pasting raw octets into my iso-8859-1-coded
posting, just what I was complaining about before, but it's to
make a point):

|©|©|
|Â©|Â©|

But I urge you, don't just paste this into your solution: take a
moment to understand what you're doing - it'll stand you in good stead
later.

cheers



Wed, 23 Nov 2005 18:28:07 GMT  
Yup, that works, Thanks! For some reason I didn't think the binmode
thing was the way to do it in Perl 5.8. I tried a whole variety
of things using Unicode::String, too numerous to post here. Binmode
just seemed so perl 5.6. Now I know.

my $string = '&#x0391;'; # Greek capital Alpha
my ($num) = $string =~ /&#x([0-9A-Fa-f]+);/; # hex digits, so 00A9 matches too
my $char = chr(hex($num));

binmode STDOUT, ":utf8";
print "|\x{0391}|$char|\n";

|Α|Α|
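
The same idea can be extended to a whole string containing several references. The following is only a sketch under some assumptions: the sub name decode_ncr is invented for this example, it accepts both hexadecimal (&#xHHHH;) and decimal (&#DDD;) forms, and it makes no attempt to reject values that are not valid code points.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every hexadecimal or decimal numeric
# character reference in a string with the character it names.
sub decode_ncr {
    my ($text) = @_;
    # $1 holds the hex digits, $2 the decimal digits; only one is set.
    $text =~ s/&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));/chr(defined $1 ? hex($1) : $2)/ge;
    return $text;
}

# Ask for utf-8 coded output, as in the snippet quoted in this thread.
binmode STDOUT, ':utf8';
print decode_ncr('&#x00A9; 2005, &#x0391; and &#169;'), "\n";
```

Using hex() for the hexadecimal branch avoids relying on oct() to recognise a leading "x" or "0x", although both approaches give the same number.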

Quote:

> Code snippet:

> my $foo = chr(169);
> print "|\x{00A9}|$foo|\n";

> binmode STDOUT, ":utf8";
> print "|\x{00A9}|$foo|\n";



Wed, 23 Nov 2005 19:15:15 GMT  
On Sat, Jun 7, Arvin Portlock inscribed on the eternal scroll:

Quote:
> Yup, that works, Thanks! For some reason I didn't think the binmode
> thing was the way to do it in Perl 5.8.

You can put the coding discipline on 'open' if you want, rather than
binmode.  The tutorial goes into detail, as I said.
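
As a rough illustration of putting the layer on 'open': the sketch below writes the character to a file and reads the raw octets back. The file name copyright.txt is a made-up placeholder, and the :encoding(UTF-8) layer is one possible spelling of the discipline, not necessarily the one the tutorial uses.

```perl
use strict;
use warnings;

# Hypothetical output file name, used only for this illustration.
my $path = 'copyright.txt';

# Attach the encoding layer at open time instead of calling binmode later.
open my $out, '>:encoding(UTF-8)', $path or die "Cannot open $path: $!";
print {$out} chr(0xA9), "\n";
close $out or die "Cannot close $path: $!";

# Read the file back as raw octets to see what was actually written.
open my $in, '<:raw', $path or die "Cannot open $path: $!";
my $octets = do { local $/; <$in> };
close $in or die "Cannot close $path: $!";

# Show the octets as hex; the character itself should appear as C2 A9.
print join(' ', map { sprintf '%02X', $_ } unpack 'C*', $octets), "\n";
```

The effect on the written octets is the same as with binmode; the difference is only that the coding is declared where the handle is created.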

Quote:
> I tried a whole variety of things using Unicode::String, too
> numerous to post here. Binmode just seemed so perl 5.6. Now I know.

It's also possible to communicate the required coding to Perl
implicitly via a locale setting - which has, apparently, come as a
considerable surprise to users of some recent linux distributions -
but I reckon you're better off to understand what you're doing, rather
than waving a dead chicken over it and hoping it'll all come out right
in the end.  I found the tutorial helpful, which was why I commended
it to you.

all the best.



Thu, 24 Nov 2005 02:51:36 GMT  
 