Create UTF-8 string from numbers? 

I have the following piece of code.  I expect $s to be a UTF-8
encoded string like this:

    $expected_result = "/" . chr(130) . "." .
                       "/" . chr(131) . ".";

So, here is the script:

/----[ foo.pl ]
| #!/usr/sw/perl/default/bin/perl
|
| use strict;
| use warnings;
| no bytes;
| use utf8;
|
| my %table = ( a => 130,
|               b => 131 );
|
| my $s = '/a[1]/b[2]';
|
| $s =~ s{([a-zA-Z]+)}{
|     chr($table{$1})
| }eg;
|
| $s =~ s{\[\d+\]}{.}g;
|
| print $s, "\n";
|
| 1;
\----

The script, however, prints:

/----

| 00000000  2f 82 2e 2f 83 2e 0a                              |/../...|
| 00000007
\----

Surely the UTF-8 encoding of characters number 130 and 131 are
two-byte encodings!

So, how do I make it so that $s is a UTF-8 encoded string?

tia,
kai
--
~/.signature is: umop 3p!sdn    (Frank Nobis)



Mon, 19 Jul 2004 00:19:48 GMT  

Quote:

> I have the following piece of code.  I expect $s to be a UTF-8
> encoded string like this:

>     $expected_result = "/" . chr(130) . "." .
>                        "/" . chr(131) . ".";

> So, here is the script:

[...]

Quote:
> The script, however, prints:

> /----

> | 00000000  2f 82 2e 2f 83 2e 0a                              |/../...|
> | 00000007
> \----

> Surely the UTF-8 encoding of characters number 130 and 131 are
> two-byte encodings!

> So, how do I make it so that $s is a UTF-8 encoded string?

You need to use unpack('U', char) and pack('U', code) to do this:

my $s = '/' . pack('U', 130) . './' . pack('U', 131) . '.';

        Torsten



Tue, 20 Jul 2004 02:12:25 GMT  

Quote:

> I have the following piece of code.  I expect $s to be a UTF-8
> encoded string like this:

>     $expected_result = "/" . chr(130) . "." .
>                        "/" . chr(131) . ".";

> So, here is the script:

> /----[ foo.pl ]
> | #!/usr/sw/perl/default/bin/perl
> |
> | use strict;
> | use warnings;
> | no bytes;
> | use utf8;
> |
> | my %table = ( a => 130,
> |               b => 131 );
> |
> | my $s = '/a[1]/b[2]';

$s = chr(4711) . $s;

Quote:
> | $s =~ s{([a-zA-Z]+)}{
> |     chr($table{$1})
> | }eg;
> |
> | $s =~ s{\[\d+\]}{.}g;

$s = substr($s, 1);

Quote:
> | print $s, "\n";
> |
> | 1;
> \----

The original script creates a unibyte string.  But by adding the two
lines, as indicated, I get the expected result: it prints something
which looks like the UTF-8 encoding of the characters 130 and 131.

But how do I achieve the same result without this horrible kludge?

kai
--
~/.signature is: umop 3p!sdn    (Frank Nobis)



Tue, 20 Jul 2004 22:38:57 GMT  

Quote:

> I have the following piece of code.  I expect $s to be a UTF-8
> encoded string like this:

>     $expected_result = "/" . chr(130) . "." .
>                        "/" . chr(131) . ".";

You shouldn't ;)

Perl handles strings internally as UTF-8, sometimes, but that's not
supposed to be relied on by the user AFAIK. If you want UTF-8, you'll
have to create it explicitly, using (for example) the Encode module.
This *may* be a no-op (if the string was already UTF-8 encoded), or it
may do something (if it wasn't).
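
For what it's worth, here is a minimal sketch of the explicit route, assuming the Encode module is installed (the variable names are mine, not from the original script). It builds the string from character numbers and then asks Encode for the UTF-8 octets:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Character string: six characters, two of them above 127.
my $chars = "/" . chr(130) . "." . "/" . chr(131) . ".";

# Explicitly turn the characters into UTF-8 octets.
my $octets = encode('UTF-8', $chars);

my $hex = join ' ', unpack('(H2)*', $octets);
printf "%d characters, %d octets: %s\n",
    length($chars), length($octets), $hex;
# prints: 6 characters, 8 octets: 2f c2 82 2e 2f c2 83 2e
```

The point of going through encode() is that the result does not depend on how Perl happened to store $chars internally.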

Quote:
> Surely the UTF-8 encoding of characters number 130 and 131 are
> two-byte encodings!

Yes. But if there are no characters > 255 in the string, and haven't
been during the lifetime of the string, then Perl often keeps it in the
local 8-bit encoding.

Quote:
> So, how do I make it so that $s is a UTF-8 encoded string?

I think another kludge is to pack it with 'U0U*' into itself. But you
should really be doing this explicitly rather than relying on Perl's
implementation.

Cheers,
Philip
--

That really is my address; no need to remove anything to reply.
If you're not part of the solution, you're part of the precipitate.



Wed, 21 Jul 2004 04:42:59 GMT  
[A complimentary Cc of this posting was sent to
Philip Newton

Quote:
> Perl handles strings internally as UTF-8, sometimes, but that's not
> supposed to be relied on by the user AFAIK. If you want UTF-8, you'll
> have to create it explicitly, using (for example) the Encode module.
> This *may* be a no-op (if the string was already UTF-8 encoded), or it
> may do something (if it wasn't).

Do not confuse people unnecessarily.  Perl handles strings internally in
its internal way.  Converting to UTF-8 is *never* a no-op (if there
are chars above 127 there).  E.g., converting "\x80" to UTF-8 produces
"\xC2\x80".  How could it be a no-op?!

Ilya



Wed, 21 Jul 2004 05:45:57 GMT  
On Fri, 1 Feb 2002 21:45:57 +0000 (UTC), Ilya Zakharevich

Quote:

> [A complimentary Cc of this posting was sent to
> Philip Newton

> > Perl handles strings internally as UTF-8, sometimes, but that's not
> > supposed to be relied on by the user AFAIK. If you want UTF-8, you'll
> > have to create it explicitly, using (for example) the Encode module.
> > This *may* be a no-op (if the string was already UTF-8 encoded), or it
> > may do something (if it wasn't).

> Do not confuse people unnecessarily.  Perl handles strings internally in
> its internal way.  Converting to UTF-8 is *never* a no-op (if there
> are chars above 127 there).  E.g., converting "\x80" to UTF-8 produces
> "\xC2\x80".  How could it be a no-op?!

That is not a no-op; it is a case where Perl "does something" (the
second half of my last sentence).

An example of a no-op would be if a Perl string which has the UTF-8 flag on
is to be converted to UTF-8 by the Encode module.

Since the string is already made up of UTF-8 encoded characters,
converting it to UTF-8 is a no-op. As an example, a string with
"\x{20ac}" in it is represented in Perl's guts as "\342\202\254", and
converting that to UTF-8 is a no-op. However, AFAIK there is no
guarantee that Perl will always represent "\x{20ac}" as four bytes in
memory with the values 0342 0202 0254 0; it may, for example, be three
bytes: 0x20 0xac 0x00, if Perl were to use UTF-16 internally instead of
UTF-8.

Apologies for any confusion caused, and I hope this is a bit clearer.

Cheers,
Philip
--

That really is my address; no need to remove anything to reply.
If you're not part of the solution, you're part of the precipitate.



Wed, 21 Jul 2004 19:02:26 GMT  
[A complimentary Cc of this posting was sent to
Philip Newton

Quote:
> > Do not confuse people unnecessarily.  Perl handles strings internally in
> > its internal way.  Converting to UTF-8 is *never* a no-op (if there
> > are chars above 127 there).  E.g., converting "\x80" to UTF-8 produces
> > "\xC2\x80".  How could it be a no-op?!
> An example of a no-op would be if a Perl string which has the UTF-8 flag on
> is to be converted to UTF-8 by the Encode module.

See above: since the input string is different from the output string,
this *cannot* be a no-op.

Quote:
> Since the string is already made up of UTF-8 encoded characters,

This is your principal confusion: Perl strings are (modulo a very few
bugs) made of *characters*, not *blah-blah-blah characters*.  A
*character* is just an integer, nothing more, nothing less.

Quote:
> As an example, a string with "\x{20ac}" in it is represented in
> Perl's guts as "\342\202\254",

a) It is not; [It is represented as an SV* which points to SV, which -
   among other info - contains a pointer to (say) XPVNV, which - among
   other info - contains a pointer to a memory region which may - or
   may not - contain the bytes you mention]

b) Who cares how it is represented in the guts anyway? [Remember that
   this memory region may, depending on the OS state, be represented
   by some magnetization states on the HD media and/or some charges
   accumulated in the silicon chips]
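
A small sketch of the "a character is just an integer" view, assuming a reasonably recent Perl (5.8 or later). The names $narrow and $wide are only illustrative. utf8::upgrade changes how the string is stored, yet nothing visible at the character level changes:

```perl
use strict;
use warnings;

my $narrow = "/" . chr(130) . ".";
my $wide   = $narrow;
utf8::upgrade($wide);    # changes the internal storage only

# Compare the two strings as sequences of integers.
my $same_string = $narrow eq $wide ? 1 : 0;
my @narrow_ords = map { ord($_) } split //, $narrow;
my @wide_ords   = map { ord($_) } split //, $wide;

print "equal: $same_string\n";
print "narrow: @narrow_ords (flag ", (utf8::is_utf8($narrow) ? 1 : 0), ")\n";
print "wide:   @wide_ords (flag ", (utf8::is_utf8($wide) ? 1 : 0), ")\n";
```

Both lines should list 47 130 46; only the flag reported by utf8::is_utf8 differs, and that flag is an implementation detail rather than a property of the string.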

Ilya



Thu, 22 Jul 2004 06:12:02 GMT  

Quote:


>> So, how do I make it so that $s is a UTF-8 encoded string?

> You need to use unpack('U', char) and pack('U', code) to do this:

> my $s = '/' . pack('U', 130) . './' . pack('U', 131);

I actually found out about this myself, and I was very happy.  For
about ten minutes.  Because that was when I found out that substr
was doing the "wrong thing": it used byte indexes, not character
indexes.

So after doing, say, $s = pack("U*", 131, 132, 133) I expected
unpack("U", substr($s, -1)) to return 133.  Not so, alas.
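
For comparison, a sketch of the same experiment as I would expect it to behave on a current Perl (5.8 or later), where substr and length count characters for strings built with pack("U*"). This says nothing about the older interpreter used in the post:

```perl
use strict;
use warnings;

# Three characters, built from their code points.
my $s = pack("U*", 131, 132, 133);

# substr() and length() work on characters here, not on bytes.
my $last_char = substr($s, -1);
my $last_code = ord($last_char);

printf "length=%d last=%d\n", length($s), $last_code;
# prints: length=3 last=133
```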

Maybe I should explain in more detail what I'm trying to do.  This
will be in another posting.

kai
--
~/.signature is: umop 3p!sdn    (Frank Nobis)



Fri, 23 Jul 2004 00:47:36 GMT  

Quote:

> Perl handles strings internally as UTF-8, sometimes, but that's not
> supposed to be relied on by the user AFAIK. If you want UTF-8, you'll
> have to create it explicitly, using (for example) the Encode module.
> This *may* be a no-op (if the string was already UTF-8 encoded), or it
> may do something (if it wasn't).

I don't have the Encode module, yet (says perlindex), but I'm going
to have it installed.  But before I fall into another trap, one more
question:

What I actually want to do is to encode a structure into a string
which looks like this: /foo[1]/bar[2]/baz[3]/#PCDATA[4].
This identifies a node in an XML DOM tree: from the root, take the
first foo child, then the second bar child, and then the third baz
child, and of that the fourth text node child.

Clearly, the element names (and #PCDATA) can be encoded in numbers,
and the indices are already numbers.  So, I can convert this into an
array like this:

    (131, 1, 132, 2, 133, 3, 1, 4)

Here, 1 is the number for #PCDATA, 131, 132 and 133 are the numbers
for the various element names.  Now I want to encode each integer
into a character, so that the whole array becomes a string.

And then I have "path conditions" (think XPath) like /foo/*//#PCDATA.
It is clear that these thingies can easily be converted to a regular
expression, to be matched against the string as produced above.  I
just convert the element names into the appropriate character, match
each index with ".", the wildcard "*" is also converted to ".", and so
on.

But the problem is, if the original string contains a number greater than 255,
then the original string will be in UTF-8 format.  But if the regular
expression does not contain a number greater than 255, the regular
expression will be in single-byte format.  And then the regular
expression does not match the original string.

In addition to the regular expression matching, I also want to access
parts of the string directly.  For example, I might wish to chop off
the last two characters with $s = substr($s, 0, -2) or similar.

But if I produce the string with pack("U*", 131, 1, 132, ...), then
Perl will think of it as a number of bytes, so that substr shaves
off bytes rather than characters.  (I tried this.)

So, does the Encode module guarantee that the resulting string will
be considered as UTF-8 string by Perl so that substr works on
characters, not bytes?
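
To make the scenario concrete, here is a rough sketch with made-up numbers (300 for one element, 132 for another; neither comes from a real table). The path string contains a character above 255, the pattern does not, and on a current Perl the match and the substr both work on characters:

```perl
use strict;
use warnings;

# Path: element 300, index 1, element 132, index 2 (illustrative values).
my $path = join '', map { chr($_) } 300, 1, 132, 2;

# Pattern built only from characters below 256: "element 132, any index".
my $elem = chr(132);
my $re   = qr/\Q$elem\E.\z/s;    # /s so that "." also matches chr(10)

my $matched = $path =~ $re ? 1 : 0;

# Chop off the last step (two characters).
my $parent = substr($path, 0, -2);

printf "matched=%d parent_length=%d first=%d\n",
    $matched, length($parent), ord($parent);
# prints: matched=1 parent_length=2 first=300
```

The /s modifier matters for this encoding: without it, an index of 10 would turn into a newline character that "." refuses to match.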

kai

PS: Some characters in regular expressions are special, so I'm adding
    an offset to all numbers before converting them into characters.

--
~/.signature is: umop 3p!sdn    (Frank Nobis)



Fri, 23 Jul 2004 01:02:12 GMT  
[snip]

Quote:
> What I actually want to do is to encode a structure into a string
> which looks like this: /foo[1]/bar[2]/baz[3]/#PCDATA[4].
> This identifies a node in an XML DOM tree: from the root, take the
> first foo child, then the second bar child, and then the third baz
> child, and of that the fourth text node child.

> Clearly, the element names (and #PCDATA) can be encoded in numbers,
> and the indices are already numbers.  So, I can convert this into an
> array like this:

>     (131, 1, 132, 2, 133, 3, 1, 4)

> Here, 1 is the number for #PCDATA, 131, 132 and 133 are the numbers
> for the various element names.  Now I want to encode each integer
> into a character, so that the whole array becomes a string.

> And then I have "path conditions" (think XPath) like /foo/*//#PCDATA.
> It is clear that these thingies can easily be converted to a regular
> expression, to be matched against the string as produced above.  I
> just convert the element names into the appropriate character, match
> each index with ".", the wildcard "*" is also converted to ".", and so
> on.

I would expect wildcard "*" would be converted to ".." or ".*", since
you want to match not just things like "bar" or "[1]", but "bar[1]".

Quote:
> But the problem is, if the original string contains a number greater
> 255, then the original string will be in UTF-8 format.  But if the
> regular expression does not contain a number greater than 255, the
> regular expression will be in single-byte format.

This in and of itself is not a problem...

Quote:
> And then the regular expression does not match the original string.

But this is!
You could have gotten much faster help if you'd mentioned this.

Your problem isn't "I need the string to be encoded in utf8", your
problem is that "Regular expressions seem to see strings produced by
pack("U") strings as bytes, rather than chars."

What version of perl are you using?  If it's perl5.6.0, try upgrading to
perl5.6.1 (it's quite possible that it's broken in one and fixed in the
other).  If you're already using 5.6.1, then look into using the latest
working version of 5.7.2 (aka bleadperl).

Quote:
> In addition to the regular expression matching, I also want to access
> parts of the string directly.  For example, I might wish to chop off
> the last two characters with $s = substr($s, 0, -2) or similar.

> But if I produce the string with pack("U*", 131, 1, 132, ...), then
> Perl will think of it as a number of bytes, so that substr shaves
> off bytes rather than characters.  (I tried this.)

Again, this seems to be the same problem: perl converting the data to
utf8 format internally, but not setting the utf8 flag.

Quote:
> So, does the Encode module guarantee that the resulting string will
> be considered as UTF-8 string by Perl so that substr works on
> characters, not bytes?

If the input is not flagged as utf8, it will perform an ascii -> utf8
conversion, and set the utf8 flag.

If your input is already in utf8, but it doesn't have the utf8 flag,
then you'll have one conversion too many.
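
A hedged illustration of "one conversion too many", using the Encode module as it ships with current Perls (its behaviour in 2002 development releases may have differed). encode() goes from characters to octets and decode() goes back; encoding something that is already UTF-8 octets encodes it a second time:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char  = chr(130);
my $once  = encode('UTF-8', $char);    # characters -> octets
my $twice = encode('UTF-8', $once);    # octets treated as characters again
my $back  = decode('UTF-8', $once);    # octets -> characters

my $once_hex  = unpack('H*', $once);
my $twice_hex = unpack('H*', $twice);

print "once:  $once_hex\n";     # c282
print "twice: $twice_hex\n";    # c382c282
printf "round trip: %d character, code %d\n", length($back), ord($back);
```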

Quote:
> PS: Some characters in regular expressions are special, so I'm adding
>     an offset to all numbers before converting them into characters.

Fair enough.

I would suggest the following:

my $dom_node = q[/foo[1]/bar[2]/baz[3]/#PCDATA[4]];
my %nums; # translate names to numbers.

# Encode each path step as two characters: element name, then index.
my $packed = join "", map chr($_ + 0x20_0000),
    map {
        my $num = s!\[(\d+)\]\z!! ? $1 : 0;
        $nums{$_} = keys %nums unless exists $nums{$_};
        $nums{$_}, $num
    } $dom_node =~ m[[^/]*]g;

my $pattern = q[/foo/*//#PCDATA];

# Build the regex source the same way; "." stands in for a wildcard.
my $realpat = join "", map {
    my $num = s!\[(\d+)\]\z!! ? chr($1 + 0x20_0000) : ".";
    my $name = length($_) && $_ ne "*" ? do {
        $nums{$_} = keys %nums unless exists $nums{$_};
        chr($nums{$_} + 0x20_0000)
    } : ".";
    $name, $num
} $pattern =~ m[[^/]*]g;

$realpat = qr/^$realpat\z/;

print( $packed =~ $realpat ? "matched\n" : "didn't match\n" );

my %names = reverse %nums;

# Decode the packed string back into readable form.
# (The loop header was lost in the archived post; this one is reconstructed.)
for (my $copy = $packed) {
    s/(.)(.)/
        $names{ord($1)-0x20_0000} .
        "[". (ord($2)-0x20_0000) ."]"
    /ge;
    print $_, "\n";
}

Caveat Lector: This code is untested.
I *think* but am not sure, that chr() is smart enough to set the utf8
flag when necessary, even if pack("U") isn't quite so smart.

--
A child of 5 could understand this!  Fetch me a child of 5



Mon, 26 Jul 2004 10:30:49 GMT  
Hi, Ilia! Privet :-)

Quote:
> > As an example, a string with "\x{20ac}" in it is represented in
> > Perl's guts as "\342\202\254",

> a) It is not; [It is represented as an SV* which points to SV, which -
>    among other info - contains a pointer to (say) XPVNV, which - among
>    other info - contains a pointer to a memory region which may - or
>    may not - contain the bytes you mention]

> b) Who cares how it is represented in the guts anyway? [Remember that
>    this memory region may, depending on the OS state, be represented
>    by some magnetization states on the HD media and/or some charges
>    accumulated in the silicon chips]

> Ilya

Thanks for answering things on the mail-list, it often sheds
light on complicated questions ;-)

If you will allow me, I would like to express my confusion about
your statement "who cares". My question is: what is actually
written to output files? Streams of octets, I guess.
But what are these streams of octets?

Aren't they the _real internal Perl's_ representation
of the strings? If not what are they? What encoding?
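
One way to look at this question on a current Perl, as a sketch rather than a description of the 2002 interpreters: what reaches the file is decided by the I/O layer on the filehandle, not by the internal storage. The example writes to in-memory filehandles so the octets can be inspected; the variable names are mine.

```perl
use strict;
use warnings;

my $s = "/" . chr(130) . ".";
utf8::upgrade($s);    # stored internally as UTF-8 from here on

# No encoding layer: each character below 256 goes out as one octet.
open my $raw, '>:raw', \my $raw_buf or die "open failed: $!";
print {$raw} $s;
close $raw;

# Explicit layer: characters are encoded to UTF-8 on the way out.
open my $enc, '>:encoding(UTF-8)', \my $enc_buf or die "open failed: $!";
print {$enc} $s;
close $enc;

my $raw_hex = unpack('H*', $raw_buf);
my $enc_hex = unpack('H*', $enc_buf);
print "raw:     $raw_hex\n";    # 2f822e
print "encoded: $enc_hex\n";    # 2fc2822e
```

So the octets in the file are whatever the layer produces. They coincide with the internal representation only by accident.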

TIA for your kind reply, Anton Tagunov

---



Wed, 11 Aug 2004 10:34:35 GMT  
Is there a recommended "best WTDI" for converting between 12- and 24-hour
time?  i.e. I'll have a string in one format or the other and want to
convert.  I can always subtract (or add) 1200 and juggle the colons by hand.
Does anyone have  a "cleaner" suggestion?

The code already uses Date::Format and it looked promising but it seems to
want a full "seconds since epoch" string and that's not what I've got.

(The problem in a nutshell: given a start and end time in 24-hour clock
format, generate a list of times at 30 minute intervals in 12-hour clock
format suitable for presenting to the user in an HTML menu)
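
A possible approach, sketched with plain arithmetic and no date modules. It assumes the times arrive as "HH:MM" strings on the same day; the helper names to_12h and half_hour_slots are made up for this example.

```perl
use strict;
use warnings;

# Convert minutes since midnight to a 12-hour clock string.
sub to_12h {
    my ($minutes) = @_;
    my $h      = int($minutes / 60) % 24;
    my $m      = $minutes % 60;
    my $suffix = $h < 12 ? 'AM' : 'PM';
    my $h12    = $h % 12 || 12;    # 0 and 12 both display as 12
    return sprintf '%d:%02d %s', $h12, $m, $suffix;
}

# List every 30-minute step from $start to $end inclusive ("HH:MM", 24-hour).
sub half_hour_slots {
    my ($start, $end) = @_;
    my ($sh, $sm) = split /:/, $start;
    my ($eh, $em) = split /:/, $end;
    my @slots;
    for (my $t = $sh * 60 + $sm; $t <= $eh * 60 + $em; $t += 30) {
        push @slots, to_12h($t);
    }
    return @slots;
}

my @menu = half_hour_slots('11:00', '13:00');
print join(', ', @menu), "\n";
# prints: 11:00 AM, 11:30 AM, 12:00 PM, 12:30 PM, 1:00 PM
```

Working in minutes since midnight keeps the 12/24 conversion in one place, so the colons never need juggling by hand.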

--
- Vicki

Vicki Brown     ZZZ                  Journeyman Sourceror:
P.O. Box 1269      zz  |\     _,,,---,,_        Scripts & Philtres
San Bruno, CA       zz /,`.-'`'    -.  ;-;;,_     Perl, Unix, MacOS
94066     USA         |,4-  ) )-,_. ,\ ( `'-'



Sat, 14 Aug 2004 06:33:01 GMT  
 