Unicode in Perl 5.6 - broken?

I just started experimenting with the new Unicode features
of Perl 5.6 using the ActiveState Perl 5.6 in a Windows 98
environment. I found some serious problems. Either my
understanding of a few things is severely broken, or else
Perl 5.6 is.

My main problem concerns upper/lower case translation.
Things just don't behave in an understandable way, and
some things seem seriously broken.

To summarize the problems I've met I concocted the following
CGI script. The result is meant to be read with a browser
that can deal with UTF-8, and that can show the necessary
glyphs for characters from Latin-1 Supplement and Latin
Extended A.

The code uses UTF-8 and won't make much sense if your
reader can't interpret UTF-8.

-------------------------------------------------------

#!/usr/bin/perl

use strict;
use utf8;

my ($string, $upper_string1, $upper_string2);

$string = "abc ABC ??? ??? ?????? ??????";

$upper_string1 = $string;
$upper_string1 =~ s/(\w)/\U$1\E/g;

$upper_string2 = $string;
$upper_string2 = uc($upper_string2);

my ($read_string, $upper_read_string1, $upper_read_string2);

# THE FILE "read_string.txt" SHOULD CONTAIN THE SAME
# LINE OF CHARACTERS AS "$string" ABOVE - ENCODED IN UTF-8.
open(READ, "read_string.txt") or die "Cannot open read_string.txt: $!";
$read_string = <READ>;
close READ;

$upper_read_string1 = $read_string;
$upper_read_string1 =~ s/(\w)/\U$1\E/g;

$upper_read_string2 = $read_string;
$upper_read_string2 = uc($upper_read_string2);

print qq(Content-type: text/html; charset=utf-8\n\n);

print qq(<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n);
print qq(    " http://www.*-*-*.com/ ;>\n);
print qq(<html xmlns=" http://www.*-*-*.com/ ;>\n);
print qq(<head>\n);
print qq(<title>Test</title>\n);
print qq(</head>\n);
print qq(<body>\n);

print qq(<p>STRING: $string</p>\n);
print qq(<p>UPPERCASED STRING 1:  $upper_string1</p>\n);
print qq(<p>UPPERCASED STRING 2:  $upper_string2</p>\n);

print qq(<p>READ STRING: $read_string</p>\n);
print qq(<p>UPPERCASED READ STRING 1:  $upper_read_string1</p>\n);
print qq(<p>UPPERCASED READ STRING 2:  $upper_read_string2</p>\n);

print qq(</body>\n);
print qq(</html>\n);

-------------------------------------------------------

If you manage to get this code to work in a CGI environment,
you'll see that the uppercasing of the string turns out in
several different ways, and that it is correct only half of the
time. The "\U" method works only for the string that is read from
a file. The "uc()" method works only for the other string (declared
in the Perl code).

This is what I get in my browser:

-------------------------------------------------------

STRING: abc ABC ??? ??? ?????? ??????

UPPERCASED STRING 1: AFC ABC ??1? 01?01 ??1??1??1 ?01?01?01

UPPERCASED STRING 2: ABC ABC ??? ??? ?????? ??????

READ STRING: abc ABC ??? ??? ?????? ??????

UPPERCASED READ STRING 1: ABC ABC ??? ??? ?????? ??????

UPPERCASED READ STRING 2: ABC ABC ??? ??? ?????? ??????

-------------------------------------------------------

The second line is particularly horrible. Note that "b" is
uppercased to "F"! What is going on here?

If the line "use utf8" is removed things look a bit less broken at
first glance, but actually uppercasing of non-ASCII characters
doesn't work at all without the utf8 pragma.

Is it me or is it perl?

My Perl version says:

  This is perl, v5.6.0 built for MSWin32-x86-multi-thread
  (with 1 registered patch, see perl -V for more detail)

  Copyright 1987-2000, Larry Wall

  Binary build 613 provided by ActiveState Tool Corp.
  http://www.*-*-*.com/
  Built 12:36:25 Mar 24 2000

--
#####################################################################
                         Bertilo Wennergren
                 < http://www.*-*-*.com/ ;

#####################################################################



Fri, 18 Oct 2002 03:00:00 GMT  

Quote:

># THE FILE "read_string.txt" SHOULD CONTAIN THE SAME
># LINE OF CHARACTERS AS "$string" ABOVE - ENCODED IN UTF-8.
>open(READ,"read_string.txt");
>$read_string = <READ>;
>close READ;

Perl version 5.6.0 can create and write UTF-8 character strings, but it
currently cannot read them in from files.  $read_string will be a
string of 8-bit bytes, not a bunch of Unicode characters.

        -Joe
--
See http://www.inwap.com/ for PDP-10 and "ReBoot" pages.
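The difference between a string of 8-bit bytes and a string of Unicode characters can be seen in a small sketch. It assumes a later Perl (5.8 or newer) where the Encode module is part of the core distribution, so it does not run on the 5.6.0 build discussed in this thread. The sample character is an illustration and is not taken from the original test string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of U+00E9 (e with acute), written as raw bytes,
# the way it arrives from a filehandle that has no encoding layer.
my $bytes = "\xC3\xA9";

# Decoding turns the two bytes into a single character.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d\n", length $bytes;   # 2
printf "chars: length %d\n", length $chars;   # 1

# uc() leaves the undecoded bytes alone ...
print "uc(bytes) unchanged: ", (uc($bytes) eq $bytes ? "yes" : "no"), "\n";

# ... but maps the decoded character U+00E9 to U+00C9.
printf "uc(chars) is U+%04X\n", ord uc $chars;
```

This is consistent with the symptom reported above: uppercasing a string read from the file leaves the non-ASCII letters untouched because Perl sees them as unrelated bytes.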



Mon, 21 Oct 2002 03:00:00 GMT  

"Joe Smith":

Quote:


> ># THE FILE "read_string.txt" SHOULD CONTAIN THE SAME
> ># LINE OF CHARACTERS AS "$string" ABOVE - ENCODED IN UTF-8.
> >open(READ,"read_string.txt");
> >$read_string = <READ>;
> >close READ;
> Perl version 5.6.0 can create and write UTF-8 character strings, but it
> currently cannot read them in from files.  $read_string will be a
> string of 8-bit bytes, not a bunch of Unicode characters.

True. That part of the problem has been explained. But there were
strange results with the part that used a string declared inside
of the script too.

Anyway: Is there a way to convert a string of bytes read from
a file so that the string becomes what it should have been:
a UTF-8 encoded string?

--
#####################################################################
                         Bertilo Wennergren
                 <http://purl.oclc.org/net/bertilo>

#####################################################################



Mon, 21 Oct 2002 03:00:00 GMT  
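For the conversion asked about above, a minimal sketch follows. It assumes Perl 5.8 or later, where the Encode module and PerlIO encoding layers are available; neither is part of the 5.6.0 build used in this thread. The temporary file and its contents are made up for the example and stand in for "read_string.txt".

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Create a small UTF-8 test file (hypothetical stand-in for read_string.txt).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "abc \xC3\xA6\xC3\xB8\xC3\xA5\n";   # "abc" plus U+00E6 U+00F8 U+00E5 as UTF-8 bytes
close $tmp_fh or die "Cannot close $path: $!";

# Option 1: read raw bytes, then decode them explicitly.
open my $raw, '<:raw', $path or die "Cannot open $path: $!";
my $line1 = decode('UTF-8', scalar <$raw>);
close $raw;

# Option 2: let an I/O layer decode while reading.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $line2 = <$in>;
close $in;

# Both are now character strings, so uc() handles the non-ASCII letters too.
my $upper = uc $line1;

# Encode again on output so the browser or terminal receives UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print $upper;
```

Option 2 is usually the less error-prone choice, since every read from the handle is decoded and no string can be left undecoded by accident.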
 