Questions about Perl's Unicode Model 

Hi,

Perl's documentation says the following about Perl's Unicode model:

<snip>
Perl's Unicode Model
Perl supports both pre-5.6 strings of eight-bit native bytes, and strings of
Unicode characters. The principle is that Perl tries to keep its data as
eight-bit bytes for as long as possible, but as soon as Unicodeness cannot
be avoided, the data is transparently upgraded to Unicode.

Internally, Perl currently uses either whatever the native eight-bit
character set of the platform (for example Latin-1) is, defaulting to UTF-8,
to encode Unicode strings. Specifically, if all code points in the string
are 0xFF or less, Perl uses the native eight-bit character set. Otherwise,
it uses UTF-8.

<snip>
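
The upgrade described in that excerpt can be observed directly. This is only a minimal sketch, assuming Perl 5.8.1 or later; the variable names are made up for illustration, and utf8::is_utf8() reports only the internal representation, not whether the data is "correct".

```perl
use strict;
use warnings;

# All code points are <= 0xFF, so Perl keeps the native eight-bit form.
my $s = "caf\xE9";
my $flag_before = utf8::is_utf8($s) ? 1 : 0;

# Appending a code point above 0xFF forces the transparent upgrade.
$s .= "\x{263A}";
my $flag_after = utf8::is_utf8($s) ? 1 : 0;

# The character count is unaffected by the change of representation.
my $chars = length $s;

print "internal UTF-8 before: $flag_before, after: $flag_after, ",
      "characters: $chars\n";
```

The string still has five characters after the upgrade; only its internal storage has changed.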

I want to read some locally encoded (SJIS) data from a file and I want to
perform string manipulation functions on this data. Currently the string
manipulation functions fail (upper/lower) because the SJIS data is
incorrectly read as UTF8.

Can anyone help here? Is there a method of disabling this default behaviour?

Thanks in advance.

Regards

~Rashmi



Mon, 26 Sep 2005 14:30:46 GMT  

Quote:

> Hi,

> Perl's documentation says the following about Perl's Unicode model:

> <snip>
[snip :) ]
> <snip>

> I want to read some locally encoded (SJIS) data from a file and I want to
> perform string manipulation functions on this data. Currently the string
> manipulation functions fail (upper/lower) because the SJIS data is
> incorrectly read as UTF8.

> Can anyone help here? Is there a method of disabling this default behaviour?

You could try playing around with binmode, e.g. something like:
  open(my $fh, "<", "/path/to/file") or die "Cannot open file: $!";
  binmode($fh, ":utf8");

See if that helps.

G'luck,
Peter



Mon, 26 Sep 2005 23:46:52 GMT  
On Thu, Apr 9, Rashmi Dixit inscribed on the eternal scroll:

Quote:
> I want to read some locally encoded (SJIS) data from a file and I want to
> perform string manipulation functions on this data.

I'm no expert on shift_jis or indeed on any CJK coding, but I think I
can take a shot at pointing you to the relevant documentation, at
least.  Assuming Perl >= 5.8.0.

Perl would like to work in utf-8, so the natural thing to do would
be to read your data with the appropriate :encoding discipline in
effect.  See http://www.perldoc.com/perl5.8.0/pod/perluniintro.html
or your local copy of the documentation.

Quote:
> Currently the string
> manipulation functions fail (upper/lower) because the SJIS data is
> incorrectly read as UTF8.

Quite.

Quote:
> Can anyone help here? Is there a method of disabling this default behaviour?

Of course (clue: straightforward binmode() ), but then you would need
to handle the shift_jis encoding yourself.  I'd imagine it would be
less work to normalise the external coding into utf-8 coding for
internal working, and let Perl's own string functions cope with the
result.  Output (assuming you want shift_jis output too) would be the
converse of input.
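
A rough sketch of that decode, work, encode round trip, assuming the Encode module bundled with Perl 5.8 and its "shiftjis" encoding name. The helper sjis_uc() is a made-up name for illustration, and the sample bytes are the Shift_JIS codes for fullwidth "abc", written as escapes so the encoding of the script file itself does not matter.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: upper-case Shift_JIS data by going through
# Perl's internal character form instead of touching the raw bytes.
sub sjis_uc {
    my ($bytes) = @_;
    my $chars = decode('shiftjis', $bytes);   # external bytes -> characters
    return encode('shiftjis', uc($chars));    # characters -> external bytes
}

# Fullwidth small letters a, b, c (U+FF41..U+FF43) in Shift_JIS.
my $sjis_in  = "\x82\x81\x82\x82\x82\x83";
my $sjis_out = sjis_uc($sjis_in);

# Show the result as hex, since the terminal encoding is unknown.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $sjis_out;
print "$hex\n";
```

Calling uc() on the undecoded bytes is unsafe because Shift_JIS trail bytes can fall in the 0x40-0x7E range, which overlaps the ASCII letters.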


Tue, 27 Sep 2005 00:25:31 GMT  

Quote:

> Hi,

> Perl's documentation says the following about Perl's Unicode model:

> <snip>
> Perl's Unicode Model
> Perl supports both pre-5.6 strings of eight-bit native bytes, and
> strings of Unicode characters. The principle is that Perl tries to keep
> its data as eight-bit bytes for as long as possible, but as soon as
> Unicodeness cannot be avoided, the data is transparently upgraded to
> Unicode.

> Internally, Perl currently uses either whatever the native eight-bit
> character set of the platform (for example Latin-1) is, defaulting to
> UTF-8, to encode Unicode strings. Specifically, if all code points in
> the string are 0xFF or less, Perl uses the native eight-bit character
> set. Otherwise, it uses UTF-8.

> <snip>

> I want to read some locally encoded (SJIS) data from a file and I want
> to perform string manipulation functions on this data. Currently the
> string manipulation functions fail (upper/lower) because the SJIS data
> is incorrectly read as UTF8.

That's strange -- I would expect it to fail due to the SJIS encoded text
being incorrectly interpreted as latin-1 encoded text, *not* due to it
being incorrectly interpreted as utf8 encoded text.

What platform are you on?

Quote:
> Can anyone help here? Is there a method of disabling this default
> behaviour?

To alter the default open mode for text files, you can do:
   use open ":encoding(shiftjis)";
This will make all newly opened filehandles be interpreted as sjis,
unless you explicitly open them as something else.  For example:
   open( my $fh, "<:raw", $name ) or die "Cannot open $name: $!";
to open a file in binary mode, or:
   open( my $fh, "<:encoding(latin1)", $name ) or die "Cannot open $name: $!";
to open a file in latin1 mode.

To change the encoding for an already opened filehandle, you can do:
   binmode( $fh, ":raw:encoding(shiftjis)" );

PS: You cannot simply push a :encoding(...) layer onto a handle which
already has some :encoding(...) layer on it -- if you do, your data will
get doubly encoded.  This means that if you use the 'open' pragma to
define a default encoding, then whenever you want to alter the encoding
of a handle using binmode(), you need to stick :raw in front of the new
:encoding, to remove the old one.

This restriction will likely be removed in a future version of Encode.
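
For what it's worth, here is a small self-contained sketch of the binmode() approach on an already opened handle. It assumes File::Temp is available (it ships with 5.8) and writes its own throwaway test file; the variable names and the sample bytes (Shift_JIS for fullwidth "ab") are only for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file containing raw Shift_JIS bytes plus a newline.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':raw');
print {$tmp_fh} "\x82\x81\x82\x82\x0A";
close($tmp_fh) or die "Cannot close $path: $!";

# Open normally, then replace whatever layers are present with
# :raw followed by the Shift_JIS decoding layer.
open(my $fh, '<', $path) or die "Cannot open $path: $!";
binmode($fh, ':raw:encoding(shiftjis)') or die "Cannot set layers: $!";

my $line = <$fh>;
close($fh);
chomp $line;

# $line now holds characters, so length() and uc() see two letters.
my $count = length $line;
my $upper = uc $line;
print "characters read: $count\n";
```

Writing the result back out as Shift_JIS would use the same layer on an output handle.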



Tue, 27 Sep 2005 09:35:19 GMT  
On Thu, Apr 10, Benjamin Goldberg inscribed on the eternal scroll:

Quote:
> That's strange -- I would expect it to fail due to the SJIS encoded text
> being incorrectly interpreted as latin-1 encoded text, *not* due to it
> being incorrectly interpreted as utf8 encoded text.

Consider, for example, Red Hat 8? Its default locale is UTF-8, and Perl
5.8.0 treats filehandles as UTF-8 when the locale says so.

cheers



Tue, 27 Sep 2005 18:04:37 GMT  
 