Character encoding problem with XML::Parser 

I feel somewhat confused by the way Perl and/or XML::Parser handle
character encodings. One would assume that

print $a; print $b;

is exactly equivalent to

print "$a$b";

but this is not always the case. The test case below shows that printing the
two strings separately produces iso-8859-1 encoded output but printing them
concatenated produces utf-8 (?) output.

The output from running the test under perl 5.6.1, XML::Parser 2.30 is:


00000000  41 e4 0a 41 c3 a4 0a                              |A..A...|
00000007

In perl 5.005_03, XML::Parser 2.27, the output is:

00000000  41 e4 0a 41 e4 0a                                 |A..A..|
00000006

I would like to make sure that the output is always iso-8859-1 (as with
perl 5.005_03, XML::Parser 2.27) without having to print each string
separately. How can I achieve this?

Regards

Mattias Holmlund

isotest.pl:
#!/usr/bin/perl -w

use strict;
use XML::Parser;

my $p1 = XML::Parser->new(Style => 'Tree');
my $tree = $p1->parse(<<EOXML);
<?xml version='1.0' encoding='ISO-8859-1'?>
<root a='A'/>
EOXML

$a=$tree->[1]->[0]->{a};
$b="\xE4"; # &auml; encoded in iso-8859-1.

print $a;
print $b;
print "\n";

print "$a$b\n";
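
The effect can be looked at without XML::Parser. What follows is only a minimal sketch, assuming Perl 5.8.1 or later (where `utf8::upgrade` and `utf8::is_utf8` are built in). The hand-upgraded `$attr` is a stand-in for the attribute value the parser returns, and the variable names are made up for illustration. It reports whether each string carries Perl's internal UTF-8 flag, and shows that interpolating a flagged string together with a plain byte string gives a flagged result.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in (assumption) for the value XML::Parser hands back: the single
# character "A", stored with Perl's internal UTF-8 flag switched on.
my $attr = "A";
utf8::upgrade($attr);

# A plain byte string: 0xE4 is a-umlaut in ISO-8859-1. The flag is off.
my $umlaut = "\xE4";

# Interpolation concatenates the two strings, and the result is upgraded.
my $joined = "$attr$umlaut";

# Report the flag and the length in characters for each string.
for my $pair ([attr => $attr], [umlaut => $umlaut], [joined => $joined]) {
    my ($name, $value) = @$pair;
    printf "%-7s flagged=%s length=%d\n",
        $name, (utf8::is_utf8($value) ? "yes" : "no"), length $value;
}
```

Printed separately, the byte string never meets the flagged one, which would fit the `41 e4` half of the 5.6.1 output. As far as I know, Perl 5.8 and later write the single byte `e4` for the joined string as well when the handle has no encoding layer, so the `c3 a4` result looks specific to 5.6.x.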



Mon, 06 Sep 2004 04:02:41 GMT  

Quote:
> ... perl 5.6.1,    XML::Parser 2.30: 41 e4 0a 41 c3 a4 0a                            
> ... perl 5.005_03, XML::Parser 2.27: 41 e4 0a 41 e4 0a
> I would like to make sure that the output is [like the second]

FYI, perl 5.6.1, XML::Parser 2.27 (ActiveState build 631) on Win2K
produces output like the second (with CR's inserted for the line breaks).
So maybe it's the parser, not perl?


Mon, 06 Sep 2004 22:01:17 GMT  


Quote:
>I feel somewhat confused by the way Perl and/or XML::Parser handling of
>character encodings. One would assume that

>print $a; print $b;

>is exactly equivalent to

>print "$a$b";

>but this is not always the case.

Bad assumption.  The statement
        print "$a$b";
is more like
        print "$a"; print "$b";
instead of
        print $a; print $b;

Double quotes are the reason why

is different from

since the former uses ${"} and the latter uses ${,} special variables.
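
The difference between `$"` and `$,` can be seen with any array. This is just a sketch: the array `@items`, the separator characters and the in-memory filehandle (which assumes Perl 5.8 or later) are chosen only for illustration.

```perl
use strict;
use warnings;

my @items = ('x', 'y', 'z');    # hypothetical data

# $" (list separator) is inserted between the elements when an array is
# interpolated into a double-quoted string.
my $interpolated = do { local $" = '-'; "@items" };

# $, (output field separator) is inserted between the arguments of print.
# Print into an in-memory filehandle so the result can be inspected.
my $printed = '';
open(my $fh, '>', \$printed) or die "open: $!";
{
    local $, = '+';
    print {$fh} @items;
}
close($fh) or die "close: $!";

print "interpolated: $interpolated\n";   # x-y-z
print "printed:      $printed\n";        # x+y+z
```

Neither variable comes into play when two scalars are interpolated, as in `"$a$b"`, so this alone would not account for the differing bytes.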

                -Joe
--
See http://www.inwap.com/ for PDP-10 and "ReBoot" pages.



Tue, 07 Sep 2004 11:29:44 GMT  

Quote:

> > ... perl 5.6.1,    XML::Parser 2.30: 41 e4 0a 41 c3 a4 0a                            
> > ... perl 5.005_03, XML::Parser 2.27: 41 e4 0a 41 e4 0a
> > I would like to make sure that the output is [like the second]

> FYI, perl 5.6.1, XML::Parser 2.27 (ActiveState build 631) on Win2K
> produces output like the second (with CR's inserted for the line breaks).
> So maybe it's the parser, not perl?

I looked at the Changelog for XML::Parser. For version 2.28 there is
an entry:

 - Merged patches supplied by Larry Wall to (for perl 5.6 and beyond)
   tag generated strings as UTF-8, where appropriate.

I don't know enough about perl internals, but it sounds like there is
some internal flag that tells whether a string is utf-8 or something else.
It seems as if there's a problem with concatenating a utf-8 string
with a non-utf-8 string. I have no idea whether the bug is in perl or
in XML::Parser, though, or whether this is the way it is supposed to
work. I read about Encode.pm in perl 5.7.2, but I haven't found anything
equivalent in perl 5.6.1.
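
For comparison, here is roughly what the Encode route looks like on a perl that has it. This is a sketch only, assuming Perl 5.8.1 or later, so it does not help on 5.6.1. `$attr` is upgraded by hand to stand in for the string the parser tags as UTF-8, and the in-memory filehandle `$out` is there only so the bytes can be inspected.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $attr = "A";
utf8::upgrade($attr);          # stand-in for XML::Parser's tagged string
my $umlaut = "\xE4";           # a-umlaut as a plain byte string

# Option 1: encode explicitly, then print the resulting bytes.
my $bytes = encode('iso-8859-1', "$attr$umlaut");
print unpack("H*", $bytes), "\n";     # 41e4

# Option 2: attach an encoding layer to the output handle, so that every
# print through it is written as ISO-8859-1.
my $buf = '';
open(my $out, '>:encoding(iso-8859-1)', \$buf) or die "open: $!";
print {$out} $attr;
print {$out} $umlaut;
print {$out} "$attr$umlaut";
close($out) or die "close: $!";
print unpack("H*", $buf), "\n";       # 41e441e4
```

In a real script the second option would presumably be `binmode(STDOUT, ':encoding(iso-8859-1)')` near the top, after which separate and concatenated prints should give the same bytes.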

/Mattias



Tue, 07 Sep 2004 22:58:24 GMT  

Quote:



> >I feel somewhat confused by the way Perl and/or XML::Parser handling of
> >character encodings. One would assume that

> >print $a; print $b;

> >is exactly equivalent to

> >print "$a$b";

> >but this is not always the case.

> Bad assumption.  The statement
>         print "$a$b";
> is more like
>         print "$a"; print "$b";
> instead of
>         print $a; print $b;

Well, actually
    print $a; print $b;
is interpreted as:
    print $a . $\; print $b . $\;
And:
    print "$a$b";
is interpreted as:
    print "$a$b" . $\;
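
A quick way to watch `$\` at work is to give it a visible value. This is a sketch only: the helper `captured`, the `|` separator and the plain ASCII values are made up for illustration, and the in-memory filehandle assumes Perl 5.8 or later.

```perl
use strict;
use warnings;

my ($x, $y) = ('A', 'B');   # plain ASCII stand-ins for $a and $b

# Run a piece of printing code against an in-memory filehandle, with a
# visible output record separator, and return what was written.
sub captured {
    my ($code) = @_;
    my $buf = '';
    open(my $fh, '>', \$buf) or die "open: $!";
    {
        local $\ = '|';      # normally undef, i.e. nothing is appended
        $code->($fh);
    }
    close($fh) or die "close: $!";
    return $buf;
}

my $separate = captured(sub { my $fh = shift; print {$fh} $x; print {$fh} $y; });
my $together = captured(sub { my $fh = shift; print {$fh} "$x$y"; });

print "separate: $separate\n";   # A|B|
print "together: $together\n";   # AB|
```

With `$\` left at its default of undef, both forms write the same bytes for plain strings like these, so `$\` by itself does not explain the encoding difference.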

Quote:
> Double quotes are the reason why

> is different from

> since the former uses ${"} and the latter uses ${,} special variables.

John
--
use Perl;
program
fulfillment


Wed, 08 Sep 2004 04:11:32 GMT  
 