More Unicode Trouble 
 More Unicode Trouble

Interesting to see that I'm not the only one suffering from Unicode hassle.

Support for Unicode in Python seems rather incomplete.

I can work with MS languages, such as JScript, and manipulate Unicode strings just the same as ASCII
strings (for the most part). Input and output just seem to work. Indeed, it's so transparent that
to date, I've not really had to worry that Unicode exists at all.

However, in Python, I'm having all sorts of trouble. I can't print, I can't write Unicode strings to
files and I can't find any documentation on how to handle Unicode.

I need to sort this out. A system that we are soon to put live has started blowing up all over the
place because the database is full of Unicode strings. (Rather conveniently, our test suite wasn't
pulling these out!)

I could do with some help!

Thanks.
--
Dale Strickland-Clark
Out-Think Ltd
Business Technology Consultants



Mon, 19 May 2003 03:00:00 GMT  
 More Unicode Trouble

Quote:

> However, in Python, I'm having all sorts of trouble. I can't print,
>I can't write Unicode strings to
> files and I can't find any documentation on how to handle Unicode.

You looked at
<URL:http://starship.python.net/crew/lemburg/unicode-proposal.txt>?
Or at the examples in the Unicode section of
<URL:http://www.python.org/2.0/new-python.html>?  

You'll have to remember that the default conversion from Unicode to
8-bit characters uses the ASCII encoding, which can't handle
characters >127; when dealing with such characters, you can't rely on any
automatic conversion but must explicitly choose an encoding.  Also,
various bits of the standard modules may not cope with Unicode
strings.
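
For example (a quick sketch; the exact error message differs between
versions):

    >>> u"caf\xe9".encode("latin-1")   # explicit encoding: fine
    'caf\xe9'
    >>> str(u"caf\xe9")                # implicit conversion uses the ASCII default
    Traceback (most recent call last):
      ...
    UnicodeError: ASCII encoding error: ordinal not in range(128)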

What problems are you running into?  Be specific...

--amk



Mon, 19 May 2003 03:00:00 GMT  
 More Unicode Trouble

Quote:

> Support for Unicode in Python seems rather incomplete

compared to many other languages, it's very complete
(no speed penalty, i/o, lots of codecs, property databases,
regular expressions, etc) -- it's just that it refuses to
convert between encodings if you don't tell it what encoding
you want...

Quote:
> However, in Python, I'm having all sorts of trouble. I can't
> print

do you have a unicode terminal?

Quote:
> I can't write Unicode strings to files

ah, user testing: if you write a unicode string to a file, what
data do you expect to get in that file?  (on a byte level).
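
the same characters turn into quite different bytes depending on
which encoding you pick -- a quick sketch (the utf-16 result shown
assumes a little-endian box):

    >>> s = u"\xe5\xe4\xf6"        # three latin-1 characters
    >>> s.encode("latin-1")
    '\xe5\xe4\xf6'
    >>> s.encode("utf-8")
    '\xc3\xa5\xc3\xa4\xc3\xb6'
    >>> s.encode("utf-16")         # byte-order mark first
    '\xff\xfe\xe5\x00\xe4\x00\xf6\x00'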

Quote:
> I can't find any documentation on how to handle Unicode.

start here:
http://www.python.org/2.0/new-python.html#SECTION000500000000000000000

then read:
http://www.lemburg.com/files/python/unicode-proposal.txt

</F>



Tue, 20 May 2003 13:57:03 GMT  
 More Unicode Trouble

Quote:


>> Support for Unicode in Python seems rather incomplete

>compared to many other languages, it's very complete
>(no speed penalty, i/o, lots of codecs, property databases,
>regular expressions, etc) -- it's just that it refuses to
>convert between encodings if you don't tell it what encoding
>you want...

>> However, in Python, I'm having all sorts of trouble. I can't
>> print

>do you have a unicode terminal?

>> I can't write Unicode strings to files

>ah, user testing: if you write a unicode string to a file, what
>data do you expect to get in that file?  (on a byte level).

>> I can't find any documentation on how to handle Unicode.

>start here:
>http://www.python.org/2.0/new-python.html#SECTION000500000000000000000

>then read:
>http://www.lemburg.com/files/python/unicode-proposal.txt

></F>

Thanks for the references - although the first one always gives me 'Connection Refused', which seems
a bit odd.

I still don't see why an implicit coercion to a string using some sensible defaults can't be
performed when required. Then I'd only need to start exploring this area if the results aren't what
I need.

The current behaviour seems to go against what I see as one of the principal aims of Python - to relieve
the programmer of unnecessary burdens.
--
Dale Strickland-Clark
Out-Think Ltd
Business Technology Consultants



Tue, 20 May 2003 03:00:00 GMT  
 More Unicode Trouble

Quote:

> I still don't see why an implicit coercion to a string using some sensible
> defaults can't be performed when required.

What's a "sensible default"?

The original design used UTF-8.  As we found out, that's a great
way to cause all sorts of disasters, since 1) it breaks the notion
that a string is a sequence of characters, and 2) it causes all sorts of
problems for people using 8-bit character sets, etc, etc.  Hardly
any program is prepared to find UTF-8 in a plain text file, hardly
any output device can deal with it.  It may be a great encoding,
but it's a lousy default.

An alternate proposal used ISO-Latin-1.  It was voted down as
being too western-centric -- ignoring the fact that ISO and the
Unicode consortium had already made that decision, and that it
happens to be the default encoding on almost all Unix systems
and most Windows boxes.

A third alternative would be to use a system encoding.  There's
a getdefaultlocale function in the locale module that makes a
pretty good guess.

    >>> lang, encoding = locale.getdefaultlocale()
    >>> encoding
    'cp1252'

However, there's only one default encoding in the current Unicode
system, and it affects all Unicode to 8-bit translations.  Making this
one platform-dependent might cause major portability headaches.

Another problem is that file objects don't support encodings.  That
can be addressed via the codecs.EncodedFile wrapper class, but
lots of programs expect to find real file handles in sys.stdout (etc).
Adding encoding support to the standard file type would help here,
but we ran out of time...
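
A rough sketch of what such a wrapper looks like today, using a
StreamWriter obtained from codecs.lookup rather than EncodedFile
('latin-1' is just an illustrative choice, not a recommendation):

    import sys, codecs

    # wrap sys.stdout so that unicode strings written to it get encoded
    # explicitly instead of falling back to the ASCII default
    enc, dec, reader, writer = codecs.lookup("latin-1")
    out = writer(sys.stdout)
    out.write(u"caf\xe9\n")    # encoded to latin-1 on the way out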

So the 2.0 compromise was to default to pure US ASCII, and
raise an exception anytime a programmer tries to do something
that indicates that he doesn't really know what he's doing...
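
You can check which default is in force (it's 'ascii' unless your
site configuration changed it at interpreter startup):

    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'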

(some relevant pythonic theses follow:

    2. Explicit is better than implicit.
    10. Errors should never pass silently.

    11. Unless explicitly silenced.

    12. In the face of ambiguity, refuse the temptation to guess.

)

We can of course change this in 2.1.  Just gotta figure out how...

</F>



Tue, 20 May 2003 03:00:00 GMT  
 More Unicode Trouble

Quote:

> >start here:
> >http://www.python.org/2.0/new-python.html#SECTION000500000000000000000

> Thanks for the references - although the first one always
> gives me 'Connection Refused' which seems a bit odd..

python.org appears to be down.  here's a mirror:
http://www.cwi.nl/www.python.org/2.0/new-python.html#SECTION000500000...

</F>



Tue, 20 May 2003 03:00:00 GMT  
 More Unicode Trouble

Quote:
> Interesting to see that I'm not the only one suffering from Unicode
> hassle.
> Support for Unicode in Python seems rather incomplete

Seems rich enough to me, *considering* that Python has to live with
a legacy of single-byte-code strings as well as Unicode.

Quote:
> I can work with MS languages, such as JScript, and manipulate Unicode
> strings just the same as ASCII strings (for the most part). Input and
> output just seem to work.

How do you work with ASCII strings _at all_ in JScript?  All of its strings
are Unicode (as are almost all strings handled by COM objects of any kind, or
by the internals of NT or Win2000, etc).  Of course, if you try to display
strings containing Unicode codepoints not covered by your current font[s],
you'll be somewhat out of luck, but that's unlikely to happen for chars in
the ASCII subset.  _Mostly_, needed conversions to/from single-byte or
multi-byte character codes happen reasonably smoothly... *as long as* you
have no need for compatibility with a language that originally handled
such single-byte or multi-byte strings directly.

Not that JavaScript (ECMAScript, JScript, etc) really is/was problem free
regarding Unicode support -- a highly recommended reading on this is:
http://www-4.ibm.com/software/developer/library/internationalization-...
.html
but, as the quoted article suggests, the kind of problems that lay in there
tended to be the kind "only noticed by serious Unicode mavens", i.e., the
people who *really need* good Unicode compliance!-)

Quote:
> Indeed, it's so transparent that
> to date, I've not really had to worry that Unicode exists at all.

If Unicode is all-pervasive, then it can indeed become well-nigh
transparent.  Java, Microsoft's scripting languages, Visual Basic,
&c, approximate this, somewhat.  (Actually, VB regularly gets in
trouble with Unicode-vs-single-byte issues when needing to call
into external non-COM DLL's -- trying to *hide* the distinction
gets troublesome, and it's horrible when the external software
may need any sophisticated transcoding -- but that is just one
case where Unicode is *NOT* all-pervasive...).

An existing language that is able to handle single-byte character
strings has, of necessity, a less-smooth path in evolving to
Unicode, if any backwards compatibility is desired (particularly
if it must run on platforms with very scarce native Unicode
support).  I think Python has handled this reasonably well.
The alternative was basically to delay Unicode issues to the
"future [pie-in-the-sky?] not backwards-compatible Py3K", and
I, for one, am VERY glad this was not the decision made (or we'd
still be well-nigh Unicodeless today...).

Quote:
> However, in Python, I'm having all sorts of trouble. I can't print,

Why not?  You just have to know what encoding your printer
wants, and specify it explicitly.
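
For instance (just a sketch -- 'cp437' is a guess typical of a Windows
console; substitute whatever encoding your console or printer actually
understands):

    print u"na\xefve".encode("cp437")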

Quote:
> I can't write Unicode strings to files

Again, why not?  Just specify the desired encoding.  On NT, the
typical encoding (the one Notepad uses, I believe) is UTF-16
(2 bytes/character, with a byte-order mark at the start of the file).
Note that the byte-order mark must appear ONLY at the start of the file,
so the best approach may be something like:

of = open("aac.txt", "wb")
# encoding an empty unicode string as UTF-16 emits just the byte-order mark
of.write(u"".encode("UTF-16"))
for word in u"fee fie foo fum".split():
    # explicit little-endian encoding, so no further byte-order marks appear
    of.write(word.encode('UTF-16-le'))
    of.write(u"\r\n".encode('UTF-16-le'))
of.close()

or, equivalently (and adding a test to
make it work on big-endian platforms just
as well as on little-endian ones, and on
Unix and Mac as well as on Windows...):

import codecs, os

linesep = unicode(os.linesep)

of = open("aad.txt", "wb")
# write the byte-order mark once, by hand
of.write(codecs.BOM)

# pick the UTF-16 variant that matches this platform's byte order
if codecs.BOM == codecs.BOM_BE:
    encoding = 'UTF-16-be'
else:
    encoding = 'UTF-16-le'

# wrap the file in a StreamWriter so plain writes of unicode get encoded
enc, dec, sr, sw = codecs.lookup(encoding)
of = sw(of)

for word in u"fee fie foo fum".split():
    of.write(word + linesep)
of.close()

i.e., the byte-order mark goes in just once, at the start, and then
UTF-16-le (or -be) is used to avoid spurious byte-order marks on
subsequent writes.  (Not sure why the stream writer corresponding to
'UTF-16' does *NOT* maintain enough state to know it must insert the
byte-order mark just the FIRST time, but it doesn't -- oh well.  Also,
text-mode [non-binary] files have their own issues: the '\n'->'\r\n'
translation, if it is to happen at all, would happen in a
non-Unicode-aware manner, and thus wrongly.  That issue appears somewhat
harder to fix in a general way, but using a binary file and os.linesep
isn't TOO bad.)
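
Reading such a file back is easier, since the 'UTF-16' codec consumes
the byte-order mark for you (a sketch, assuming codecs.open is
available in your build):

import codecs

f = codecs.open("aad.txt", "rb", "UTF-16")   # BOM handled by the codec
text = f.read()                              # comes back as a unicode string
f.close()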

Quote:
> and I can't find any documentation on how to handle Unicode.

I think (and I *dearly hope* I'm wrong) that the best doc
is still Lemburg's "proposal", at
http://www.lemburg.com/files/python/unicode-proposal.txt
Just read it 'imagining' it's not a proposal but an actual
description of how things work, mentally translating all
the 'should do so-and-so' into 'does so-and-so'.  Not sure
why this rewording hasn't been actually performed (making
this document, e.g., a chapter or appendix of the Python
library reference, or whatever) -- as I say, I hope I'm wrong
here and the current 'real official' docs already have all
of this wealth of information, but I don't think they do.

Quote:
> I need to sort this out. A system that we are soon to put live has started
> blowing up all over the place because the database is full of Unicode
> strings.
> (Rather conveniently, our test suite wasn't pulling these out!)

> I could do with some help!

I hope this does help a bit...

Alex



Tue, 20 May 2003 03:00:00 GMT  
 More Unicode Trouble
Many thanks to all who took the trouble to reply - especially Alex, who has single-handedly (well,
almost) sorted out the bulk of my sticky Python problems.

I am somewhat the wiser from the responses, but I have no clue as to what encoding to use. Also, as I
have yet to identify the database records that are causing the problems (lack of time), I don't even
know what characters are causing Python to blow up.

The data has come from a variety of sources all over the world and has been merged into a single
(50MB) database.

How do I figure out which codec(?) to use?

--
Dale Strickland-Clark
Out-Think Ltd
Business Technology Consultants



Tue, 20 May 2003 03:00:00 GMT  
 More Unicode Trouble

Quote:

> Many thanks to all who took the trouble to reply - especially
> Alex who has single handedly (well, almost) sorted out the
> bulk of my sticky Python problems.

as far as I can tell, I was the only other person trying to
help you here.  don't worry, dale, I won't make that mistake
again.

</F>



Tue, 20 May 2003 03:00:00 GMT  
 More Unicode Trouble

Quote:


>> However, in Python, I'm having all sorts of trouble. I can't print,
>> I can't write Unicode strings to
>> files and I can't find any documentation on how to handle Unicode.

> What problems are you running into?  Be specific...

To be fair, this has been well documented on this newsgroup in periodic
bursts since Python 1.6.

People are getting Unicode values from external sources (COM, databases,
etc), and these often fail when they simply try to _print_ the thing.
Considering how often "print" is used for debugging in Python, having
"print" itself blow up is highly frustrating.

I understand most of the issues, and why things are the way they are.  But
that doesn't change the fact that things _do_ suck at the moment in
Python 2.0 with Unicode.  As Dale says, VB, JScript etc are Unicode at
the core, and you would never know.  Python is making its Unicode
support very obvious at the moment, which it should _not_ be!

Mark.



Wed, 21 May 2003 08:19:39 GMT  
 More Unicode Trouble

Quote:

> What problems are you running into?  Be specific...

Actually, further to this:

I believe Python is unable to open certain files on Windows file
systems.  There was a post on this newsgroup about 2 months ago with an
example of an extended character in a filename and a failure to open
it.  I successfully created the filename on my filesystem, but failed to
get Python to open this file.  I also posted my failure to the group.  I
don't recall seeing another response with a working solution.

Again, I understand (basically) why this is, and upgrading Python's file
IO to be Unicode aware would fix it.
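
The workaround, such as it is, is to encode the unicode filename down to
an 8-bit string yourself before calling open() (a sketch only; 'cp1252'
is an assumption -- use whatever ANSI code page your system runs, and it
only works if the name is representable in that code page):

    fname = u"resum\xe9.txt"
    f = open(fname.encode("cp1252"))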

But we don't have a shortage of specific problems; what we do appear to
have is a shortage of specific solutions, or work-plans that would
indicate this is likely to be fixed in the 2.1 timeframe.

Mea culpa - I am on python-dev, and haven't stuck my hand up here
either.  However, as a speaker of only English, I am really in over my
head before I start...

Mark.



Wed, 21 May 2003 08:26:31 GMT  
 More Unicode Trouble

Quote:


>> Many thanks to all who took the trouble to reply - especially
>> Alex who has single handedly (well, almost) sorted out the
>> bulk of my sticky Python problems.

>as far as I can tell, I was the only other person trying to
>help you here.  don't worry, dale, I won't do that mistake
>again.

></F>

I think you might have taken my post as sarcastic. It wasn't. You, and Andrew Kuchling, were
included in my thanks, which were genuine.

However, Alex has been of particular value in helping me understand some of the less well documented
aspects of Python in the last few months.

I am now also somewhat grateful to Mark for showing his sympathy for the problem.
--
Dale Strickland-Clark
Out-Think Ltd
Business Technology Consultants



Wed, 21 May 2003 03:00:00 GMT  
 More Unicode Trouble

Quote:

> > and I can't find any documentation on how to handle Unicode.

> I think (and I *dearly hope* I'm wrong) that the best doc
> is still Lemburg's "proposal", at
> http://www.lemburg.com/files/python/unicode-proposal.txt
> Just read it 'imagining' it's not a proposal but an actual
> description of how things work, mentally translating all
> the 'should do so-and-so' into 'does so-and-so'.  Not sure
> why this rewording hasn't been actually performed (making
> this document, e.g., a chapter or appendix of the Python
> library reference, or whatever) -- as I say, I hope I'm wrong
> here and the current 'real official' docs already have all
> of this wealth of information, but I don't think they do.

The above file was indeed the proposal that was used as the
basis for the Unicode implementation in Python.

Many of the higher-level interfaces are already documented in the
standard documentation, but if you care about the internals
and all the gory details, then it is still the number one
reference.

Note that most of the trouble users have with Unicode comes from
not understanding the difference between 8-bit strings (without
any encoding information) and 2-byte (single encoding) Unicode --
these are really two different worlds and bringing them together
is a hard piece of work.

Other languages which seem to make
Unicode easy typically don't make this distinction at all: they
simply use Unicode all the way. But this is something we can't
do just yet in Python since it would break too much code.
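
A tiny example of where the two worlds meet -- mixing only works while
the 8-bit side is pure ASCII, since that is all the implicit conversion
will assume (the exact error text varies between versions):

    >>> u"abc" + "def"        # ASCII bytes: implicit conversion succeeds
    u'abcdef'
    >>> u"abc" + "d\xe9f"     # a byte > 127: Python refuses to guess an encoding
    Traceback (most recent call last):
      ...
    UnicodeError: ASCII decoding error: ordinal not in range(128)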

--
Marc-Andre Lemburg
______________________________________________________________________
Company:                                         http://www.*-*-*.com/
Consulting:                                     http://www.*-*-*.com/
Python Pages:                           http://www.*-*-*.com/



Sat, 24 May 2003 08:22:39 GMT  
 