Python and UTF-8 
 Python and UTF-8

I'm building a small automated management system for my website, and I
think I will go with Python and CGI. I have one question though: my
website is in Norwegian and in valid XHTML. All characters need to be
encoded in UTF-8. Currently, all articles go through me for formatting
and validation. Once I switch to a CMS, everybody with access to the
system will be able to put articles on my website, which means that all
kinds of formatting will be used. This
will make the W3C validator {*filter*} and my site would not validate.

Is it possible to write a Python script that converts the character
encoding to UTF-8 no matter what the encoding of the input is? I have
heard that Python has some great functions for Unicode handling, so this
might be an easy and trivial task, but I'm new to Python so I really
don't know...

-Brandvik



Sun, 20 Jun 2004 21:01:01 GMT  
 Python and UTF-8
Just a related topic:

I have a similar problem:

I would like to build an online service that converts Latin letters
into the Georgian alphabet (quite independent, one of the 14 existing
ones).

What I know is the Georgian encoding, see e.g.
http://www.*-*-*.com/

and the rules by which the Latin characters are to be substituted by the
Georgian ones.

What I do not know is how I can force Python to print the octals in
the right manner...
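
A minimal sketch of the idea in Python 3, under stated assumptions: the five-letter mapping below is purely illustrative (it is not the substitution rules mentioned in the post), the names `LATIN_TO_GEORGIAN` and `transliterate` are hypothetical, and "octals" is read here as the octal values of the UTF-8 bytes.

```python
# Illustrative mapping only: five letters, not the poster's actual rules.
LATIN_TO_GEORGIAN = {
    "a": "\u10d0",  # GEORGIAN LETTER AN
    "b": "\u10d1",  # GEORGIAN LETTER BAN
    "g": "\u10d2",  # GEORGIAN LETTER GAN
    "d": "\u10d3",  # GEORGIAN LETTER DON
    "e": "\u10d4",  # GEORGIAN LETTER EN
}
TABLE = str.maketrans(LATIN_TO_GEORGIAN)


def transliterate(text):
    """Replace each mapped Latin letter; unmapped characters pass through."""
    return text.translate(TABLE)


georgian = transliterate("gada")
utf8_bytes = georgian.encode("utf-8")  # the bytes a web page would carry
octal = " ".join(format(b, "03o") for b in utf8_bytes)  # each byte as 3 octal digits
```

Working on text first and encoding to UTF-8 only at the output boundary keeps the substitution rules independent of any byte-level representation.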

I am afraid that this question is not for this group... If so, please
accept my apologies :(

Best wishes,
Giorgi




Mon, 21 Jun 2004 08:43:03 GMT  
 Python and UTF-8

Quote:

> Is it possible to make a python script that would change the character
> to UTF-8 no matter what the encoding of the input is? I have heard
> that Python has some great functions for Unicode formatting so this
> might be an easy and trivial task, but I'm new to Python so I really
> don't know...

You have to know the encoding the data is currently in, say
current_encoding. Then, to convert it into UTF-8, you write

data = unicode(data, current_encoding).encode('utf-8')
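
For readers on Python 3, where the `unicode()` built-in no longer exists, the same conversion can be sketched roughly as follows. The function name `to_utf8` and the sample word are assumptions made for illustration.

```python
def to_utf8(data: bytes, current_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8."""
    text = data.decode(current_encoding)  # bytes -> str, needs the right codec
    return text.encode("utf-8")           # str -> UTF-8 bytes


# "blåbær" as it would arrive from a Latin-1 source.
latin1_bytes = "blåbær".encode("latin-1")
utf8_bytes = to_utf8(latin1_bytes, "latin-1")
```

As in the original one-liner, the caller still has to supply the source encoding; nothing here detects it.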

HTH,
Martin



Mon, 21 Jun 2004 05:28:47 GMT  
 Python and UTF-8
I wonder if you could find time to give me a hint on where I can get
information on which encoding schemes are currently supported in Python...
Grtz,
Giorgi



Tue, 22 Jun 2004 02:45:30 GMT  
 Python and UTF-8


Quote:

> You have to know the encoding the data is currently, say
> current_encoding. Then, converting it into UTF-8, you write

> data = unicode(data, current_encoding).encode('utf-8')

Yes, but what if I don't know?

I have been playing around with Unicode/Python/Tkinter, but I just don't
get it... Seems like I need some help on Unicode...
Would there be a good tutorial or book on this topic which answers
questions like:
How does Python handle Unicode-files?
How does sorting work with Unicode?
Can I use locales with Unicode (e.g. to sort words according to the
German convention?) How?
How to use regular expressions with Unicode?
etc.

Thanks, Matthias



Mon, 21 Jun 2004 20:39:31 GMT  
 Python and UTF-8


Quote:
> I wonder if you find time to give me a hint where can I get the information
> on which encoding schemes are currently supported in Python...
> Grtz,
> Giorgi

It doesn't seem to be very well documented. The only way I
ever figured it out was to go into the Python21/Lib/encodings directory
and look.



Mon, 21 Jun 2004 22:47:31 GMT  
 Python and UTF-8
Giorgi,

I'm interested in your problem; I'm learning about encodings in Python
for other purposes at the moment.

I am somewhat familiar with the Georgian alphabet. (I once tried to
translate an aria from paliashvili's _abessalom da eteri_ for my wife
-- but I ended up translating from the Russian. No offense, but
Georgian really is an *impossible* language!). There are certainly
some interesting issues: Capitalization, for one. And do you want the
reencoding to be, essentially, a transliteration?

Perhaps we should take this offline; you can contact me at my personal
email address, fbartlet (at) optonline (dot) net.

Fred




Mon, 21 Jun 2004 23:22:04 GMT  
 Python and UTF-8

Quote:
> I wonder if you find time to give me a hint where can I get the information
> on which encoding schemes are currently supported in Python...

On Unix systems, Python 2.1,

        /usr/lib/python2.1/encodings/

--
D?DY

            The dark ages were caused by the Y1K problem.



Mon, 21 Jun 2004 23:44:55 GMT  
 Python and UTF-8

Quote:

> I wonder if you find time to give me a hint where can I get the
> information on which encoding schemes are currently supported in
> Python...

Currently, there is no programmatic way to find all supported
encodings. In general, take the IANA registry of character sets at

http://www.iana.org/assignments/character-sets

as a starting point, and use the names in that registry (if there is a
preferred MIME name as an alias, use that, otherwise use the Name, not
an Alias).

To get an idea of what is supported, see

<prefix>/lib/python<version>/encodings

For a module foo_bar.py in this directory, you can use alternative
spellings, like "foo-bar", or "foo_BAR" (i.e. all of ISO-8859-1,
iso-8859-1, and ISO_8859-1 refer to the same encoding).
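
A small sketch of both points in Python 3: `codecs.lookup` resolves spelling variants to one codec, and walking the `encodings` package is the scripted equivalent of looking in the directory. The directory walk is only an approximation, since it lists module names rather than every accepted alias, and it also picks up a few helper modules.

```python
import codecs
import encodings
import pkgutil

# Spelling variants resolve to the same codec object.
spellings = ("ISO-8859-1", "iso-8859-1", "ISO_8859-1")
names = {codecs.lookup(s).name for s in spellings}

# Modules shipped in the encodings package (aliases are not listed here).
available = sorted(m.name for m in pkgutil.iter_modules(encodings.__path__))
```

An unknown name passed to `codecs.lookup` raises `LookupError`, which gives a quick check for whether a given encoding is usable on the installation at hand.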

If there is an encoding that is not supported by Python which you
need, please report it as a bug at sf.net/projects/python. Meanwhile,
you may find the additional codecs at sf.net/projects/python-codecs
useful. In particular, on Unix, the iconv codec will give you access
to many additional encodings (depending on your operating system).

HTH,
Martin



Tue, 22 Jun 2004 01:29:05 GMT  
 Python and UTF-8

Quote:

> > You have to know the encoding the data is currently, say
> > current_encoding. Then, converting it into UTF-8, you write

> > data = unicode(data, current_encoding).encode('utf-8')

> Yes, but what if I don't know?

If you get byte data from some source, and want to interpret those
byte data as character strings, you *have* to know the encoding. If
you don't, consider re-architecting your application so that you do.
If you still cannot know, guess. If you guess wrong often enough, your
users will complain, and they will then be willing to accept additional
infrastructure to properly identify the encoding of byte data.
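
One common way to "guess" can be sketched like this in Python 3: try strict UTF-8 first, then fall back to a single-byte codec. The function name `decode_with_fallback` and the candidate list are assumptions for illustration, and the result is only a guess, because Latin-1 accepts any byte sequence and so can never signal a wrong choice.

```python
def decode_with_fallback(data, candidates=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError("none of the candidate encodings fit the data")


from_utf8 = decode_with_fallback("blåbær".encode("utf-8"))
from_latin1 = decode_with_fallback("blåbær".encode("latin-1"))
```

UTF-8 goes first because random non-ASCII Latin-1 text is very unlikely to be valid UTF-8, so a successful strict decode is fairly strong evidence; the catch-all codec has to come last.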

Quote:
> How does Python handle Unicode-files?

There is no such thing as a Unicode file. Files are byte-oriented on
all systems I know. So when opening a file, you need to specify the
encoding. You can use codecs.open to read from a file and get Unicode
strings out of it.
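
A short Python 3 sketch of the same point, using the built-in `open` with an `encoding` argument in place of `codecs.open`. The temporary file and the sample word are assumptions made for the example.

```python
import os
import tempfile

# Write a few Latin-1 encoded bytes to a throwaway file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write("Eichhörnchen".encode("latin-1"))

# The file does not record its encoding, so the reader has to state it.
with open(path, encoding="latin-1") as f:
    text = f.read()

# Naming the wrong codec fails here (in other cases it garbles silently).
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    wrong_codec_accepted = True
except UnicodeDecodeError:
    wrong_codec_accepted = False

os.remove(path)
```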

Quote:
> How does sorting work with Unicode?

By default, it sorts by Unicode code point value.
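
To illustrate what that ordering means in practice, here is a tiny Python 3 sketch with an assumed word list: uppercase ASCII sorts before lowercase ASCII, and accented letters sort after both.

```python
words = ["Zebra", "apfel", "Ärger", "Apfel"]

# Plain sorted() compares strings code point by code point:
# 'A' (65) < 'Z' (90) < 'a' (97) < 'Ä' (196).
by_code_point = sorted(words)
```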

Quote:
> Can I use locales with Unicode (e.g. to sort words according to the
> German convention?) How?

You sort plain (byte) strings according to locale with
locale.strcoll. In theory, this function ought to work for Unicode
strings, too; it is a bug that it currently doesn't.

To work around this, you need to encode the Unicode strings into the
locale's character set, and compare the resulting byte strings with
strcoll.
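
In Python 3, `locale.strcoll` and `locale.strxfrm` accept ordinary (Unicode) strings, so the encoding step is not needed there. A hedged sketch follows; the helper name `sort_for_locale` and the locale name `de_DE.UTF-8` are assumptions, and the locale may not be installed on a given system, in which case the sketch falls back to whatever collation is active.

```python
import locale


def sort_for_locale(words, locale_name=""):
    """Sort words by the collation rules of locale_name, if it is available."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            pass  # locale not installed: keep the active collation
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore global state


result = sort_for_locale(["Zebra", "Ärger", "Apfel"], "de_DE.UTF-8")
```

The resulting order depends on the platform's locale data, so no particular output is claimed here. Restoring the previous setting matters because `setlocale` changes process-wide state.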

Quote:
> How to use regular expressions with Unicode?

Just use the re module: it fully supports Unicode.
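
A brief Python 3 sketch of Unicode-aware matching, using an assumed sample sentence. For `str` patterns, `\w` matches non-ASCII letters and `re.IGNORECASE` folds case for them as well, independent of the locale.

```python
import re

text = "Mühsam ernährt sich das Eichhörnchen"

# \w+ treats ü, ä and ö as word characters, so words are not split at them.
words = re.findall(r"\w+", text)

# Case-insensitive matching also covers the umlauts (Ö matches ö).
match = re.search("EICHHÖRNCHEN", text, re.IGNORECASE)
```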

Regards,
Martin



Tue, 22 Jun 2004 01:58:58 GMT  
 Python and UTF-8

Quote:
> I am somewhat familiar with the Georgian alphabet. (I once tried to
> translate an aria from paliashvili's _abessalom da eteri_ for my wife
> -- but I ended up translating from the Russian. No offense, but
> Georgian really is an *impossible* language!). There are certainly
> some interesting issues: Capitalization, for one. And do you want the
> reencoding to be, essentially, a transliteration?

It depends on what format you have the Georgian data in. E.g. if you
have it in Mac OS Georgian, you can define a codec that decodes it to
Unicode.

Regards,
Martin



Tue, 22 Jun 2004 02:04:29 GMT  
 Python and UTF-8


Quote:
> There is no such thing as a Unicode file. Files are byte-oriented on
> all systems I know. So when opening a file, you need to specify the
> encoding. You can use codecs.open to read from a file and get Unicode
> strings out of it.

Hmm, but how come reading a text file with Python and displaying it
in a Tkinter text widget (with a Unicode font) shows the text just
fine, regardless of the encoding used to save the file (Latin-1 or
UTF-8) and without specifying the encoding when opening it? Does Python
guess by itself? As I said in my earlier posting, I just don't get how
it works...

Quote:
> You sort plain (byte) strings according to locale with
> locale.strcoll. In theory, this function ought to work for Unicode
> strings, too; it is a bug that it currently doesn't.

Okay, this explains the trouble I got into, when trying to use
locale.strcoll with Unicode strings...

Thanks, Martin!

Matthias



Tue, 22 Jun 2004 02:46:06 GMT  
 Python and UTF-8


Quote:
>> How to use regular expressions with Unicode?

> Just use the re module: it fully supports Unicode.

Not really...
At least the combination of re.I and re.U fails on texts in German.
But that again could be a result of the combination of 'locale' and
Unicode, right?
I tried this (Win 98, Python 2.1, Idle):

----------------------
>>> import locale
>>> locale.setlocale(locale.LC_ALL,"")
'German_Germany.1252'
>>> t = 'Mühsam ernährt sich das Eichhörnchen.'
>>> print t.upper()
MHSAM ERN?HRT SICH DAS EICHH?RNCHEN.
>>> tu = unicode(t, 'latin-1').encode('utf-8')
>>> print tu.upper()
MHSAM ERN?HRT SICH DAS EICHH?RNCHEN.
----------------------

This should work, I think. But it doesn't.
Did I miss something?

Matthias



Tue, 22 Jun 2004 03:41:24 GMT  
 Python and UTF-8
On Thu, 3 Jan 2002 20:41:24 +0100, Matthias Huening

Quote:
>----------------------
>>>> import locale
>>>> locale.setlocale(locale.LC_ALL,"")
>'German_Germany.1252'
>>>> t = 'Mühsam ernährt sich das Eichhörnchen.'
>>>> print t.upper()
>MHSAM ERN?HRT SICH DAS EICHH?RNCHEN.
>>>> tu = unicode(t, 'latin-1').encode('utf-8')
>>>> print tu.upper()

Won't work. 'upper' doesn't know anything about the UTF-8 encoding; it
assumes cp1252 according to the locale settings. 'tu' isn't a Unicode
string; it is a byte string holding UTF-8 data.

Quote:
>MHSAM ERN?HRT SICH DAS EICHH?RNCHEN.

>----------------------

(W2K, Ger)

Welcome To PyCrust 0.7 - The Flakiest Python Shell
Sponsored by Orbtech.com - Your Source For Python Development Services
Python 2.2 (#28, Dec 21 2001, 12:21:22) [MSC 32 bit (Intel)] on win32
Type "copyright", "credits" or "license" for more information.

>>> import locale
>>> locale.setlocale(locale.LC_ALL,"")
'German_Germany.1252'
>>> t="Mühsam ernährt sich das Eichhörnchen"
>>> tu=unicode(t,"latin-1")
>>> tu
u'M\xfchsam ern\xe4hrt sich das Eichh\xf6rnchen'
>>> tu.upper()
u'M\xdcHSAM ERN\xc4HRT SICH DAS EICHH\xd6RNCHEN'
>>> print tu.upper().encode("latin-1")
MÜHSAM ERNÄHRT SICH DAS EICHHÖRNCHEN
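
The same distinction can be sketched in Python 3, where text (`str`) and encoded data (`bytes`) are separate types. This is an illustration, not a reproduction of the Python 2 session: calling `upper()` on text handles the umlauts, while calling it on UTF-8 bytes only touches ASCII letters.

```python
t = "Mühsam ernährt sich das Eichhörnchen"

# Uppercasing text: Unicode-aware, umlauts included.
upper_text = t.upper()

# Uppercasing encoded bytes: only a-z change, multi-byte characters are left alone.
upper_bytes = t.encode("utf-8").upper()
round_trip = upper_bytes.decode("utf-8")
```

The practical rule is the same as in the session: decode first, transform the text, and encode again only when writing the result out.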

--
Thank you for observing all safety precautions



Tue, 22 Jun 2004 06:15:09 GMT  
 Python and UTF-8
On Thu, 3 Jan 2002 20:41:24 +0100, Matthias Huening

Quote:


>>> How to use regular expressions with Unicode?

>> Just use the re module: it fully supports Unicode.

>Not really...
>At least the combination of re.I and re.U fails on texts in German.
>But that again could be a result of the combination of 'locale' and
>Unicode, right?

re.I doesn't honor the locale settings. From the manual:

IGNORECASE
Perform case-insensitive matching; expressions like [A-Z] will match
lowercase letters, too. This is not affected by the current locale

--
Thank you for observing all safety precautions


