Tkinter wart: returned texts are sometimes strings, sometimes Unicode strings

Hi all,

I stepped again on a problem I have had too many times, and I must ask: why on earth
does Tkinter return plain strings when getting a text value using only ASCII
characters and a Unicode string when the same text contains anything else? This
forces one to do things like:

if type(text) == type(unicode('')): text = text.encode(...)

everywhere, which is:
1/ utterly ugly
2/ really too easy to forget

Why not return plain strings encoded as UTF-8, for example? That way, everything
should always work and one should never get UnicodeError's again, or even
stranger errors (like line endings printing as string '\n' instead of real line
endings...). I really don't see the point in returning once an object with one
type, and once with another type _depending on the user's input_... Returning a
string encoded as UTF-8 makes much more sense to me... If the user wants another
encoding, the text may be reencoded via unicode(...).encode(...) afterwards.
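
To make the normalize-then-re-encode idea concrete, here is a minimal sketch. The helper names `to_utf8` and `reencode` are invented for this example and are not part of Tkinter; the `text_type` shim is an assumption added only so the snippet also runs on Python 3, where `unicode` no longer exists.

```python
try:
    text_type = unicode   # Python 2, the version discussed in this thread
except NameError:
    text_type = str       # Python 3: all text strings are Unicode

def to_utf8(text):
    """Return UTF-8 bytes for either a Unicode string or a plain byte string."""
    if isinstance(text, text_type):
        return text.encode("utf-8")
    return text  # assumed to be ASCII or UTF-8 bytes already

def reencode(data, target, source="utf-8"):
    """Re-encode a byte string from one encoding to another."""
    return data.decode(source).encode(target)
```

With something like this, the type check lives in one place instead of being repeated at every call site.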

Any opinions?
--

PragmaDev : Real Time Software Development Tools - http://www.*-*-*.com/



Tue, 06 Sep 2005 02:11:35 GMT  
Suppose that somebody enters the string containing U+00A1.  Returning any
non-unicode string (for instance '\xa1' (latin-1) or '\xc2\xa1' (utf-8))
is going to be wrong for some applications in some locales.
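
The point about U+00A1 can be checked directly. This is only an illustrative sketch; the variable names are arbitrary.

```python
char = u"\xa1"   # INVERTED EXCLAMATION MARK, U+00A1

latin1_bytes = char.encode("latin-1")   # one byte
utf8_bytes = char.encode("utf-8")       # two bytes

# Decoding with the wrong codec does not raise here;
# it silently produces different text.
misread = utf8_bytes.decode("latin-1")
```

The same character has two different byte representations, and a consumer that guesses the wrong one gets two characters instead of one without any error being raised.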

I assume that returning a plain string when possible is a space
optimization, but I wouldn't be sad to see it go (or become an internal
optimization, like the merger of "machine" and L-suffixed integers,
if this could in fact be done fairly painlessly).

IMO, it's intended that python code will automatically accept Unicode
strings anywhere regular strings were originally used, except in
interfaces which are explicitly byte-oriented.  In addition, several
facilities exist for common byte-oriented interfaces (file i/o being the
major one) to automatically encode the string into its byte-oriented
representation.
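
As a sketch of such a facility for file i/o: the example below uses `io.open`, which is an assumption on my part since it is newer than this thread (in 2005 the closest equivalent was `codecs.open`). The function names are made up for illustration.

```python
import io
import os
import tempfile

def save_text(path, text, encoding="utf-8"):
    """Write a Unicode string; the file object encodes it on the way out."""
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)

def read_raw(path):
    """Read the file back as raw bytes to see what was actually stored."""
    with io.open(path, "rb") as handle:
        return handle.read()

def demo():
    """Round-trip U+00A1 through a temporary file and return the raw bytes."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        save_text(path, u"\xa1")
        return read_raw(path)
    finally:
        os.remove(path)
```

The calling code only ever handles Unicode text; the byte-oriented representation is produced at the boundary.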

However, there is one thing I might be in favor of.  If you're working
in one of those rare environments where using sys.setdefaultencoding()
makes sense, then *maybe*  the following sequence should give you a
plain string rather than a unicode one:
    $ python -S
    Python 2.2.2 (#1, Oct 24 2002, 10:50:06)
    [GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-110)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.setdefaultencoding("utf-8")
    >>> import site
    >>> import Tkinter
    >>> t = Tkinter.Entry()
    >>> t.pack()
    >>> t.insert(0, u"\xa1")
    >>> t.get()  # should possibly be '\xc2\xa1' instead
    u'\xa1'

By the way, in this statement
    > if type(text) == type(unicode('')): text = text.encode(...)
type(unicode('')) is unicode.  You should probably actually write this:
    if isinstance(text, unicode): text = text.encode(...)
Of course, it's harmless (but extra work) to .encode() a plain ASCII string:
    >>> "abcd".encode("utf-8")
    'abcd'
and if you've used sys.setdefaultencoding(), a plain str() can work:
    >>> str(t.get())
    '\xc2\xa1'
Either of these techniques means that you can derive versions of any
widget type, overriding the method that returns the unicode strings you
don't want.  Or you could subclass only Tk, wrapping self.tk with
something that returns encoded strings from the necessary methods.
Subclasses just copy the value of the parent's tk attribute, so there'd
be no need to change every method that might return a string, just the
methods on one object.
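
A rough sketch of the wrapping idea follows. `EncodingProxy` and `FakeInterp` are hypothetical names invented here, and the sketch has not been tried against a real Tkinter interpreter object; it assumes that string results are returned directly, while tuples, Tcl objects and other values pass through untouched.

```python
try:
    text_type = unicode   # Python 2
except NameError:
    text_type = str       # Python 3

class EncodingProxy(object):
    """Wrap an object so Unicode results of its method calls come back encoded."""

    def __init__(self, wrapped, encoding="utf-8"):
        self._wrapped = wrapped
        self._encoding = encoding

    def __getattr__(self, name):
        # Only reached for names not found on the proxy itself.
        attr = getattr(self._wrapped, name)
        if not callable(attr):
            return attr

        def call(*args, **kwds):
            result = attr(*args, **kwds)
            if isinstance(result, text_type):
                return result.encode(self._encoding)
            return result
        return call

class FakeInterp(object):
    """Stand-in for the real interpreter object, used only for this demo."""
    def call(self, *args):
        return u"\xa1"
```

In a Tk subclass one might then assign something like `self.tk = EncodingProxy(self.tk)` after the base `__init__` has run, though whether every internal use of `self.tk` tolerates the proxy is an open question.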

Jeff



Tue, 06 Sep 2005 04:02:38 GMT  

Quote:

> Hi all,

> I stepped again on a problem I have had too many times, and I must ask: why
> on earth does Tkinter return plain strings when getting a text value
> using only ASCII characters and a Unicode string when the same text
> contains anything else?

Which part are you questioning?

It sometimes returns ASCII for backwards compatibility: applications
that were working for English in Python 1.5.2 continue to work.

It sometimes returns Unicode to represent all possible characters.

Quote:
> Why not return plain strings encoded as UTF-8, for example?

That may have been an option, but are you seriously proposing to
change that now?

Regards,
Martin



Tue, 06 Sep 2005 05:24:21 GMT  

Quote:


>>Hi all,

>>I stepped again on a problem I have had too many times, and I must ask: why
>>on earth does Tkinter return plain strings when getting a text value
>>using only ASCII characters and a Unicode string when the same text
>>contains anything else?

> Which part are you questioning?

> It sometimes returns ASCII for backwards compatibility: applications
> that were working for English in Python 1.5.2 continue to work.

Sorry, but I don't think there are "applications that work for English": every
application may be used by people whose native language is not English and who
have funny characters directly on their keyboard that they are used to typing
quite often. Getting a UnicodeError or some even stranger behaviour just because
they typed one of these really does not seem to be a good idea.

But what I'm actually questioning is not even that. It's that getting a text
from Tk does not return a consistent result: once a string, once a Unicode string.

And in fact, I don't even see the point in returning a string if the text
contains only plain ASCII: since Python default encoding *is* plain ASCII, what
would have been the problem if the text returned was also a Unicode string?

Quote:
> It sometimes returns Unicode to represent all possible characters.

>>Why not return plain strings encoded as UTF-8, for example?

> That may have been an option, but are you seriously proposing to
> change that now?

I'd really like to have opinions from other people who use Tkinter and whose
native language is not English. But IMHO, this would have been a far better
idea, at least until Unicode strings can be manipulated exactly the same way
plain strings are, which is not exactly the case today.

Anyway, thanks for your reply.
--

PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com



Tue, 06 Sep 2005 17:25:29 GMT  
Thanks a lot! I think I'll definitely sub-class the Tk class to get the UTF-8
encoding every time.
--

PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com


Tue, 06 Sep 2005 17:44:39 GMT  
   ...

Quote:
> I'd really like to have opinions from other people who use Tkinter and
> whose native language is not English. But IMHO, this would have been a far
> better idea, at least until Unicode strings can be manipulated exactly the
> same way plain strings are, which is not exactly the case today.

Not exactly, but for example s.encode('utf-8') returns equivalent
plain-string objects whether s is itself an ASCII plain-string object
or a Unicode string object, and unicode(s) returns equivalent Unicode
objects whether s is an ASCII plain-string object or a Unicode string
object.  So, it's not TOO hard to compensate for Tkinter's attempts
to accommodate the user's convenience (which, like many other attempts
at providing convenience, may well end up being in one's way, sigh).

A workaround to ensure Tkinter's widgets' methods return Unicode
strings exclusively is therefore reasonably simple, on the lines of:

>>> def wrapEnsuringUnicode(f):
...     def wrapper(*args, **kwds):
...         return unicode(f(*args, **kwds))
...     return wrapper
...
>>> import Tkinter
>>> Tkinter.Misc.cget = wrapEnsuringUnicode(Tkinter.Misc.cget)
>>> root = Tkinter.Tk()
>>> root.cget('height')
u'0'

You could perform such wrapping either dynamically, or statically
in a modified Tkinter.py of your own or by inheritance.
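
To apply the same wrapping to several methods at once, something along these lines might do. All names here (`ensure_text`, `make_text_wrapper`, `wrap_methods`, `FakeWidget`) are hypothetical, and `FakeWidget` stands in for a real widget class so the sketch can run without a display; which Tkinter methods would actually need wrapping is left open.

```python
try:
    text_type = unicode   # Python 2
except NameError:
    text_type = str       # Python 3

def ensure_text(value):
    """Coerce a plain ASCII byte string (or any other value) to Unicode."""
    if isinstance(value, bytes):
        return value.decode("ascii")
    return text_type(value)

def make_text_wrapper(method):
    """Build a replacement method whose result is always a Unicode string."""
    def wrapper(self, *args, **kwds):
        return ensure_text(method(self, *args, **kwds))
    return wrapper

def wrap_methods(cls, names):
    """Replace each named method on cls with a Unicode-returning version."""
    for name in names:
        setattr(cls, name, make_text_wrapper(getattr(cls, name)))

class FakeWidget(object):
    """Stand-in for a widget class; returns mixed types like Tkinter does."""
    def get(self):
        return b"abc"
    def cget(self, key):
        return u"\xa1"
```

Using a factory function (`make_text_wrapper`) rather than defining the wrapper inside the loop avoids every wrapper accidentally closing over the last method in the list.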

A better fix might be to modify _tkinter.c to avoid the "smart"
way PyTclObject_string now strives to return plain string
objects when all contents are ASCII:

        if (!self->string) {
                s = Tcl_GetStringFromObj(self->value, &len);
                for (i = 0; i < len; i++)
                        if (s[i] & 0x80)
                                break;
#ifdef Py_USING_UNICODE
                if (i == len)
                        /* It is an ASCII string. */
                        self->string = PyString_FromStringAndSize(s, len);
                else {
                        self->string = PyUnicode_DecodeUTF8(s, len, "strict");
                        if (!self->string) {
                                PyErr_Clear();
                                self->string = PyString_FromStringAndSize(s, len);
                        }
                }
#else
                self->string = PyString_FromStringAndSize(s, len);
#endif

down to just:

        if (!self->string) {
                s = Tcl_GetStringFromObj(self->value, &len);
#ifdef Py_USING_UNICODE
                self->string = PyUnicode_DecodeUTF8(s, len, "strict");
                if (!self->string) {
                        PyErr_Clear();
                        self->string = PyString_FromStringAndSize(s, len);
                }
#else
                self->string = PyString_FromStringAndSize(s, len);
#endif
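
For readers less used to the C API, the two variants can be modelled in Python roughly as follows. This is a simplified sketch, not the actual _tkinter code: the function names are invented, `raw` stands for the byte string obtained from Tcl, and the caching in `self->string` is ignored.

```python
def current_behaviour(raw):
    """Model of the existing code: plain string for ASCII, Unicode otherwise."""
    if all(byte < 0x80 for byte in bytearray(raw)):
        return raw                       # pure ASCII: returned undecoded
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw                       # fallback: hand back the bytes

def proposed_behaviour(raw):
    """Model of the shortened code: always try to decode as UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw                       # same fallback as before
```

The only difference between the two is the ASCII shortcut; the fallback for undecodable bytes is present in both.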

I'm not sure what this could break (indeed, I'm not even sure the
fallback to returning a string if decoding as utf-8 fails is even
warranted).  But perhaps we're getting into areas more appropriate
for the python-dev list than for the general python list.

Alex



Tue, 06 Sep 2005 18:57:12 GMT  

Quote:

> Sorry, but I don't think there are "applications that work for
> English": every application may be used by people whose native
> language is not English and who have funny characters directly on
> their keyboard that they are used to typing quite often.

This is certainly not the case. I believe a significant number of
Tkinter applications are used by only a few people, all of whom speak
only a single language and can only type ASCII at their keyboards.

Quote:
> Getting a UnicodeError or some even stranger behaviour just because
> they typed one of these really does not seem to be a good idea.

Indeed, that's the reason for returning ASCII when possible: Too many
existing applications would have gotten UnicodeErrors otherwise.

Quote:
> And in fact, I don't even see the point in returning a string if the
> text contains only plain ASCII: since Python default encoding *is*
> plain ASCII, what would have been the problem if the text returned was
> also a Unicode string?

There might be APIs to which you cannot pass Unicode objects. There
certainly were when Unicode support was added to _tkinter. In
particular, libraries that dispatch based on the data type will fail.
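
A small illustration of the kind of type-based dispatch that breaks: this is a made-up example, not code from any particular library, and the names `HANDLERS` and `render` are hypothetical.

```python
def render_bytes(value):
    return "bytes of length %d" % len(value)

def render_int(value):
    return "integer %d" % value

# Dispatch table keyed on the exact type; Unicode strings were never registered.
HANDLERS = {bytes: render_bytes, int: render_int}

def render(value):
    """Dispatch on the exact type of value, as some older libraries did."""
    try:
        handler = HANDLERS[type(value)]
    except KeyError:
        raise TypeError("unsupported type: %s" % type(value).__name__)
    return handler(value)
```

Code like this works for every plain string it was tested with and then fails as soon as a Unicode object arrives, even one that contains only ASCII characters.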

Quote:
> I'd really like to have opinions from other people who use Tkinter and
> whose native language is not English. But IMHO, this would have been a
> far better idea, at least until Unicode strings can be manipulated
> exactly the same way plain strings are, which is not exactly the case
> today.

Can you elaborate? What operations are available on strings but not on
Unicode objects?

Regards,
Martin



Wed, 07 Sep 2005 01:25:59 GMT  

Quote:

> A better fix might be to modify _tkinter.c to avoid the "smart"
> way PyTclObject_string now strives to return plain string
> objects when all contents are ASCII:

Would it be desirable to have this as a runtime configuration option?

[...]

Quote:
> I'm not sure what this could break

At the time this code was added, I believe there were reported
breakages. Not sure whether the same code would break with the current
Python still.

Quote:
> (indeed, I'm not even sure the fallback to returning a string if
> decoding as utf-8 fails is even warranted).

Unfortunately, it is: when people pass a byte string to Tkinter
that has non-ASCII non-UTF-8 sequences, Tk will assume it is encoded in
the locale's encoding, and render it as such. On returning it back to
Python, it will return it as it originally was, which may mean that
decoding as UTF-8 will fail.

Regards,
Martin



Wed, 07 Sep 2005 01:29:43 GMT  

Quote:

> Can you elaborate? What operations are available on strings but not on
> Unicode objects?

Here is the funniest one:

 >>> class C1:
 ...   def __repr__(self): return 'foo\nbar'
 ...
 >>> class C2:
 ...   def __repr__(self): return u'foo\nbar'
 ...
 >>> print repr(C1())
foo
bar
 >>> print repr(C2())   # surprise!
foo\nbar

The first time I stepped on it, as you might guess, it took me several hours to
figure out what on earth was happening...

And I'm also in a context where, like Alex said, the default behaviour is in my
way: my Tkinter interface is above an application that ends up writing its data
in XML files, encoded in UTF-8 whatever the current default encoding is. So I
control the whole process from top to bottom: UTF-8 in the XML data files, UTF-8
in internal structures and UTF-8 displayed by Tkinter/Tk. So I obviously don't
care about the default encoding anywhere, but I guess it's just the way *I* do
things...
--

PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com



Wed, 07 Sep 2005 02:28:34 GMT  

Quote:


>> A better fix might be to modify _tkinter.c to avoid the "smart"
>> way PyTclObject_string now strives to return plain string
>> objects when all contents are ASCII:

> Would it be desirable to have this as a runtime configuration option?

I think it might well be, yes.

Quote:
>> (indeed, I'm not even sure the fallback to returning a string if
>> decoding as utf-8 fails is even warranted).

> Unfortunately, it is: when people pass a byte string to Tkinter
> that has non-ASCII non-UTF-8 sequences, Tk will assume it is encoded in
> the locale's encoding, and render it as such. On returning it back to
> Python, it will return it as it originally was, which may mean that
> decoding as UTF-8 will fail.

OK... I sure wouldn't mind a way to get a warning for such
behavior on my program's part -- it sounds like it might be
something one might easily do accidentally, ending up with a
program that "happens to work in some locales", but that one might
well prefer to fix upon learning about it.  That is, assuming
I'm correctly following your description of what's going on.
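
Until something like that exists in Tkinter itself, an application could check its own byte strings before handing them to a widget. The helper below is a hypothetical sketch (the name `warn_if_not_utf8` is invented here), and it assumes that "not valid UTF-8" is a reasonable proxy for the situation described above.

```python
import warnings

def warn_if_not_utf8(data):
    """Warn when a byte string is not valid UTF-8, then return it unchanged."""
    if isinstance(data, bytes):
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            warnings.warn(
                "byte string is not valid UTF-8; Tk may interpret it "
                "using the locale's encoding",
                UnicodeWarning,
                stacklevel=2
            )
    return data
```

One would call it at the point where text enters the GUI, for example `entry.insert(0, warn_if_not_utf8(value))`, so that locale-dependent input shows up during testing rather than in the field.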

Alex



Wed, 07 Sep 2005 18:12:42 GMT  
 