Tkinter wart: returned texts are sometimes strings, sometimes Unicode strings
Eric Brune #1 / 10
Hi all,

I stepped again on a problem I have had too many times, and I must ask: why on earth does Tkinter return a plain string when getting a text value that uses only ASCII characters, and a Unicode string when the same text contains anything else? This forces one to do things like:

    if type(text) == type(unicode('')): text = text.encode(...)

everywhere, which is:

1/ utterly ugly
2/ really too easy to forget

Why not return plain strings encoded as UTF-8, for example? That way, everything should always work and one should never get UnicodeErrors again, or even stranger errors (like line endings printing as the string '\n' instead of real line endings...).

I really don't see the point in returning an object of one type one time and of another type the next, _depending on the user's input_... Returning a string encoded as UTF-8 makes much more sense to me... If the user wants another encoding, the text may be re-encoded via unicode(...).encode(...) afterwards.

Any opinions?
--
PragmaDev : Real Time Software Development Tools - http://www.*-*-*.com/
Tue, 06 Sep 2005 02:11:35 GMT

Jeff Eple #2 / 10
Suppose that somebody enters a string containing U+00A1. Returning any non-Unicode string (for instance '\xa1' (latin-1) or '\xc2\xa1' (utf-8)) is going to be wrong for some applications in some locales.

I assume that returning a plain string when possible is a space optimization, but I wouldn't be sad to see it go (or become an internal optimization, like the merger of "machine" and L-suffixed integers, if this could in fact be done fairly painlessly).

IMO, it's intended that Python code will automatically accept Unicode strings anywhere regular strings were originally used, except in interfaces which are explicitly byte-oriented. In addition, several facilities exist for common byte-oriented interfaces (file I/O being the major one) to automatically encode the string into its byte-oriented representation.

However, there is one thing I might be in favor of. If you're working in one of those rare environments where using sys.setdefaultencoding() makes sense, then *maybe* the following sequence should give you a plain string rather than a Unicode one:

    $ python -S
    Python 2.2.2 (#1, Oct 24 2002, 10:50:06)
    [GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-110)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.setdefaultencoding("utf-8")
    >>> import site
    >>> import Tkinter
    >>> t = Tkinter.Entry()
    >>> t.pack()
    >>> t.insert(0, u"\xa1")
    >>> t.get() # should possibly be '\xc2\xa1' instead
    u'\xa1'

By the way, in this statement

> if type(text) == type(unicode('')): text = text.encode(...)

type(unicode('')) is unicode. You should probably actually write this:

    if isinstance(text, unicode): text = text.encode(...)

Of course, it's harmless (but extra work) to .encode() a string:

    >>> "abcd".encode("utf-8")
    'abcd'

and if you've used sys.setdefaultencoding(), a plain str() can work:

    >>> str(t.get())
    '\xc2\xa1'

Either of these techniques means that you can derive versions of any widget type, overriding the method that returns the Unicode strings you don't want. Or you could subclass only Tk, wrapping self.tk with something that returns encoded strings from the necessary methods. Subclasses just copy the value of the parent's tk attribute, so there'd be no need to change every method that might return a string, just the methods on one object.

Jeff
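The two workarounds described above might be sketched roughly as follows. This is only an illustration under stated assumptions: `to_utf8`, `EncodingProxy` and `FakeTk` are hypothetical names that do not exist in Tkinter, and `FakeTk` is a stand-in so the example runs without a display. It has not been tried against a real `Tk` object. The text type is looked up at run time so the sketch works on both Python 2 (`unicode`) and Python 3 (`str`).

```python
import sys

# The text type is `unicode` on Python 2 and `str` on Python 3.
# Only the selected branch is evaluated, so this is safe on both.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821


def to_utf8(value):
    """Encode text to UTF-8 bytes; pass any other value through unchanged."""
    if isinstance(value, text_type):
        return value.encode("utf-8")
    return value


class EncodingProxy(object):
    """Hypothetical wrapper: delegates to `target`, encoding text results."""

    def __init__(self, target):
        self._target = target

    def __getattr__(self, name):
        # Only called for names not found on the proxy itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def wrapper(*args, **kwds):
            return to_utf8(attr(*args, **kwds))
        return wrapper


class FakeTk(object):
    """Stand-in for the real interpreter object, for demonstration only."""

    def call(self, *args):
        return u"\xa1"


proxy = EncodingProxy(FakeTk())
print(repr(proxy.call("entry", "get")))
```

On Python 2 a plain ASCII `str` passes through `to_utf8` untouched, which is fine because ASCII is already valid UTF-8.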
Tue, 06 Sep 2005 04:02:38 GMT

Martin v. Löw #3 / 10
> Hi all,
> I stepped again on a problem I had too many times, and I must ask: why
> on earth does Tkinter return plain strings when getting a text value
> using only ASCII characters and a Unicode string when the same text
> contains anything else?

Which part are you questioning?

It returns sometimes ASCII for backwards compatibility: Applications that were working for English in Python 1.5.2 continue to work.

It returns sometimes Unicode to represent all possible characters.

> Why not return plain strings encoded as UTF-8, for example?

That may have been an option, but are you seriously proposing to change that now?

Regards,
Martin
Tue, 06 Sep 2005 05:24:21 GMT

Eric Brune #4 / 10
>> Hi all,
>> I stepped again on a problem I had too many times, and I must ask: why
>> on earth does Tkinter return plain strings when getting a text value
>> using only ASCII characters and a Unicode string when the same text
>> contains anything else?
> Which part are you questioning?
> It returns sometimes ASCII for backwards compatibility: Applications
> that were working for English in Python 1.5.2 continue to work.

Sorry, but I don't think there are "applications that work for English": every application may be used by people whose native language is not English and who have funny characters directly on their keyboard that they are used to typing quite often. Getting a UnicodeError or some even stranger behaviour just because they typed one of these really does not seem to be a good idea.

But what I'm actually questioning is not even that. It's that getting a text from Tk does not return a consistent result: sometimes a string, sometimes a Unicode string. And in fact, I don't even see the point in returning a string if the text contains only plain ASCII: since Python's default encoding *is* plain ASCII, what would have been the problem if the text returned was also a Unicode string?

> It returns sometimes Unicode to represent all possible characters.
>> Why not return plain strings encoded as UTF-8, for example?
> That may have been an option, but are you seriously proposing to
> change that now?

I'd really like to have opinions from other people who use Tkinter and whose native language is not English. But IMHO, this would have been a far better idea, at least until Unicode strings can be manipulated exactly the same way plain strings are, which is not exactly the case today.

Anyway, thanks for your reply.
--
PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com
Tue, 06 Sep 2005 17:25:29 GMT

Eric Brune #5 / 10
Thanks a lot! I think I'll definitely subclass the Tk class to get the UTF-8 encoding every time.
--
PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com
Tue, 06 Sep 2005 17:44:39 GMT

Alex Martell #6 / 10
...

> I'd really like to have opinions from other people who use Tkinter and
> whose native language is not English. But IMHO, this would have been a far
> better idea, at least until Unicode strings can be manipulated exactly the
> same way plain strings are, which is not exactly the case today.

Not exactly, but for example s.encode('utf-8') returns equivalent plain-string objects whether s is itself an ASCII plain-string object or a Unicode string object, and unicode(s) returns equivalent Unicode objects whether s is an ASCII plain-string object or a Unicode string object. So, it's not TOO hard to compensate for Tkinter's attempts to accommodate the user's convenience (which, like many other attempts at providing convenience, may well end up being in one's way, sigh).

A workaround to ensure Tkinter's widgets' methods return Unicode strings exclusively is therefore reasonably simple, along the lines of:

    >>> def wrapEnsuringUnicode(f):
    ...     def wrapper(*args, **kwds):
    ...         return unicode(f(*args, **kwds))
    ...     return wrapper
    ...
    >>> import Tkinter
    >>> Tkinter.Misc.cget=wrapEnsuringUnicode(Tkinter.Misc.cget)
    >>> root = Tkinter.Tk()
    >>> root.cget('height')
    u'0'

You could perform such wrapping either dynamically, or statically in a modified Tkinter.py of your own, or by inheritance.

A better fix might be to modify _tkinter.c to avoid the "smart" way PyTclObject_string now strives to return plain string objects when all contents are ASCII:

    if (!self->string) {
        s = Tcl_GetStringFromObj(self->value, &len);
        for (i = 0; i < len; i++)
            if (s[i] & 0x80)
                break;
    #ifdef Py_USING_UNICODE
        if (i == len)
            /* It is an ASCII string. */
            self->string = PyString_FromStringAndSize(s, len);
        else {
            self->string = PyUnicode_DecodeUTF8(s, len, "strict");
            if (!self->string) {
                PyErr_Clear();
                self->string = PyString_FromStringAndSize(s, len);
            }
        }
    #else
        self->string = PyString_FromStringAndSize(s, len);
    #endif

down to just:

    if (!self->string) {
        s = Tcl_GetStringFromObj(self->value, &len);
    #ifdef Py_USING_UNICODE
        self->string = PyUnicode_DecodeUTF8(s, len, "strict");
        if (!self->string) {
            PyErr_Clear();
            self->string = PyString_FromStringAndSize(s, len);
        }
    #else
        self->string = PyString_FromStringAndSize(s, len);
    #endif

I'm not sure what this could break (indeed, I'm not even sure the fallback to returning a string if decoding as UTF-8 fails is even warranted). But perhaps we're getting into areas more appropriate for the python-dev list than for the general Python list.

Alex
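To make the difference between the two C variants easier to see, here is a rough Python model of their decision logic. It is a sketch only, not the actual _tkinter code: `current_behaviour` and `proposed_behaviour` are hypothetical names, and `raw` stands for the UTF-8 bytes that Tcl hands back.

```python
def current_behaviour(raw):
    """Model of the existing logic: keep pure-ASCII data as a byte string."""
    # bytearray() yields integers on both Python 2 and Python 3.
    if all(byte < 0x80 for byte in bytearray(raw)):
        return raw
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback: hand the undecodable bytes back unchanged.
        return raw


def proposed_behaviour(raw):
    """Model of the simplified logic: always try to decode first."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw


for sample in (b"abc", b"\xc2\xa1", b"\xa1"):
    print(repr(current_behaviour(sample)), repr(proposed_behaviour(sample)))
```

The only input on which the two functions differ in result type is pure ASCII: the first returns a byte string, the second a text string.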
Tue, 06 Sep 2005 18:57:12 GMT

Martin v. Löw #7 / 10
> Sorry, but I don't think there are "applications that work for
> English": every application may be used by people whose native
> language is not English and who have funny characters directly on
> their keyboard that they are used to typing quite often.

This is certainly not the case. I believe a significant number of Tkinter applications are used by only a few people, all of whom speak only a single language and can only type ASCII at their keyboards.

> Getting a UnicodeError or some even stranger behaviour just because
> they typed one of these really does not seem to be a good idea.

Indeed, that's the reason for returning ASCII when possible: Too many existing applications would have gotten UnicodeErrors otherwise.

> And in fact, I don't even see the point in returning a string if the
> text contains only plain ASCII: since Python's default encoding *is*
> plain ASCII, what would have been the problem if the text returned was
> also a Unicode string?

There might be APIs to which you cannot pass Unicode objects. There certainly were when Unicode support was added to _tkinter. In particular, libraries that dispatch based on the data type will fail.

> I'd really like to have opinions from other people who use Tkinter and
> whose native language is not English. But IMHO, this would have been a
> far better idea, at least until Unicode strings can be manipulated
> exactly the same way plain strings are, which is not exactly the case
> today.

Can you elaborate? What operations are available on strings but not on Unicode objects?

Regards,
Martin
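The point about libraries that dispatch on the data type can be illustrated with a small sketch. The names `HANDLERS`, `handle_bytes` and `dispatch` are invented for this example and do not come from any real library; the assumption is simply that a library written before Unicode support looks up a handler by exact type.

```python
def handle_bytes(value):
    """Handler written when only byte strings were expected."""
    return "bytes of length %d" % len(value)


# Exact-type lookup table; `bytes` is an alias for `str` on Python 2.
HANDLERS = {bytes: handle_bytes}


def dispatch(value):
    handler = HANDLERS.get(type(value))
    if handler is None:
        raise TypeError("no handler for %s" % type(value).__name__)
    return handler(value)


print(dispatch(b"abc"))
try:
    dispatch(u"abc")
except TypeError as exc:
    print("failed: %s" % exc)
```

Code of this shape keeps working as long as ASCII input comes back as a plain string, and starts failing once the same input arrives as a Unicode object.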
Wed, 07 Sep 2005 01:25:59 GMT

Martin v. Löw #8 / 10
> A better fix might be to modify _tkinter.c to avoid the "smart"
> way PyTclObject_string now strives to return plain string
> objects when all contents are ASCII:

Would it be desirable to have this as a runtime configuration option?

[...]

> I'm not sure what this could break

At the time this code was added, I believe there were reported breakages. I'm not sure whether the same code would still break with the current Python.

> (indeed, I'm not even sure the fallback to returning a string if
> decoding as UTF-8 fails is even warranted).

Unfortunately, it is: When people pass a byte string to Tkinter that has non-ASCII, non-UTF-8 sequences, Tk will assume it is encoded in the locale's encoding, and render it as such. On returning it to Python, Tk will return it as it originally was, which may mean that decoding it as UTF-8 will fail.

Regards,
Martin
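The failure mode described here is easy to reproduce without Tk at all. The sketch below only shows that the latin-1 encoding of U+00A1 is not valid UTF-8. It makes no claim about what Tk does internally, and `decodes_as_utf8` is a hypothetical helper name.

```python
text = u"\xa1"  # INVERTED EXCLAMATION MARK, the character used earlier in the thread

latin1_bytes = text.encode("latin-1")  # a single byte
utf8_bytes = text.encode("utf-8")      # two bytes


def decodes_as_utf8(raw):
    """Return True if `raw` is valid UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(repr(latin1_bytes), decodes_as_utf8(latin1_bytes))
print(repr(utf8_bytes), decodes_as_utf8(utf8_bytes))
```

A byte string that was passed in using a locale encoding such as latin-1 can therefore come back in a form that a strict UTF-8 decode rejects, which is the case the fallback is meant to cover.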
Wed, 07 Sep 2005 01:29:43 GMT

Eric Brune #9 / 10
> Can you elaborate? What operations are available on strings but not on
> Unicode objects?

Here is the funniest one:

    >>> class C1:
    ...     def __repr__(self): return 'foo\nbar'
    ...
    >>> class C2:
    ...     def __repr__(self): return u'foo\nbar'
    ...
    >>> print repr(C1())
    foo
    bar
    >>> print repr(C2()) # surprise!
    foo\nbar

The first time I stepped on it, as you might guess, it took me several hours to figure out what on earth was happening...

And I'm also in a context where, like Alex said, the default behaviour is in my way: my Tkinter interface sits on top of an application that ends up writing its data in XML files, encoded in UTF-8 whatever the current default encoding is. So I control the whole process from top to bottom: UTF-8 in the XML data files, UTF-8 in internal structures and UTF-8 displayed by Tkinter/Tk. So I obviously don't care about the default encoding anywhere, but I guess it's just the way *I* do things...
--
PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com
Wed, 07 Sep 2005 02:28:34 GMT

Alex Martell #10 / 10
>> A better fix might be to modify _tkinter.c to avoid the "smart"
>> way PyTclObject_string now strives to return plain string
>> objects when all contents are ASCII:
> Would it be desirable to have this as a runtime configuration option?

I think it might well be, yes.

>> (indeed, I'm not even sure the fallback to returning a string if
>> decoding as UTF-8 fails is even warranted).
> Unfortunately, there is: When people pass a byte string to Tkinter
> that has non-ASCII non-UTF8 sequences, Tk will assume it is encoded in
> the locale's encoding, and render it as such. On returning it back to
> Python, it will return it as it originally was, which may mean that
> decoding to UTF-8 will fail.

OK... I sure wouldn't mind a way to get a warning for such behavior on my program's part -- it sounds like something one might easily do accidentally, ending up with a program that "happens to work in some locales", but which one might well prefer to fix upon learning about it. That is, assuming I'm correctly following your description of what's going on.

Alex
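A warning of the kind wished for here could look something like the following at the Python level. This is purely a sketch of the idea: `decode_tcl_bytes` is a hypothetical function, nothing like it is claimed to exist in Tkinter, and the choice of `UnicodeWarning` as the category is an assumption.

```python
import warnings


def decode_tcl_bytes(raw):
    """Decode `raw` as UTF-8, warning (not failing) when that is impossible."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        warnings.warn(
            "byte string is not valid UTF-8; returning it undecoded",
            UnicodeWarning,
            stacklevel=2,
        )
        return raw


# Example: collect the warning instead of printing it to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = decode_tcl_bytes(b"\xa1")

print(repr(result), len(caught))
```

Because it uses the standard `warnings` machinery, a program could turn the warning into an error during testing with `warnings.simplefilter("error")` and silence it in production.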
Wed, 07 Sep 2005 18:12:42 GMT