Author |
Message |
Mark McEaher #1 / 13
|
 comparing lists
Suppose I want to do a boolean comparison (equal or not equal) of two lists case-insensitively and I don't care about order. I can do this: 1. Create temporary copies that are lowercased and use the in operator: 2. Sort temporary copies, then then compare them. Are there other options? Is there a strong reason to prefer one approach over the other? Sample code below. Thanks, // mark #! /usr/bin/env python l = ['a', 'b', 'c'] m = ['C', 'B', 'A'] def lower(l): return [x.lower() for x in l] def compare_with_in(l, m): assert(len(l) == len(m)) t1 = lower(l) t2 = lower(m) for x in t1: assert(x in t2) def compare_with_equals(l, m): t1 = lower(l) t2 = lower(m) t1.sort() t2.sort() assert(t1==t2) compare_with_in(l, m) compare_with_equals(l, m)
|
Mon, 25 Oct 2004 05:40:26 GMT |
|
 |
Paul Magwen #2 / 13
|
 comparing lists
Quote:
> Suppose I want to do a boolean comparison (equal or not equal) of two > lists case-insensitively and I don't care about order. I can do this: > 1. Create temporary copies that are lowercased and use the in operator: > 2. Sort temporary copies, then then compare them. > Are there other options? Is there a strong reason to prefer one > approach over the other? > Sample code below. > Thanks, > // mark
If you know you're dealing with lists of strings why don't you use the "join" methods in the string module. For example: Quote: >>> import string >>> l = ['a','b','c'] >>> m = ['C','b','a'] >>> n = ['A','B','c'] >>> lstr = string.join(l) >>> mstr = string.join(m) >>> nstr = string.join(n) >>> lstr.lower() 'a b c' >>> lstr.lower() == nstr.lower() 1 >>> lstr.lower() == mstr.lower() 0
--Paul
|
Mon, 25 Oct 2004 06:31:08 GMT |
|
 |
Mark McEaher #3 / 13
|
 comparing lists
[Paul Magwene] Quote: > If you know you're dealing with lists of strings why don't you use the > "join" methods in the string module.
Thanks. I hadn't thought of that. Sadly, it won't work for me because as I said my comparison is order-insensitive whereas your approach is order-sensitive. Cheers, // mark
|
Mon, 25 Oct 2004 07:12:02 GMT |
|
 |
jep.. #4 / 13
|
 comparing lists
def compare_with_dict(l, m): d = {} if len(l) != len(m): return 0 # Cannot be equal (?) for i in l: d[i.lower()] = 0 for i in m: if i.lower() not in d: return 0 # Proven unequal return 1 Setting a key in a dict and checking for a key in dict ('d.has_key(k)' in older versions of Python) are supposed to be constant-time operations, so this is O(len(l) + len(m)) to run. Your versions either use sort (O(n lg n)) or 'in' with lists (O(n^2)), so this one should perform better than the original. However, you might want to use the dict-representation all the time. It'd save you the cost of the copy. If the original case is sometimes important, you might store the key as the lowercase version with the value as the original case (eg l['studlycaps'] == 'StUdLyCaPs') Quote: >>> t = CaseInsensitiveSet() >>> u = CaseInsensitiveSet() >>> print t == u 1 >>> t.add('a') >>> u.add('B') >>> print t == u 0 >>> t.add('b') >>> u.add('a') >>> print t, `u`, u['b'], u['B']
<Set a b> <Set 'a' 'b'> B B Quote: >>> print t == u 1 >>> u.remove('b') >>> print t == u 0 >>> t.remove('b') >>> print t == u
1 Here's an implementation for 2.2 (tested against 2.3a0 CVS): class CaseInsensitiveSet(dict): def __setitem__(self, item, value): item = item.lower() super(CaseInsensitiveSet, self).__setitem__(item, value) def __getitem__(self, item): item = item.lower() return super(CaseInsensitiveSet, self).__getitem__(item) def add(self, item): self[item] = item def remove(self, item): del self[item] def __eq__(self, other): for k in self: if not k in other: return 0 return 1 # for sk, ok in zip(self, other): # if sk != ok: return 0 # return def __str__(self): return "<Set %s>" % " ".join(self) def __repr__(self): return "<Set %s>" % " ".join(map(repr, self))
|
Mon, 25 Oct 2004 07:17:04 GMT |
|
 |
Delaney, Timoth #5 / 13
|
 comparing lists
Quote:
> You can always sort the lists before joining them. Or sort copies if > you need to leave the originals as-s. I.e. > ltmp = copy.copy(lst) > ltmp.sort() > l1 = ['ab ','c',' d'] > l2 = ['ab',' c ','d'] > l3 = ['ab c d']
That won't work - you need to perform the case-conversion before the sort. Quote: > These will all compare as equal. What you probably want to > do is add a > delimiter between the elements that you are pretty sure will not occur
^^^^^^^^^^^ Quote: > in the original strings, e.g.: > >>> string.join(['a','b','c'],'$%^').upper() > 'A$%^B$%^C' > or, > >>> '|-|'.join(['a','b','c']).upper() > 'A|-|B|-|C'
Make that 100% certain will not occur in the original strings ... I would suggest just sticking with converting to lowercase, sorting, then comparing. l1 = ['a', 'b', 'c'] # or any iterable l2 = ['C', 'a', 'B'] # or any iterable l1 = map(string.lower, l1) # or l1 = [x.lower() for x in l1] l2 = map(string.lower, l2) # or l2 = [x.lower() for x in l2] l1.sort() l2.sort() assert(l1 == l2) Tim Delaney
|
Mon, 25 Oct 2004 08:44:22 GMT |
|
 |
Lulu of the Lotus-Eater #6 / 13
|
 comparing lists
|[Paul Magwene] |> If you know you're dealing with lists of strings why don't you use the |> "join" methods in the string module.
|Thanks. I hadn't thought of that. Sadly, it won't work for me because as I |said my comparison is order-insensitive whereas your approach is |order-sensitive. You can always sort the lists before joining them. Or sort copies if you need to leave the originals as-s. I.e. ltmp = copy.copy(lst) ltmp.sort() ... There was a flaw, however, with Magwene's suggestion. Once you join the items, you lose some information about what was where originally. For example: l1 = ['ab ','c',' d'] l2 = ['ab',' c ','d'] l3 = ['ab c d'] These will all compare as equal. What you probably want to do is add a delimiter between the elements that you are pretty sure will not occur in the original strings, e.g.: >>> string.join(['a','b','c'],'$%^').upper() 'A$%^B$%^C' or, >>> '|-|'.join(['a','b','c']).upper() 'A|-|B|-|C' -- ---[ to our friends at TLAs (spread the word) ]-------------------------- Echelon North Korea Nazi cracking spy smuggle Columbia fissionable Stego White Water strategic Clinton Delta Force militia TEMPEST Libya Mossad
|
Mon, 25 Oct 2004 08:18:26 GMT |
|
 |
Delaney, Timoth #7 / 13
|
 comparing lists
Here's a much simpler version that demonstrates the problem ... -- import string from time import clock from random import randrange dSize = 3 delim = ''.join(map(chr, [randrange(0,256) for _ in range(dSize)])) def DelaneyCompare(l1, l2): l1 = map(string.lower, l1) l2 = map(string.lower, l2) l1.sort() l2.sort() return l1 == l2 def LuluCompare(l1, l2): l1.sort() l2.sort() s1 = delim.join(l1).lower() s2 = delim.join(l2).lower() return s1 == s2 if __name__=='__main__': l1 = ['a', 'B'] l2 = ['A', 'b'] start = clock() print 'Delaney Result:', DelaneyCompare(l1, l2) print 'Time:', clock()-start start = clock() print 'Lulu Result: ', LuluCompare(l1, l2) print 'Time:', clock()-start ---------- Run ---------- Delaney Result: 1 Time: 0.000247198842009 Lulu Result: 0 Time: 9.83796172198e-005 In addition, the results seen with LuluCompare were heavily influenced by the fact that by having SIZE = 100000 l1 = ['aaaaa' for x in range(SIZE)] l2 = ['aaaaa' for x in range(SIZE)] you end up getting 200000 references to exactly the same object (the interned string 'aaaaa'). This will greatly speed up any sort routine (comparison of identity only required). DelaneyCompare OTOH was generating an additional 200000 different string objects, then comparing those. Always make sure your test data accurately reflects reality. Tim Delaney
|
Mon, 25 Oct 2004 11:46:00 GMT |
|
 |
Mark McEaher #8 / 13
|
 comparing lists
Quote: > Here's an implementation for 2.2 (tested against 2.3a0 CVS):
[...] Thanks, that's very helpful! // mark
|
Mon, 25 Oct 2004 10:53:07 GMT |
|
 |
Delaney, Timoth #9 / 13
|
 comparing lists
Quote: > r = [(randrange(0, len(str)), randrange(0, len(str))) for > x in range(randrange(0, len(str)))]
Ignore these horrible list comprehensions - they're bad. Normally I would have used a named function and map, but I was extending Lulu's code, and was lazy ... Tim Delaney
|
Mon, 25 Oct 2004 11:28:53 GMT |
|
 |
Delaney, Timoth #10 / 13
|
 comparing lists
[code snipped] import string from time import clock from random import randrange dSize = 3 delim = ''.join(map(chr, [randrange(0,256) for _ in range(dSize)])) def l1str(): """""" str = 'abcde' r = [(randrange(0, len(str)), randrange(0, len(str))) for x in range(randrange(0, len(str)))] for i, j in r: i, j = min(i, j), max(i, j) str = str[:i] + str[j:j+1] + str[i:j] + str[j+1:] return str def l2str (str): """""" r = [randrange(0, len(str)) for x in range(randrange(0, len(str)))] for i in r: str = str[:i] + str[i:i+1].upper() + str[i+1:] return str def DelaneyCompare(l1, l2): l1 = map(string.lower, l1) l2 = map(string.lower, l2) l1.sort() l2.sort() return l1 == l2 def LuluCompare(l1, l2): l1.sort() l2.sort() s1 = delim.join(l1).lower() s2 = delim.join(l2).lower() return s1 == s2 if __name__=='__main__': SIZE = 10 l1 = [l1str() for x in range(SIZE)] l2 = [l2str(s1) for s1 in l1] print l1 print l2 print start = clock() print 'Delaney Result:', DelaneyCompare(l1, l2) print 'Time:', clock()-start start = clock() print 'Lulu Result: ', LuluCompare(l1, l2) print 'Time:', clock()-start SIZE = 10000 l1 = [l1str() for x in range(SIZE)] l2 = [l2str(s1) for s1 in l1] print start = clock() print 'Delaney Result:', DelaneyCompare(l1, l2) print 'Time:', clock()-start start = clock() print 'Lulu Result: ', LuluCompare(l1, l2) print 'Time:', clock()-start ---------- Run ---------- ['aebcd', 'dcabe', 'adebc', 'acebd', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'baecd'] ['AebCD', 'dCAbe', 'aDeBC', 'aCebd', 'abcde', 'Abcde', 'abcDe', 'aBCDE', 'Abcde', 'bAeCd'] Delaney Result: 1 Time: 0.000317816168981 Lulu Result: 0 Time: 0.000125518683322 Delaney Result: 1 Time: 0.169594990082 Lulu Result: 0 Time: 0.0750326542647 Quote: > I like my version about 10x better :-).
I like my version infinitely better :) In your case, it didn't matter what order the data sorted in - it would always end up giving the same result since each entry was not only already lowercase, but all the entries were the same sequence of characters. In my case, I mixed up the order of characters (with each position in the two lists having the same order of characters) then mixed up the casing. Tim Delaney
|
Mon, 25 Oct 2004 11:25:30 GMT |
|
 |
Lulu of the Lotus-Eater #11 / 13
|
 comparing lists
|> ...What you probably want to do is add a delimiter |> between the elements that you are pretty sure will not occur ^^^^^^^^^^^
|Make that 100% certain will not occur in the original strings ... Nah... "pretty sure" is good enough. Specifically, if you use a random delimiter of length N, and a list of length K, you can be sure with a confidence of roughly K/(256**N). N doesn't have to get very large for pretty sure to be pretty *damn* sure :-). |I would suggest just sticking with converting to lowercase, sorting, then |comparing. |l1 = map(string.lower, l1) # or l1 = [x.lower() for x in l1] |l2 = map(string.lower, l2) # or l2 = [x.lower() for x in l2] |l1.sort() |l2.sort() Well... here's the thing: import string from time import clock from random import randrange dSize = 3 delim = ''.join(map(chr, [randrange(0,256) for _ in range(dSize)])) SIZE = 100000 l1 = ['aaaaa' for x in range(SIZE)] l2 = ['aaaaa' for x in range(SIZE)] def DelaneyCompare(l1, l2): l1 = map(string.lower, l1) l2 = map(string.lower, l2) l1.sort() l2.sort() return (l1==l2) def LuluCompare(l1, l2): l1.sort() l2.sort() s1 = delim.join(l1).lower() s2 = delim.join(l2).lower() return (s1==s2) if __name__=='__main__': start = clock() print 'Comparison Result:', DelaneyCompare(l1, l2) print 'Time:', clock()-start start = clock() print 'Comparison Result:', LuluCompare(l1, l2) print 'Time:', clock()-start I like my version about 10x better :-). Yours, Lulu... P.S. Yeah... I realized I cheated. The lowercasing is so much faster for the whole string than in a map() across all the little strings. In my hasty defense: if the strings really are insensitive, why not just store them in a lower() version when they are added to the list in the first place? P.P.S. If I stop cheating, I like my version about 0.95x as well as Tim's :-). --
gnosis | powers of IP- and crypto-tyranny have entered into an unholy .cx | alliance...ideas have nothing to lose but their chains. Unite | against "intellectual property" and anti-privacy regimes! -------------------------------------------------------------------------
|
Mon, 25 Oct 2004 10:34:20 GMT |
|
 |
Alex Martell #12 / 13
|
 comparing lists
Quote:
> Suppose I want to do a boolean comparison (equal or not equal) of two > lists case-insensitively and I don't care about order. I can do this:
"Don't care about order" -> use dictionaries. O(N) operation rather than: Quote: > 1. Create temporary copies that are lowercased and use the in operator:
O(N**2) Quote: > 2. Sort temporary copies, then then compare them.
O(N logN) Here's a O(N) solution: def compare_via_dicts(L, M): def makebag(L): dL = {} for x in L: x = x.lower() dL[x] = 1 + dL.get(x, 0) return makebag(L) == makebag(M) Note we use a 'bag' (x -> #instances of x in list), not a set, to take care of repetitions so ['a', 'A'] will not appear equal to ['a', 'A', 'A'], &c. Alex
|
Mon, 25 Oct 2004 14:37:31 GMT |
|
 |
mark mceahe #13 / 13
|
 comparing lists
[Alex Martelli] Quote: > Here's a O(N) solution: > def compare_via_dicts(L, M): > def makebag(L): > dL = {} > for x in L: > x = x.lower() > dL[x] = 1 + dL.get(x, 0) > return makebag(L) == makebag(M) > Note we use a 'bag' (x -> #instances of x in list), not a set, to take care > of repetitions so ['a', 'A'] will not appear equal to ['a', 'A', 'A'], &c.
Checking the lengths of the lists would also work to prevent ['a'] == ['a', 'a']. Thank you for this code. Cheers, // mark
|
Tue, 26 Oct 2004 20:57:15 GMT |
|
|
|