comparing lists 
Author Message
 comparing lists

Suppose I want to do a boolean comparison (equal or not equal) of two lists
case-insensitively and I don't care about order.  I can do this:

1.  Create temporary copies that are lowercased and use the in operator:

2.  Sort temporary copies, then then compare them.

Are there other options?  Is there a strong reason to prefer one approach
over the other?

Sample code below.

Thanks,

// mark

#! /usr/bin/env python

l = ['a', 'b', 'c']
m = ['C', 'B', 'A']

def lower(l):
    return [x.lower() for x in l]

def compare_with_in(l, m):
    assert(len(l) == len(m))
    t1 = lower(l)
    t2 = lower(m)
    for x in t1:
        assert(x in t2)

def compare_with_equals(l, m):
    t1 = lower(l)
    t2 = lower(m)
    t1.sort()
    t2.sort()
    assert(t1==t2)

compare_with_in(l, m)
compare_with_equals(l, m)



Mon, 25 Oct 2004 05:40:26 GMT  
 comparing lists

Quote:

> Suppose I want to do a boolean comparison (equal or not equal) of two
> lists case-insensitively and I don't care about order.  I can do this:

> 1.  Create temporary copies that are lowercased and use the in operator:

> 2.  Sort temporary copies, then then compare them.

> Are there other options?  Is there a strong reason to prefer one
> approach over the other?

> Sample code below.

> Thanks,

> // mark

If you know you're dealing with lists of strings why don't you use the
"join" methods in the string module.

For example:

Quote:
>>> import string
>>> l = ['a','b','c']
>>> m = ['C','b','a']
>>> n = ['A','B','c']
>>> lstr = string.join(l)
>>> mstr = string.join(m)
>>> nstr = string.join(n)
>>> lstr.lower()
'a b c'
>>> lstr.lower() == nstr.lower()
1
>>> lstr.lower() == mstr.lower()
0

--Paul


Mon, 25 Oct 2004 06:31:08 GMT  
 comparing lists
[Paul Magwene]

Quote:
> If you know you're dealing with lists of strings why don't you use the
> "join" methods in the string module.

Thanks.  I hadn't thought of that.  Sadly, it won't work for me because as I
said my comparison is order-insensitive whereas your approach is
order-sensitive.

Cheers,

// mark



Mon, 25 Oct 2004 07:12:02 GMT  
 comparing lists
def compare_with_dict(l, m):
    d = {}
    if len(l) != len(m): return 0 # Cannot be equal (?)
    for i in l: d[i.lower()] = 0
    for i in m:
        if i.lower() not in d: return 0 # Proven unequal
    return 1

Setting a key in a dict and checking for a key in dict ('d.has_key(k)' in
older versions of Python) are supposed to be constant-time operations, so
this is O(len(l) + len(m)) to run.

Your versions either use sort (O(n lg n)) or 'in' with lists (O(n^2)),
so this one should perform better than the original.

However, you might want to use the dict-representation all the time.  It'd
save you the cost of the copy.  If the original case is sometimes
important, you might store the key as the lowercase version with the value
as the original case (eg l['studlycaps'] == 'StUdLyCaPs')

Quote:
>>> t = CaseInsensitiveSet()
>>> u = CaseInsensitiveSet()
>>> print t == u
1
>>> t.add('a')
>>> u.add('B')
>>> print t == u
0
>>> t.add('b')
>>> u.add('a')
>>> print t, `u`, u['b'], u['B']

<Set a b> <Set 'a' 'b'> B B
Quote:
>>> print t == u
1
>>> u.remove('b')
>>> print t == u
0
>>> t.remove('b')
>>> print t == u

1

Here's an implementation for 2.2 (tested against 2.3a0 CVS):

class CaseInsensitiveSet(dict):
    def __setitem__(self, item, value):
        item = item.lower()
        super(CaseInsensitiveSet, self).__setitem__(item, value)

    def __getitem__(self, item):
        item = item.lower()
        return super(CaseInsensitiveSet, self).__getitem__(item)

    def add(self, item):
        self[item] = item

    def remove(self, item):
        del self[item]

    def __eq__(self, other):
        for k in self:
            if not k in other: return 0
        return 1
        # for sk, ok in zip(self, other):
        #     if sk != ok: return 0
        # return

    def __str__(self):
        return "<Set %s>" % " ".join(self)

    def __repr__(self):
        return "<Set %s>" % " ".join(map(repr, self))



Mon, 25 Oct 2004 07:17:04 GMT  
 comparing lists

Quote:

> You can always sort the lists before joining them.  Or sort copies if
> you need to leave the originals as-s.  I.e.

>     ltmp = copy.copy(lst)
>     ltmp.sort()

>     l1 = ['ab ','c',' d']
>     l2 = ['ab',' c ','d']
>     l3 = ['ab  c  d']

That won't work - you need to perform the case-conversion before the sort.

Quote:
> These will all compare as equal.  What you probably want to
> do is add a
> delimiter between the elements that you are pretty sure will not occur

                                              ^^^^^^^^^^^

Quote:
> in the original strings, e.g.:

>     >>> string.join(['a','b','c'],'$%^').upper()
>     'A$%^B$%^C'

> or,

>     >>> '|-|'.join(['a','b','c']).upper()
>     'A|-|B|-|C'

Make that 100% certain will not occur in the original strings ...

I would suggest just sticking with converting to lowercase, sorting, then
comparing.

l1 = ['a', 'b', 'c']          # or any iterable
l2 = ['C', 'a', 'B']          # or any iterable

l1 = map(string.lower, l1)    # or l1 = [x.lower() for x in l1]
l2 = map(string.lower, l2)    # or l2 = [x.lower() for x in l2]

l1.sort()
l2.sort()

assert(l1 == l2)

Tim Delaney



Mon, 25 Oct 2004 08:44:22 GMT  
 comparing lists
|[Paul Magwene]
|> If you know you're dealing with lists of strings why don't you use the
|> "join" methods in the string module.


|Thanks.  I hadn't thought of that.  Sadly, it won't work for me because as I
|said my comparison is order-insensitive whereas your approach is
|order-sensitive.

You can always sort the lists before joining them.  Or sort copies if
you need to leave the originals as-s.  I.e.

    ltmp = copy.copy(lst)
    ltmp.sort()
    ...

There was a flaw, however, with Magwene's suggestion.  Once you join the
items, you lose some information about what was where originally.  For
example:

    l1 = ['ab ','c',' d']
    l2 = ['ab',' c ','d']
    l3 = ['ab  c  d']

These will all compare as equal.  What you probably want to do is add a
delimiter between the elements that you are pretty sure will not occur
in the original strings, e.g.:

    >>> string.join(['a','b','c'],'$%^').upper()
    'A$%^B$%^C'

or,

    >>> '|-|'.join(['a','b','c']).upper()
    'A|-|B|-|C'

--
---[ to our friends at TLAs (spread the word) ]--------------------------
Echelon North Korea Nazi cracking spy smuggle Columbia fissionable Stego
White Water strategic Clinton Delta Force militia TEMPEST Libya Mossad



Mon, 25 Oct 2004 08:18:26 GMT  
 comparing lists
Here's a much simpler version that demonstrates the problem ...

--

import string
from time import clock
from random import randrange

dSize = 3
delim = ''.join(map(chr, [randrange(0,256) for _ in range(dSize)]))

def DelaneyCompare(l1, l2):
    l1 = map(string.lower, l1)
    l2 = map(string.lower, l2)
    l1.sort()
    l2.sort()
    return l1 == l2

def LuluCompare(l1, l2):
    l1.sort()
    l2.sort()
    s1 = delim.join(l1).lower()
    s2 = delim.join(l2).lower()
    return s1 == s2

if __name__=='__main__':

    l1 = ['a', 'B']
    l2 = ['A', 'b']

    start = clock()
    print 'Delaney Result:', DelaneyCompare(l1, l2)
    print 'Time:', clock()-start

    start = clock()
    print 'Lulu Result:   ', LuluCompare(l1, l2)
    print 'Time:', clock()-start

---------- Run ----------
Delaney Result: 1
Time: 0.000247198842009
Lulu Result:    0
Time: 9.83796172198e-005

In addition, the results seen with LuluCompare were heavily influenced by
the fact that by having

SIZE = 100000
l1 = ['aaaaa' for x in range(SIZE)]
l2 = ['aaaaa' for x in range(SIZE)]

you end up getting 200000 references to exactly the same object (the
interned string 'aaaaa'). This will greatly speed up any sort routine
(comparison of identity only required). DelaneyCompare OTOH was generating
an additional 200000 different string objects, then comparing those.

Always make sure your test data accurately reflects reality.

Tim Delaney



Mon, 25 Oct 2004 11:46:00 GMT  
 comparing lists

Quote:
> Here's an implementation for 2.2 (tested against 2.3a0 CVS):

[...]

Thanks, that's very helpful!

// mark



Mon, 25 Oct 2004 10:53:07 GMT  
 comparing lists

Quote:
>     r = [(randrange(0, len(str)), randrange(0, len(str))) for
> x in range(randrange(0, len(str)))]

Ignore these horrible list comprehensions - they're bad. Normally I would
have used a named function and map, but I was extending Lulu's code, and was
lazy ...

Tim Delaney



Mon, 25 Oct 2004 11:28:53 GMT  
 comparing lists

[code snipped]

import string
from time import clock
from random import randrange

dSize = 3
delim = ''.join(map(chr, [randrange(0,256) for _ in range(dSize)]))

def l1str():
    """"""
    str = 'abcde'
    r = [(randrange(0, len(str)), randrange(0, len(str))) for x in
range(randrange(0, len(str)))]

    for i, j in r:
        i, j = min(i, j), max(i, j)
        str = str[:i] + str[j:j+1] + str[i:j] + str[j+1:]

    return str

def l2str (str):
    """"""
    r = [randrange(0, len(str)) for x in range(randrange(0, len(str)))]

    for i in r:
        str = str[:i] + str[i:i+1].upper() + str[i+1:]

    return str

def DelaneyCompare(l1, l2):
    l1 = map(string.lower, l1)
    l2 = map(string.lower, l2)
    l1.sort()
    l2.sort()
    return l1 == l2

def LuluCompare(l1, l2):
    l1.sort()
    l2.sort()
    s1 = delim.join(l1).lower()
    s2 = delim.join(l2).lower()
    return s1 == s2

if __name__=='__main__':

    SIZE = 10

    l1 = [l1str() for x in range(SIZE)]
    l2 = [l2str(s1) for s1 in l1]

    print l1
    print l2

    print

    start = clock()
    print 'Delaney Result:', DelaneyCompare(l1, l2)
    print 'Time:', clock()-start

    start = clock()
    print 'Lulu Result:   ', LuluCompare(l1, l2)
    print 'Time:', clock()-start

    SIZE = 10000

    l1 = [l1str() for x in range(SIZE)]
    l2 = [l2str(s1) for s1 in l1]

    print

    start = clock()
    print 'Delaney Result:', DelaneyCompare(l1, l2)
    print 'Time:', clock()-start

    start = clock()
    print 'Lulu Result:   ', LuluCompare(l1, l2)
    print 'Time:', clock()-start

---------- Run ----------
['aebcd', 'dcabe', 'adebc', 'acebd', 'abcde', 'abcde', 'abcde', 'abcde',
'abcde', 'baecd']
['AebCD', 'dCAbe', 'aDeBC', 'aCebd', 'abcde', 'Abcde', 'abcDe', 'aBCDE',
'Abcde', 'bAeCd']

Delaney Result: 1
Time: 0.000317816168981
Lulu Result:    0
Time: 0.000125518683322

Delaney Result: 1
Time: 0.169594990082
Lulu Result:    0
Time: 0.0750326542647

Quote:
> I like my version about 10x better :-).

I like my version infinitely better :)

In your case, it didn't matter what order the data sorted in - it would
always end up giving the same result since each entry was not only already
lowercase, but all the entries were the same sequence of characters. In my
case, I mixed up the order of characters (with each position in the two
lists having the same order of characters) then mixed up the casing.

Tim Delaney



Mon, 25 Oct 2004 11:25:30 GMT  
 comparing lists


|> ...What you probably want to do is add a delimiter
|> between the elements that you are pretty sure will not occur
                                     ^^^^^^^^^^^

|Make that 100% certain will not occur in the original strings ...

Nah...  "pretty sure" is good enough.  Specifically, if you use a random
delimiter of length N, and a list of length K, you can be sure with a
confidence of roughly K/(256**N).  N doesn't have to get very large for
pretty sure to be pretty *damn* sure :-).

|I would suggest just sticking with converting to lowercase, sorting, then
|comparing.
|l1 = map(string.lower, l1)    # or l1 = [x.lower() for x in l1]
|l2 = map(string.lower, l2)    # or l2 = [x.lower() for x in l2]
|l1.sort()
|l2.sort()

Well... here's the thing:

    import string
    from time import clock
    from random import randrange

    dSize = 3
    delim = ''.join(map(chr, [randrange(0,256) for _ in range(dSize)]))

    SIZE = 100000
    l1 = ['aaaaa' for x in range(SIZE)]
    l2 = ['aaaaa' for x in range(SIZE)]

    def DelaneyCompare(l1, l2):
        l1 = map(string.lower, l1)
        l2 = map(string.lower, l2)
        l1.sort()
        l2.sort()
        return (l1==l2)

    def LuluCompare(l1, l2):
        l1.sort()
        l2.sort()
        s1 = delim.join(l1).lower()
        s2 = delim.join(l2).lower()
        return (s1==s2)

    if __name__=='__main__':
        start = clock()
        print 'Comparison Result:', DelaneyCompare(l1, l2)
        print 'Time:', clock()-start

        start = clock()
        print 'Comparison Result:', LuluCompare(l1, l2)
        print 'Time:', clock()-start

I like my version about 10x better :-).

Yours, Lulu...

P.S. Yeah...  I realized I cheated.  The lowercasing is so much faster
for the whole string than in a map() across all the little strings.  In
my hasty defense:  if the strings really are insensitive, why not just
store them in a lower() version when they are added to the list in the
first place?

P.P.S. If I stop cheating, I like my version about 0.95x as well as
Tim's :-).

--

gnosis  | powers of IP- and crypto-tyranny have entered into an unholy
.cx     | alliance...ideas have nothing to lose but their chains.  Unite
        | against "intellectual property" and anti-privacy regimes!
-------------------------------------------------------------------------



Mon, 25 Oct 2004 10:34:20 GMT  
 comparing lists

Quote:

> Suppose I want to do a boolean comparison (equal or not equal) of two
> lists case-insensitively and I don't care about order.  I can do this:

"Don't care about order" -> use dictionaries.  O(N) operation rather than:

Quote:
> 1.  Create temporary copies that are lowercased and use the in operator:

O(N**2)

Quote:
> 2.  Sort temporary copies, then then compare them.

O(N logN)

Here's a O(N) solution:

def compare_via_dicts(L, M):
    def makebag(L):
        dL = {}
        for x in L:
            x = x.lower()
            dL[x] = 1 + dL.get(x, 0)
    return makebag(L) == makebag(M)

Note we use a 'bag' (x -> #instances of x in list), not a set, to take care
of repetitions so ['a', 'A'] will not appear equal to ['a', 'A', 'A'], &c.

Alex



Mon, 25 Oct 2004 14:37:31 GMT  
 comparing lists
[Alex Martelli]

Quote:
> Here's a O(N) solution:

> def compare_via_dicts(L, M):
>     def makebag(L):
>         dL = {}
>         for x in L:
>             x = x.lower()
>             dL[x] = 1 + dL.get(x, 0)
>     return makebag(L) == makebag(M)

> Note we use a 'bag' (x -> #instances of x in list), not a set, to take care
> of repetitions so ['a', 'A'] will not appear equal to ['a', 'A', 'A'], &c.

Checking the lengths of the lists would also work to prevent ['a'] ==
['a', 'a'].  Thank you for this code.

Cheers,

// mark



Tue, 26 Oct 2004 20:57:15 GMT  
 
 [ 13 post ] 

 Relevant Pages 

1. NEWBIE QUESTIION: Comparing Lists in Python

2. Comparing Lists

3. compare lists

4. comparing lists

5. Comparing list element with 'sant

6. comparing lists

7. comparing lists and/or tuples 'by pointer'

8. How to compare infinite lists?

9. comparing all values of a list to regex

10. How to compare two lists ?

11. newbie: comparing mutually recursive lists

12. Comparing 2 lists

 

 
Powered by phpBB® Forum Software