Comparing two lists for differences 
Author Message
 Comparing two lists for differences

Hello.

I need a proc that will basically compare srclist with destlist, and
will return two lists:

- list of items to add
- list of items to delete

Here's my working example:

proc lcompare {l1 l2} {
     set nl [list]; # new elements list
     set rl [list]; # removed elements list
     foreach le $l1 {
         if {[lsearch -exact $l2 $le]<0} {
             lappend rl $le
         }
     }
     foreach le $l2 {
         if {[lsearch -exact $l1 $le]<0} {
             lappend nl $le
         }
     }
     return [list $nl $rl]

Quote:
}

The problem is that I will need to use this function to compare larger
lists, most of which remain unchanged.

When comparing lists of about 7000 elements, it takes 10s to compare it
on a P3/850. And my software will probably have to compare about 50k
elements...

I'm looking for a way (Tcl or C, or both :) to do it faster.

For now I guess that if I'll sort these lists and compare them at once,
it should speed up. This is just a though though...

Has anyone done similar things before? Any clues or maybe some code? :-)

--
WK

"Data typing is an illusion. Everything is a sequence of bytes."
                                                              -Todd Coram



Wed, 01 Dec 2004 21:22:36 GMT  
 Comparing two lists for differences
##
## Using plain Tcl
## This is good for large list, bad for small list
##
foreach x $srclist {set srcArray ($x) 1}
foreach x $destist {set destArray ($x) 1}
set addList {}
set delList {}
foreach x $srcList {
    if {![info exists destArray($x)]} then {
        lappend addList $x
    }
Quote:
}

foreach x $destList {
    if {![info exists srcArray($x)]} then {
        lappend delList $x
    }

Quote:
}

##
## Using TclX
##
set results [intersect3 lista listb
set addList [lindex $results 0]
set delList [lindex $results 2]

Quote:

> Hello.

> I need a proc that will basically compare srclist with destlist, and
> will return two lists:

> - list of items to add
> - list of items to delete

> Here's my working example:

> proc lcompare {l1 l2} {
>      set nl [list]; # new elements list
>      set rl [list]; # removed elements list
>      foreach le $l1 {
>          if {[lsearch -exact $l2 $le]<0} {
>              lappend rl $le
>          }
>      }
>      foreach le $l2 {
>          if {[lsearch -exact $l1 $le]<0} {
>              lappend nl $le
>          }
>      }
>      return [list $nl $rl]
> }

> The problem is that I will need to use this function to compare larger
> lists, most of which remain unchanged.

> When comparing lists of about 7000 elements, it takes 10s to compare it
> on a P3/850. And my software will probably have to compare about 50k
> elements...

> I'm looking for a way (Tcl or C, or both :) to do it faster.

> For now I guess that if I'll sort these lists and compare them at once,
> it should speed up. This is just a though though...

> Has anyone done similar things before? Any clues or maybe some code? :-)

> --
> WK

> "Data typing is an illusion. Everything is a sequence of bytes."
>                                                               -Todd Coram

--
+--------------------------------+---------------------------------------+
| Gerald W. Lester               | "The man who fights for his ideals is |

+--------------------------------+---------------------------------------+


Wed, 01 Dec 2004 23:19:43 GMT  
 Comparing two lists for differences

Quote:

> Hello.

> I need a proc that will basically compare srclist with destlist, and
> will return two lists:

> - list of items to add
> - list of items to delete

This sounds suspiciously like sequence comparison done in bioinformatics.
The keyword to search for is BLAST.

Another hint is the Levensthein
distance for strings which, based on linear programming, computes an edit
distance between two strings based on weighting the edit operations
delete, replace, insert. A bit of added value on that algorithm spits out
the edit operations you require.

Note: The sequence of edit operations is not unique in the general case.

  Harald Kirsch



Thu, 02 Dec 2004 02:48:23 GMT  
 Comparing two lists for differences

Quote:

> [cut]
> ##
> ## Using TclX
> ##
> set results [intersect3 lista listb
> set addList [lindex $results 0]
> set delList [lindex $results 2]

Damn ;-) I need to learn the TclX stuff, I see there's a lot I could use
there...

I wrote my own C function in the meantime, so I'll just compare the
speed on 100k lists and choose the faster method...

But thanks a lot.

--
WK

"Data typing is an illusion. Everything is a sequence of bytes."
                                                              -Todd Coram



Thu, 02 Dec 2004 03:15:25 GMT  
 Comparing two lists for differences

Quote:

>I need a proc that will basically compare srclist with destlist, and
>will return two lists:
>- list of items to add
>- list of items to delete

Does the order of the items matter?  What about duplicate
elements?  (In the example code you posted, the order does not
matter and duplicates are ignored; i.e., you're treating
the lists as sets rather than as bags or sequences.  Just
wanted to make sure that is what is intended.)

Quote:
>Here's my working example:
> [... snipped ...]

>When comparing lists of about 7000 elements, it takes 10s to compare it
>on a P3/850. And my software will probably have to compare about 50k
>elements...
>I'm looking for a way (Tcl or C, or both :) to do it faster.
>For now I guess that if I'll sort these lists and compare them at once,
>it should speed up.

That would definitely speed things up.  A *lot*.

But there's a simpler way to do it:  record the list elements
in an array first and use [info exists] instead of [lsearch]
to test for membership.  [lsearch] will examine (on average)
half of the list items, whereas [info exists] takes (essentially)
constant time.  So instead of taking time proportional to the
product of the lengths of the lists, it takes time proportional
to their sum.

    # warning: untested code.
    #
    proc lcompare {l1 l2} {
        set nl [list]; # new elements list
        foreach e1 $l1 { set A1($e1) {} }
        foreach e2 $l2 { if {![info exists A1($e2)]} { lappend nl $e2 } }

        set rl [list]; # removed elements list
        foreach e2 $l2 { set A2($e2) {} }
        foreach e1 $l1 { if {![info exists A2($e1)]} { lappend rl $e1 } }

        return [list $nl $rl]
    }

If you want to treat the lists as bags instead of sets (i.e.,
duplicated elements make a difference but the order doesn't),
then sorting both lists first and doing a merge-comparison is
probably the best approach.

To treat them as sequences (i.e., the order also makes a difference),
it gets a bit more complicated.  Search the wiki (http://wiki.tcl.tk)
for "Longest Common Subsequence" for more details.

--Joe English




Thu, 02 Dec 2004 06:46:50 GMT  
 Comparing two lists for differences

Quote:


>>I need a proc that will basically compare srclist with destlist, and
>>will return two lists:
>>- list of items to add
>>- list of items to delete

> Does the order of the items matter?  What about duplicate
> elements?  (In the example code you posted, the order does not
> matter and duplicates are ignored; i.e., you're treating
> the lists as sets rather than as bags or sequences.  Just
> wanted to make sure that is what is intended.)

It is intended to work on email address list, mostly to create a
difference lists. So the elements will not have duplicates and the order
does not matter. Speed does matter ;)

Quote:
>>Here's my working example:
>>[... snipped ...]

>>When comparing lists of about 7000 elements, it takes 10s to compare it
>>on a P3/850. And my software will probably have to compare about 50k
>>elements...
>>I'm looking for a way (Tcl or C, or both :) to do it faster.
>>For now I guess that if I'll sort these lists and compare them at once,
>>it should speed up.

> That would definitely speed things up.  A *lot*.

Yeap. Turned out to be a lot faster, even than intersect3 from TclX. I'm
using lsort from Tcl, then call my internal C function and presto.

Works out two lists of 400k elements in about 6s on P3/850 - intersect3
takes over 10s.

Quote:
> But there's a simpler way to do it:  record the list elements
> in an array first and use [info exists] instead of [lsearch]
> to test for membership.  [lsearch] will examine (on average)
> half of the list items, whereas [info exists] takes (essentially)
> constant time.  So instead of taking time proportional to the
> product of the lengths of the lists, it takes time proportional
> to their sum.

I'm not sure if the foreach loop with set A1($e1) {} would slow things
up way too much... Besides, the 3rd foreach loop could be added to 2nd
loop - would also speed it up a bit.

But, since I have the Tcl+C version already, I guess it's done. Thanks.

--
WK

"Data typing is an illusion. Everything is a sequence of bytes."
                                                              -Todd Coram



Thu, 02 Dec 2004 16:58:28 GMT  
 Comparing two lists for differences

Quote:

> Hello.

> I need a proc that will basically compare srclist with destlist, and
> will return two lists:

> - list of items to add
> - list of items to delete

Seems like set operations (set difference, symmetrical set difference)

See
        http://mini.net/tcl/SetOps

for various implementations in tcl with timing results.

See also
        http://www.purl.org/net/akupries/soft/setops/index.html

for something in C.

--
Sincerely,

        Join us in Sept. for Tcl'2002: http://www.tcl.tk/community/tcl2002/


        Private         <http://www.purl.org/NET/akupries/>
-------------------------------------------------------------------------------

Quote:
}



Sat, 04 Dec 2004 22:19:09 GMT  
 
 [ 7 post ] 

 Relevant Pages 

1. How to compare two lists ?

2. finding and listing the differences between two files

3. difference between two lists

4. how to compare two columns in two files?

5. selecting two items in two list boxes

6. Comparing any dates and getting difference in days

7. reading two files and comparing them

8. comparing two files

9. Comparing two databases.

10. Comparing fields in a two records

11. comparing two arrays with same length

12. comparing two dbfs record by record

 

 
Powered by phpBB® Forum Software