Minimal diffs in difflib? 
Author Message
 Minimal diffs in difflib?

When asking about using difflib to implement version control, I was
pointed to the get_opcodes method of SequenceMatcher by Tim Peters,
and that was indeed helpful. And after looking a bit at difflib, I
think the "intuitive" deltas computed are very nice for showing users
the difference between one revision and the previous one.

However, I think it would be nice to be able to compute minimal deltas
too, to increase the compression of the repository, and so I started
wondering... Would it be interesting to include the Levenshtein
algorithm in difflib as well? (Or possibly add a separate module or
something?) The basic algorithm is very small/simple[1], so adding it
wouldn't be much of a burden to the standard lib, would it? (And PHP
has it, so why shouldn't we? <wink>)

On the other hand, even though this might be a useful/natural
algorithm to have in a diff lib, I may not save that much space with
it... I guess I had better do some experimenting. Perhaps using zlib
or gzlib is better <wink>

[1] http://www.*-*-*.com/

Note: This implementation only computes the distance, not the delta,
but the change needed is minimal.

--
Magnus Lie Hetland                                  The Anygui Project
http://www.*-*-*.com/                                   http://www.*-*-*.com/



Tue, 22 Jun 2004 02:05:02 GMT  
 Minimal diffs in difflib?
[Magnus Lie Hetland]

Quote:
> When asking about using difflib to implement version control, I was
> pointed to the get_opcodes method of SequenceMatcher by Tim Peters,
> and that was indeed helpful. And after looking a bit at difflib, I
> think the "intuitive" deltas computed are very nice for showing users
> the difference between one revision and the previous one.

Indeed, I dreamt up the SequenceMatcher algorithm in the early 80's *for* a
source-control system, precisely because "minimal diffs" were too often
unintuitive:  patches are read more often by people than by source control
systems, and saving a few bytes wasn't worth it even back then.

Quote:
> However, I think it would be nice to be able to compute minimal deltas
> too, to increase the compression of the repository, and so I started
> wondering...

Quantify first.  Except in pathological cases, SequenceMatcher diffs are
usually size-competitive with minimal diffs, they simply don't synch up on
ridiculously accidental matches.  In non-pathological cases, a synch tool
generally has *many* potential matches to choose from, and most equally good
in the end (wrt total diff size); SequenceMatcher strives to pick the pair
with maximal matching local context, which very often corresponds to how the
source was actually edited.

Quote:
> Would it be interesting to include the Levenshtein algorithm in difflib
> as well?

Not to me <wink>.

Quote:
> (Or possibly add a separate module or something?)

difflib may turn into a pkg someday, if enough stuff gets added to it.  In
the meantime, other diff methods "belong" there.

Quote:
> The basic algorithm is very small/simple[1], so adding it
> wouldn't be much of a burden to the standard lib, would it? (And PHP
> has it, so why shouldn't we? <wink>)

An industrial-strength version isn't small or simple -- study your Unix diff
source code for why.


Tue, 22 Jun 2004 02:59:12 GMT  
 
 [ 2 post ] 

 Relevant Pages 

1. Difflib question

2. difflib and analyzing filetrees

3. Teaching difflib.Differ new tricks

4. Deltas with difflib?

5. difflib

6. difflib optimization -- wow (tim peters?)

7. diffs for APL-11?

8. Array sorting in Mawk (diffs)

9. gawk-3.0.6-networking-diffs

10. Dictionary Diffs

11. Diffs for GNU Emacs Eiffel mode to fix comment indenting

12. diffs between US and International 5.3a and 5.3b patches

 

 
Powered by phpBB® Forum Software