Subject: GNU "diff" considered damaged
To: Kahari, Andreas <andreas.kahari@agresearch.co.nz>
From: Mike Cheponis <mac@Wireless.Com>
List: port-i386
Date: 07/12/2001 13:30:01
Why are we using the braindead GNU diff?  Can't we get one that actually
works?

Thanks -Mike

p.s. Tnx for your suggestion about "cmp".  But having both "diff" and "cmp"
is sort of like having both "ls" and (a currently non-existent) "dir", yes?

On Fri, 13 Jul 2001, Kahari, Andreas wrote:

> Date: Fri, 13 Jul 2001 08:48:20 +1200
> From: "Kahari, Andreas" <andreas.kahari@agresearch.co.nz>
> Subject: RE: "diff" loses on large files?
>
> This is well documented in the GNU diff info manual.
> The proposed workaround is to compare checksums (see info doc
> or e.g. "http://www.gnu.org/manual/diffutils-2.7/html_node/diff_90.html").
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Handling Files that Do Not Fit in Memory

diff operates by reading both files into memory. This method fails if the files are too large, and diff should have a fallback.

One way to do this is to scan the files sequentially to compute hash codes of the lines and put the lines in equivalence classes based only on hash code. Then compare the files normally. This does produce some false matches.

Then scan the two files sequentially again, checking each match to see whether it is real. When a match is not real, mark both the "matching" lines as changed. Then build an edit script as usual.

The output routines would have to be changed to scan the files sequentially looking for the text to print.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
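
The two-pass scheme in the excerpt above can be sketched roughly as follows.
This is only an illustration in Python, not how GNU diff is implemented: the
function names (line_hashes, candidate_matches, verify_matches) are made up
here, zlib.crc32 stands in for whatever line hash a real implementation would
use, and difflib.SequenceMatcher stands in for "compare the files normally".
The sketch still keeps one integer per line in memory, so it reduces memory
use rather than eliminating it.

```python
import difflib
import zlib


def line_hashes(path, hash_fn=zlib.crc32):
    """Pass 1: scan a file sequentially, keeping only a hash per line."""
    with open(path, "rb") as f:
        return [hash_fn(line) for line in f]


def candidate_matches(hashes_a, hashes_b):
    """Compare the hash sequences; return (i, j) line pairs that look equal.

    Lines are grouped purely by hash code, so some pairs may be false matches.
    """
    matcher = difflib.SequenceMatcher(None, hashes_a, hashes_b, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        pairs.extend((block.a + k, block.b + k) for k in range(block.size))
    return pairs


def verify_matches(path_a, path_b, pairs):
    """Pass 2: re-read both files sequentially and keep only real matches.

    `pairs` must be increasing in both indices (as candidate_matches returns
    them), so each file is read front to back exactly once. Pairs that fail
    the check are dropped, i.e. both lines are treated as changed.
    """
    real = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        ia = ib = -1
        line_a = line_b = None
        for i, j in pairs:
            while ia < i:  # advance file A to line i
                line_a = fa.readline()
                ia += 1
            while ib < j:  # advance file B to line j
                line_b = fb.readline()
                ib += 1
            if line_a == line_b:
                real.append((i, j))
    return real
```

A caller would run line_hashes on both files, feed the results to
candidate_matches, and pass the pairs to verify_matches. Any line that does
not appear in the verified pairs is treated as changed when the edit script
is built, and printing the output would need one more sequential scan to
fetch the text of those lines.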