Subject: Re: GNU "diff" considered damaged
To: Mike Cheponis <mac@Wireless.Com>
From: Greywolf <greywolf@starwolf.com>
List: port-i386
Date: 07/12/2001 18:28:04
On Thu, 12 Jul 2001, Mike Cheponis wrote:

# p.s. Tnx for your suggestion about "cmp".  But having both "diff" and "cmp"
# is sort of like having both "ls" and (a currently non-existent) "dir", yes?

Not really.

Diff will actually show the differences; cmp will merely bail out at the
first one and bitch mightily.  Sometimes you only want to know whether or
not the files differ.  Other times you want context.

...and why diff reads BOTH files ALL THE WAY into memory is beyond me.  You'd
think it could work a chunk at a time (like any sane utility).  512K/file
strikes me as plenty of data to operate on, maybe 1M/file.
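
For the plain "are they different?" case, chunk-at-a-time is easy to sketch.
The Python below is only an illustration of the idea, not how cmp or GNU diff
is actually implemented.  The name first_difference and the 512K default are
my own assumptions, and it assumes regular files, where a short read means
end of file.

```python
CHUNK = 512 * 1024  # 512K per file, the figure floated above; not a tuned value


def first_difference(path_a, path_b, chunk=CHUNK):
    """Return the 0-based offset of the first differing byte, or None if identical.

    Only one chunk per file is held in memory at any time.
    Assumes regular files, where a short read means end of file.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk)
            b = fb.read(chunk)
            if a != b:
                # Locate the first mismatching byte inside this pair of chunks.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No mismatch in the common part: one file is a prefix of the other.
                return offset + min(len(a), len(b))
            if not a:
                # Both reads returned nothing: end of both files, no difference.
                return None
            offset += len(a)
```

This only answers the yes/no question (plus where the first difference is).
Producing a real edit script needs a view of both files as a whole, which is
the harder part.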

# 
# On Fri, 13 Jul 2001, Kahari, Andreas wrote:
# 
# > Date: Fri, 13 Jul 2001 08:48:20 +1200
# > From: "Kahari, Andreas" <andreas.kahari@agresearch.co.nz>
# > Subject: RE: "diff" loses on large files?
# >
# > This is well documented in the diff GNU info manuals.
# > The proposed workaround is to compare checksums (see info doc
# > or e.g. "http://www.gnu.org/manual/diffutils-2.7/html_node/diff_90.html").
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Handling Files that Do Not Fit in Memory
# 
# diff operates by reading both files into memory. This method fails if the files are too large, and diff should have a fallback.
# 
# One way to do this is to scan the files sequentially to compute hash codes of the lines and put the lines in equivalence classes based only on hash code. Then compare the files normally. This does produce some false matches.
# 
# Then scan the two files sequentially again, checking each match to see whether it is real. When a match is not real, mark both the "matching" lines as changed. Then build an edit script as usual.
# 
# The output routines would have to be changed to scan the files sequentially looking for the text to print.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 
# 
# 
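
The fallback described in that excerpt can be sketched roughly as below.
This is a toy in Python, not anything taken from GNU diff: difflib's
SequenceMatcher stands in for "compare the files normally", zlib.crc32 is an
arbitrary choice of hash code, and the function names are made up for
illustration.  It still keeps one hash per line in memory, so it only helps
when the lines are much bigger than the hashes.

```python
import difflib
import zlib


def line_hashes(path, hash_fn=zlib.crc32):
    """Pass 1: scan sequentially, keeping one hash code per line, not the text."""
    with open(path, "rb") as f:
        return [hash_fn(line) for line in f]


def matched_pairs(hashes_a, hashes_b):
    """Compare the hash sequences; return candidate (i, j) line matches."""
    sm = difflib.SequenceMatcher(None, hashes_a, hashes_b, autojunk=False)
    pairs = []
    for block in sm.get_matching_blocks():
        pairs.extend((block.a + k, block.b + k) for k in range(block.size))
    return pairs


def verify(path_a, path_b, pairs):
    """Pass 2: rescan both files sequentially and drop false (collision) matches."""
    real = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        ia = ib = -1
        line_a = line_b = b""
        for i, j in pairs:  # pairs are increasing in both i and j
            while ia < i:
                line_a = fa.readline()
                ia += 1
            while ib < j:
                line_b = fb.readline()
                ib += 1
            if line_a == line_b:
                real.append((i, j))
    return real


def sequential_diff(path_a, path_b, hash_fn=zlib.crc32):
    """Return (removed, added): line indices in A and B that have no real match."""
    ha = line_hashes(path_a, hash_fn)
    hb = line_hashes(path_b, hash_fn)
    real = verify(path_a, path_b, matched_pairs(ha, hb))
    kept_a = {i for i, _ in real}
    kept_b = {j for _, j in real}
    removed = [i for i in range(len(ha)) if i not in kept_a]
    added = [j for j in range(len(hb)) if j not in kept_b]
    return removed, added
```

Passing a deliberately bad hash, e.g. hash_fn=len, forces false matches and
shows the second pass marking both "matching" lines as changed.  Printing the
actual text would take a third sequential scan, as the excerpt notes.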


				--*greywolf;
--
                 A _Real_ Operating System for _Real_ Hackers.

                   ______ _   __     __  ____ _____ ____ 
                  ______ / |_/ /__  / /_/ __ ) ___// __ \
                 ______ /  |/ / _ \/ __/ __  \__ \/ / / /
                ______ / /|  /  __/ /_/ /_/ /__/ / /_/ / 
               ______ /_/ |_/\___/\__/_____/____/_____/  

	 With many thanks to the NetBSD development team and UCB CSRG.