Subject: Re: bin/10625: /usr/bin/cmp is unable to compare rather large files
To: NetBSD GNATS submissions and followups <gnats-bugs@gnats.netbsd.org>
From: Greg A. Woods <woods@weird.com>
List: netbsd-bugs
Date: 07/26/2000 23:33:22
[ On Wednesday, July 26, 2000 at 18:45:44 (-0700), R. C. Dowdeswell wrote: ]
> Subject: Re: bin/10625: /usr/bin/cmp is unable to compare rather large files 
>
> Well, mmap() by its very operation can't handle files that are too
> large, especially on 32bit arches.  Mmap(2) will map the entire
> file into the process's address space, which is limited.  Looks
> like the above files together total 2.6GB and cmp only uses two
> mmaps so it is conceivable that problems might occur.

Ah yes -- I guess I didn't really pay attention to that little detail....

> The issue is not whether read(2) or mmap(2) is used, but rather
> the fact that cmp tries to get both files in their entirety into
> its process address space.

I think those two issues are orthogonal, no?  Avoiding mmap() would have
prevented this bug from ever even peeping over the horizon.
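
A minimal sketch of the kind of read(2)-based loop that sidesteps the
address-space limit, since it only ever holds two fixed-size buffers.
The names cmp_fds, read_full, and CMP_BUFSIZE are invented for this
illustration and are not taken from the cmp source; the 64 KB chunk
size is an arbitrary assumption.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CMP_BUFSIZE 65536	/* arbitrary chunk size (assumption) */

/*
 * Read up to len bytes, retrying on short reads and EINTR.
 * Returns the byte count (less than len only at EOF), or -1 on error.
 */
static ssize_t
read_full(int fd, unsigned char *buf, size_t len)
{
	size_t got = 0;

	while (got < len) {
		ssize_t n = read(fd, buf + got, len - got);

		if (n == 0)
			break;		/* EOF */
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		got += (size_t)n;
	}
	return (ssize_t)got;
}

/*
 * Hypothetical helper: 0 if the streams are identical, 1 if they
 * differ, -1 on I/O error.  Memory use is constant regardless of
 * file size.
 */
int
cmp_fds(int fd1, int fd2)
{
	static unsigned char b1[CMP_BUFSIZE], b2[CMP_BUFSIZE];

	assert(fd1 >= 0 && fd2 >= 0);
	for (;;) {
		ssize_t n1 = read_full(fd1, b1, sizeof(b1));
		ssize_t n2 = read_full(fd2, b2, sizeof(b2));

		if (n1 < 0 || n2 < 0)
			return -1;
		if (n1 != n2 || memcmp(b1, b2, (size_t)n1) != 0)
			return 1;	/* length or content mismatch */
		if (n1 == 0)
			return 0;	/* both hit EOF together */
	}
}
```

Because the loop never needs the file length, it works the same on
pipes and on files of any size.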

I can see how mmap() can be more efficient in terms of CPU resources and
memory throughput (buffers never have to be copied to user-space if the
kernel does the mapping and page-faulting carefully) but in the case of
a program like cmp there's almost always going to be lots of time spent
waiting for disks to spin (and I suspect the ratio of secondary storage
access speeds to CPU/RAM speeds is fairly constant across all major
classifications of machines too).

So, given the extra complications and complexities that mmap()
introduces into cmp I've still got to wonder at the sanity of it all.
Is mmap() the hammer in every programmer's tool kit these days?

What's even more silly is that if indeed mmap() is a much more efficient
way to access files (at least those smaller than SIZE_T_MAX :-) then
wouldn't it be a heck of a lot smarter and simpler to write this code
once in a stdio-like library (pkgsrc/devel/sfio already does this, for
example) and then keep all the applications dead simple and
straightforward?

In fact, now that SFIO is really freeware (well, no fee for
redistribution is permitted), perhaps it should be considered directly.

>  One would have the same problem if
> one tried to compare these two files with only two read(2)s.  :-)

Nobody in their right mind would try to do that!  ;-)

> In fact perusing the source code (prior to being fixed), it will
> only fail for large files that are not too large on 32bit
> architectures.  :-)

Yup!

However I'm not sure the bug is *really* fixed.  Shouldn't the test used
to avoid mmap() be to see if the length is bigger than (SIZE_T_MAX / 2)?
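
Roughly, the test suggested here could look like the sketch below.  It
is an illustration only, not the check in the committed fix: the name
both_mappable is invented, and the address-space size is passed in as a
parameter (a caller would presumably pass SIZE_T_MAX, or SIZE_MAX from
<stdint.h>) so the logic can be exercised with small numbers.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>

/*
 * Hypothetical guard: nonzero if each of two file lengths is no
 * bigger than half of an address space of `space` bytes.  This is
 * at best a necessary condition; text, stack, and other mappings
 * also consume address space, so mmap() may still fail.
 */
int
both_mappable(off_t len1, off_t len2, uintmax_t space)
{
	const uintmax_t half = space / 2;

	assert(len1 >= 0 && len2 >= 0);
	return (uintmax_t)len1 <= half && (uintmax_t)len2 <= half;
}
```

A caller would fall back to plain read(2) whenever
both_mappable(sb1.st_size, sb2.st_size, SIZE_MAX) returns zero.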

Does mmap() really fail if the file is bigger than the *remaining* size
of the process's address space (the manual page makes no obvious claims
along these lines)?
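
Whatever the manual page says, a caller can avoid depending on the
answer by checking the mmap() return value and falling back to read(2)
on failure.  POSIX does list ENOMEM for the case where there is
insufficient room in the address space to effect the mapping.  A sketch
under those assumptions follows; map_or_null is a made-up name, not a
function from cmp.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a whole file read-only.  Returns NULL
 * (so the caller can fall back to a read(2) loop) if the file
 * cannot be opened, is empty, is too long for a size_t, or if
 * mmap() itself fails, e.g. with ENOMEM.
 */
const unsigned char *
map_or_null(const char *path, size_t *lenp)
{
	struct stat sb;
	void *p = MAP_FAILED;
	int fd;

	assert(path != NULL && lenp != NULL);
	if ((fd = open(path, O_RDONLY)) < 0)
		return NULL;
	if (fstat(fd, &sb) == 0 && sb.st_size > 0 &&
	    (uintmax_t)sb.st_size <= SIZE_MAX)
		p = mmap(NULL, (size_t)sb.st_size, PROT_READ,
		    MAP_PRIVATE, fd, 0);
	close(fd);	/* an established mapping outlives the descriptor */
	if (p == MAP_FAILED)
		return NULL;
	*lenp = (size_t)sb.st_size;
	return p;
}
```

The caller is expected to munmap() the region when done with it.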

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>