tech-userlevel: Re: bin/10625: /usr/bin/cmp is unable to compare rather large files

Subject: Re: bin/10625: /usr/bin/cmp is unable to compare rather large files
To: Jaromír Doleček <dolecek@ibis.cz>
From: R. C. Dowdeswell <elric@mabelode.imrryr.org>
List: tech-userlevel
Date: 07/27/2000 14:04:47

On 964723415 seconds since the Beginning of the UNIX epoch
Jaromír Doleček wrote:
>
>> SIZE_T_MAX / 2 is still too large, because it does not account for
>> the kernel address space and the memory that is used by cmp for
>> other purposes.
>
>Actually, there's no need to take other memory users into
>consideration. Since cmp(1) uses mmap(..MAP_SHARED|MAP_FILE) and
>madvise(2)s the kernel with MADV_SEQUENTIAL, it's using just one real
>memory page for each compared file at any time (the no longer
>needed memory pages are continuously freed). MADV_SEQUENTIAL is cool :)

But you must consider that the process' address space needs to be
_able_ to address all of the memory at sequential pointer locations.
So, although all of the file is not in memory, the process must
have enough virtual memory reserved for this use.

>> Probably the best approach would be to use mmap(2)
>> in a manner more consistent with how one would use read(2), i.e.:
>> mmap(2) relatively small chunks of the file in a loop.
>
>"why bother with mmap() then?" :)
>Yes, this should probably be done. But it's too hairy to find
>out what is the maximum usable "relatively small chunk" on given
>machine/OS, so it's probably just easier to mmap() whole file
>and if it's too big, just use read().

The advantage of mmap(2) is not that it makes writing the program
noticeably easier; it is that you avoid copying all of the data into
the process's address space.

>Files up to about 3.6GB should be ok even on 32bit machines (or
>two files which have together this size). If not, that would be a bug.
>Note that the original mmap(2) call in cmp(1) failed for the files because it
>used MAP_PRIVATE - once that has been removed or substituted
>by MAP_SHARED, the mmap(2) succeeded.
>
>3.6GB is the maximum address space available for userland programs
>on i386 ATM IIRC.

Only if given a perfect allocation scheme.  But mmap(2) must choose where
to put the region in the process's address space without prior knowledge
of the next mmap(2) call, and there may be further limitations based on
the choices and tradeoffs that it makes.  For example, one has to leave
sbrk(2) some room to increase the size of the heap, or the program's next
malloc(3) may fail.  So you can't necessarily just tack mmap(2)'s region
onto the end of the heap.  A similar argument applies to the stack.

 == Roland Dowdeswell                      http://www.Imrryr.ORG/~elric/  ==
 == The Unofficial NetBSD Web Pages        http://www.Imrryr.ORG/NetBSD/  ==
 == The NetBSD Project                            http://www.NetBSD.ORG/  ==