Subject: Re: sort(1) opens too many files
To: Dave Huang <khym@azeotrope.org>
From: Greg A. Woods <woods@weird.com>
List: current-users
Date: 04/06/2001 13:56:01
[[ I've not really studied the sort code in detail, though I have a
general idea of how it probably works and some of the details should
still be correctly documented in the comments and naming conventions. ]]

[ On Friday, April 6, 2001 at 02:03:29 (-0500), Dave Huang wrote: ]
> Subject: Re: sort(1) opens too many files
>
> Well, I have no idea how the algorithm works, but the temp files it
> makes seem to be about DEFLLEN bytes long. If I bump that number up from
> 64K to 1 meg, sorting /usr/share/dict/words makes 7 ~1meg temp files,
> rather than 120 or so ~64K temp files. It's a bit faster too... I
> would've expected it to allocate more memory, but it doesn't seem to be
> any worse; it looks like it allocates enough memory to hold the input
> file.

It definitely would allocate more memory up front, since DEFLLEN is used
to initialise bufsize, which is the size passed to malloc() when
allocating the data buffer.  However, if I read things correctly, this
is just the record (line) length, and bufsize can be multiplied by two
in some cases before the buffer is realloc()ed.

So indeed changing the value of DEFLLEN shouldn't much affect the amount
of memory that ends up allocated for a given data set.

What's surprising is that changing DEFLLEN affects the size of the
temporary files!

> BTW, with a DEFLLEN of 64k, sort can't sort this file, even with my
> descriptor limit at the kern.maxfiles limit of 1772:

Didn't someone say that the sorted temporary files are not closed after
they're written?  If they are unlinked immediately after they're
created, then closing them would make them disappear at once, and that
of course means that sort must hold them all open between the time it
writes them and the time it reads and merges them.
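
To illustrate the unlink-while-open behaviour described above, here is a
minimal sketch.  It is not taken from the sort source; the function name
unlinked_tmp_demo() and the path template are made up for this example.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a temporary file, unlink it at once, and confirm that the data
 * is still reachable through the open descriptor (and only through it).
 * The path template is an arbitrary choice for this demo.
 * Returns 0 on success, -1 on any failure.
 */
int
unlinked_tmp_demo(void)
{
	char path[] = "/tmp/sortdemo.XXXXXX";
	const char msg[] = "sorted run\n";
	const size_t len = sizeof(msg) - 1;
	char buf[32];
	ssize_t n;
	int fd;

	if ((fd = mkstemp(path)) == -1)
		return -1;
	if (unlink(path) == -1) {
		close(fd);
		return -1;
	}
	/* The name is gone; only the descriptor keeps the file alive. */
	assert(access(path, F_OK) == -1);

	if (write(fd, msg, len) != (ssize_t)len ||
	    lseek(fd, 0, SEEK_SET) == -1) {
		close(fd);
		return -1;
	}
	n = read(fd, buf, sizeof(buf));
	close(fd);		/* the storage is reclaimed here */
	return (n == (ssize_t)len && memcmp(buf, msg, len) == 0) ? 0 : -1;
}
```

With this pattern every temporary file costs one descriptor for as long
as its contents are needed, which is presumably how the descriptor limit
gets hit.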

I suspect sort could sort a file of any size if three things were
changed: the unlinking were saved until after the temporary files have
been merged; struct tempfile held the names of the temporary files
instead of open file descriptors; and each file were closed immediately
after it is initially written and then re-opened for merging.  The only
limit would then be whether there's enough temporary space to hold the
temporary merge files on whatever partition is pointed to by $TMPDIR
(or '-T tempdir' if given on the command line; '-t' is the field
separator).
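
A rough sketch of that close-and-reopen idea, under stated assumptions:
struct named_tmp and the tmp_*() helpers below are hypothetical names
invented for illustration, not the real struct tempfile or any function
in the sort source, and error handling is kept to the bare minimum.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical entry that remembers a name rather than a descriptor. */
struct named_tmp {
	char path[1024];
};

/* Write one sorted run to a new file under dir, then close it. */
int
tmp_write_run(struct named_tmp *t, const char *dir, const char *data)
{
	FILE *fp;
	int fd, n;

	n = snprintf(t->path, sizeof(t->path), "%s/sort.XXXXXX", dir);
	if (n < 0 || (size_t)n >= sizeof(t->path))
		return -1;
	if ((fd = mkstemp(t->path)) == -1)
		return -1;
	if ((fp = fdopen(fd, "w")) == NULL) {
		close(fd);
		unlink(t->path);
		return -1;
	}
	if (fputs(data, fp) == EOF) {
		fclose(fp);
		unlink(t->path);
		return -1;
	}
	if (fclose(fp) == EOF) {	/* descriptor is released here */
		unlink(t->path);
		return -1;
	}
	return 0;
}

/* Re-open the run by name when the merge pass needs it. */
FILE *
tmp_reopen(const struct named_tmp *t)
{
	return fopen(t->path, "r");
}

/* Close the run and only now remove its name. */
void
tmp_discard(struct named_tmp *t, FILE *fp)
{
	if (fp != NULL)
		fclose(fp);
	unlink(t->path);
}
```

The trade-off is that the files are visible in the temporary directory
until the merge finishes, so they would need to be cleaned up if sort is
interrupted, but only the runs being merged at any one moment have to be
open.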

However there's some confusion in my mind over MERGE_FNUM and the fact
that you can't specify more than ((1000 - (16+1))*16), i.e. 15728, input
files when you're using the '-m' option.

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>     <woods@robohack.ca>
Planix, Inc. <woods@planix.com>;   Secrets of the Weird <woods@weird.com>