Subject: Re: sort(1) opens too many files
To: Geoff Wing <gcw@pobox.com>
From: Dave Huang <khym@azeotrope.org>
List: current-users
Date: 04/06/2001 02:03:29
On Thu, 5 Apr 2001, Geoff Wing wrote:
> At the time, I had a quick look through the algorithm to see if any
> of the main stacks could have their limits bumped up, e.g. increase
> MAXFCT or MAXNUM or whatever, but didn't follow through.  I just
> presumed there would be performance issues involved as well.
> But since you're so willing to look at it . . . . :-)

Well, I have no idea how the algorithm works, but the temp files it
makes seem to be about DEFLLEN bytes long. If I bump that number up from
64K to 1 meg, sorting /usr/share/dict/words makes 7 ~1meg temp files,
rather than 120 or so ~64K temp files. It's a bit faster too... I
would've expected it to allocate more memory, but it doesn't seem to be
any worse; it looks like it allocates enough memory to hold the input
file.
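
As a rough check on those counts, here is a minimal sketch in Python. It assumes every temp file is exactly DEFLLEN bytes and that the total amount of temp data stays the same when DEFLLEN changes; neither is confirmed by the sort source. The function name and the 7 MiB total (taken from the 7 ~1meg files above) are illustrative only.

```python
KIB = 1024
MIB = 1024 * KIB


def estimated_temp_files(temp_bytes: int, run_len: int) -> int:
    """Ceiling division: how many run_len-sized files hold temp_bytes."""
    return -(-temp_bytes // run_len)


# Assumption: about 7 MiB of temp data in total, as observed with 1 MiB runs.
total = 7 * MIB
small_runs = estimated_temp_files(total, 64 * KIB)
large_runs = estimated_temp_files(total, 1 * MIB)
print(small_runs, large_runs)
```

Under those assumptions the 64K case comes out at 112 files, in the same ballpark as the "120 or so" observed.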

BTW, with a DEFLLEN of 64K, sort can't sort this file, even with my
descriptor limit at the kern.maxfiles limit of 1772:

-rw-r-----  1 khym  wheel  32738031 Apr  6 01:41 mbox

# time /usr/bin/sort ~khym/mbox > /dev/null
sort: Invalid argument
7.746u 9.754s 0:47.32 36.9%     0+0k 0+1146io 0pf+0w

whereas with DEFLLEN bumped to 1 meg, it seems to work fine:
# time obj.alpha/sort ~khym/mbox > /dev/null
6.163u 3.814s 0:22.18 44.9%     0+0k 0+165io 0pf+0w
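
To relate a temp file count to a descriptor limit, a small Python sketch follows. It assumes the merge keeps one descriptor open per temp file, which the thread subject suggests but which is not verified here. The helper names and the reserve of 3 descriptors for stdin/stdout/stderr are hypothetical.

```python
import resource


def fits_in_limit(n_files: int, soft_limit: int, reserved: int = 3) -> bool:
    """True if n_files plus the reserved descriptors fit under soft_limit."""
    return n_files + reserved <= soft_limit


def current_soft_limit() -> int:
    """Soft RLIMIT_NOFILE for this process (Unix only)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


# 1772 is the kern.maxfiles value quoted above; 500 and 1770 are made-up counts.
print(fits_in_limit(500, 1772))
print(fits_in_limit(1770, 1772))
```

On the reporting machine the equivalent check would use the shell's descriptor limit rather than Python's resource module.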

FWIW, it passes the sort regression test except for #40, and that one
isn't sort's fault :)

40 (long)
awk: empty regular expression
 source line number 8
 context is
         >>> // <<<

and sure enough, it's trying to do a match on an empty regexp, which
apparently GNU awk is happy with, but real awk isn't. I assume an empty
regexp matches any line, so I just deleted the "//", and that made
everything happy.
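
For what it's worth, the assumption that an empty regexp matches any line can be illustrated with Python's re module. This shows Python's behaviour only, not that of either awk, and the sample lines are made up.

```python
import re

# Made-up sample input, including an empty line and a whitespace-only line.
lines = ["apple", "", "  ", "zebra 42"]

# An empty pattern matches the empty string, which occurs at the start
# of every line, so search() succeeds on all of them.
empty = re.compile("")
matched = [line for line in lines if empty.search(line)]

print(matched == lines)
```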