Subject: bin/5646: makewhatis could be smarter w.r.t. gz
To: None <firstname.lastname@example.org>
From: None <email@example.com>
Date: 06/24/1998 01:14:55
>Synopsis: makewhatis could be smarter w.r.t. gzipped *roff source
>Responsible: bin-bug-people (Utility Bug People)
>Arrival-Date: Tue Jun 23 23:20:00 1998
>Originator: Brian Grayson
Parallel and Distributed Systems
Electrical and Computer Engineering
The University of Texas at Austin
>Release: June 22, 1998
NetBSD k9.ece.utexas.edu 1.3F NetBSD 1.3F (K9.new) #23: Mon Jun 22 23:26:39 CDT 1998 firstname.lastname@example.org:/home/src/sys/arch/i386/compile/K9.new i386
makewhatis deals ``intelligently'' with uncompressed
*roff source by passing several file names via xargs to
/usr/libexec/getNAME, thus reducing the number of
processes/forks to substantially fewer than one per man page.
It also deals semi-intelligently with already-*roff'd
output (cat*), though it still uses 1 or 2 processes per man
page; that's another fish to fry.
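To illustrate why the xargs batching matters, here is a small sketch that counts how many times a command gets run for a given number of file names. The name count_batches and the page*.1 file names are made up for the demonstration, and -n is used only to force a fixed batch size; the real script may batch differently.

```sh
# count_batches TOTAL PER_RUN
# Generate TOTAL dummy man page names and print how many times xargs
# runs the command when it passes at most PER_RUN names per run.
count_batches() {
    awk -v n="$1" 'BEGIN { for (i = 1; i <= n; i++) print "page" i ".1" }' |
        xargs -n "$2" echo |
        wc -l |
        tr -d ' '
}
```

With 10 names, one name per run costs 10 invocations (count_batches 10 1), while four names per run costs only 3 (count_batches 10 4).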
A MAJOR deficiency in the current version is its handling
of gzipped *roff source files. It gzcat's the file, then
pipes it to nroff -man (thus converting the _whole_
document), and then uses sed to grab the interesting
info, which is usually at the very top of the document.
I use the MANZ option, and running makewhatis took:
1692.169u 516.891s 37:02.59 99.3% 0+0k 4109+1046io 20pf+0w
on a P-90 with everything local. From timings, doing
all those nroff's and sed's was responsible for more
than 90% of the user CPU time.
There are several different ways to improve makewhatis
to handle gzipped *roff source more efficiently:
1. gzcat the source to a /tmp file with the correct name
(mail.1.gz --> /tmp/mail.1 -- use sed to do
this conversion), then run getNAME on that file,
and pipe the result through the sed '\\-' cleanup script.
2. gzcat the file to stdin of getNAME, and modify
getNAME so that it can read from stdin, with
the section number explicitly specified via a ``use
stdin, and use this section number'' switch. A minor
variation: ``use stdin, and act as if you are using
this file name'' -- this avoids an extra sed per man
page to grab the section number.
3. Modify getNAME to use zlib to deal with gzip'd files
natively. Unfortunately, the current construction
of getNAME, which assumes an freopen'd stdin for
everything, makes it difficult -- a gznewman(),
gzoldman(), etc. would all need to be written.
4. Modify getNAME to use popen() when it sees a .gz or
.Z file. Do the appropriate magic to grab the
section number, etc.
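Options 1, 2 and 4 each need the uncompressed name or the section number derived from the compressed file name. A minimal sketch, assuming the section is the single digit just before the .gz or .Z suffix (which is what the egrep pattern in makewhatis selects); the function names strip_compress_suffix and section_of are hypothetical:

```sh
# strip_compress_suffix FILE
# Print FILE's base name without a trailing .gz or .Z suffix.
strip_compress_suffix() {
    basename "$1" | sed -e 's/\.gz$//' -e 's/\.Z$//'
}

# section_of FILE
# Print the section digit of a (possibly compressed) man page name,
# or nothing if the name does not end in .<digit> after the suffix is removed.
section_of() {
    strip_compress_suffix "$1" | sed -n -e 's/.*\.\([1-9]\)$/\1/p'
}
```

For example, strip_compress_suffix mail.1.gz prints mail.1 and section_of mail.1.gz prints 1.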
Option 3 would require the most extensive changes, but
would also provide the best performance (fewest forks, as
xargs could be used to bundle a whole bunch of files in
one invocation, and also best CPU performance because the
whole file won't be gunzip'd, only up to what is needed,
which is usually just the first dozen lines or so).
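The point about not needing the whole file can be tried out in pipeline form without touching getNAME. This is only a sketch, not the zlib approach of option 3: it assumes source where the NAME section begins at a ".SH NAME" or ".Sh NAME" line, and name_section is a hypothetical helper. Once awk exits at the next section header, gzip gets a write error on the closed pipe (if it still has output pending) and stops as well.

```sh
# name_section FILE
# Print the body of the NAME section of a gzipped *roff man page,
# and stop reading as soon as the following section header is seen.
name_section() {
    gzip -dc "$1" | awk '
        /^\.S[Hh]/ {
            if (seen) exit              # next section header: nothing more needed
            if ($2 == "NAME") seen = 1  # entering the NAME section
            next
        }
        seen { print }
    '
}
```

This still costs a gzip and an awk per man page, so it addresses only the CPU side of the problem, not the number of forks.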
The first option can be done simply by changing the
makewhatis script. I implemented the first option, and
the running time decreased to:
151.453u 358.774s 8:56.29 95.1% 0+0k 3869+8427io 23pf+0w
However, all that gzcat'ing of every man page to /tmp
definitely keeps the disk busy, and dealing with all
the gzipped *roff source still took 6 of the ~9 minutes
of elapsed (real) time.
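For reference, the two timing lines above can be compared with a couple of throwaway helpers. The names to_seconds and ratio are just for illustration; the input numbers are copied from the timing lines.

```sh
# to_seconds MM:SS.ss
# Convert an elapsed time such as 37:02.59 to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ printf "%.2f\n", $1 * 60 + $2 }'
}

# ratio BEFORE AFTER
# Print BEFORE / AFTER to one decimal place.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}
```

to_seconds 37:02.59 gives 2222.59 and to_seconds 8:56.29 gives 536.29, so elapsed time improved by a factor of about 4.1 (ratio 2222.59 536.29), while user CPU time dropped by a factor of about 11.2 (ratio 1692.169 151.453).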
One minor point: the output from the current version and
the version using the patch below is slightly different,
due to whitespace differences (e.g.
``boot (8) - bootstrapping'' versus
``boot (8) - bootstrapping''). These could be fixed via
yet another sed script at the very end. There are also a
few other oddities that I'm sure the makewhatis
gurus could track down quicker than me!
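The final cleanup pass might look something like the following. It is only a guess at what is needed: it assumes the difference is nothing more than runs of blanks versus single spaces, and normalize_ws is a made-up name.

```sh
# normalize_ws
# Filter: squeeze each run of spaces/tabs to a single space and drop
# trailing blanks, so lines that differ only in spacing compare equal.
normalize_ws() {
    sed -e 's/[[:space:]][[:space:]]*/ /g' -e 's/ $//'
}
```

It would sit at the end of the pipeline that writes the whatis database, after the entries from all three loops have been collected.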
As mentioned above, this patch doesn't completely solve
the problem -- there are still whitespace issues and some
weirdness, and it is not very efficient (just more
efficient than the current method). But it's a start.
--- makewhatis Mon Jun 1 07:49:45 1998
+++ makewhatis.mine Wed Jun 24 01:04:00 1998
@@ -37,9 +37,12 @@
egrep '\.[1-9].(gz|Z)$' $LIST | while read file
do
- gzip -fdc $file | nroff -man | \
- sed -n -f $MKWHATIS;
-done >> $TMP
+ newname=/tmp/`basename $file | sed -e 's/\.gz$//' -e 's/\.Z$//'`
+ gzip -fdc $file > $newname
+ /usr/libexec/getNAME $newname
+ rm $newname
+ #gzip -fdc $file | nroff -man | sed -n -f $MKWHATIS;
+done | sed -e 's/ [a-zA-Z0-9]* \\-/ -/' >> $TMP
egrep '\.0$' $LIST | while read file