netbsd-bugs: bin/5646: makewhatis could be smarter w.r.t. gz

Subject: bin/5646: makewhatis could be smarter w.r.t. gz
To: None <gnats-bugs@gnats.netbsd.org>
From: None <bgrayson@ece.utexas.edu>
List: netbsd-bugs
Date: 06/24/1998 01:14:55
>Number:         5646
>Category:       bin
>Synopsis:       makewhatis could be smarter w.r.t. gzipped *roff source
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    bin-bug-people (Utility Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Jun 23 23:20:00 1998
>Last-Modified:
>Originator:     Brian Grayson
>Organization:
	Parallel and Distributed Systems
	Electrical and Computer Engineering
	The University of Texas at Austin
>Release:        June 22, 1998
>Environment:
NetBSD k9.ece.utexas.edu 1.3F NetBSD 1.3F (K9.new) #23: Mon Jun 22 23:26:39 CDT 1998 root@snowy.ece.utexas.edu:/home/src/sys/arch/i386/compile/K9.new i386

>Description:
      Background:
	makewhatis deals ``intelligently'' with uncompressed
	*roff source by passing several file names via xargs to
	/usr/libexec/getNAME, thus reducing the number of
	processes/forks to substantially less than one per man page.
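
	As a rough illustration of why the xargs batching matters,
	the sketch below counts invocations with and without it.
	The page names are made up, and `sh -c 'echo run'` is only
	a stand-in for /usr/libexec/getNAME, which is not called
	here.

```shell
#!/bin/sh
# Ten made-up man page names, one per line, roughly as $LIST would hold them.
names() {
    i=1
    while [ "$i" -le 10 ]; do
        echo "page$i.1"
        i=$((i + 1))
    done
}

# One process per file: xargs -n 1 starts the stand-in command 10 times.
per_file=$(names | xargs -n 1 sh -c 'echo run' sh | wc -l | tr -d ' ')

# Batched: xargs packs all the names into a single invocation here.
batched=$(names | xargs sh -c 'echo run' sh | wc -l | tr -d ' ')

echo "per-file invocations: $per_file"
echo "batched invocations:  $batched"
```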

	It also deals semi-intelligently with already-*roff'd
	output (cat*), although it still uses 1 or 2 processes
	per man page; that's another fish to fry.

      Problem:
	A MAJOR deficiency in the current version is its handling
	of gzipped *roff source files.  It gzcat's the file, then
	pipes it to nroff -man (thus converting the _whole_
	document), and then uses sed to grab the interesting
	info, which is usually at the very top of the document.

	I use the MANZ option, and running makewhatis took:
	1692.169u 516.891s 37:02.59 99.3% 0+0k 4109+1046io 20pf+0w
	on a P-90 with everything local.  From timings, doing
	all those nroff's and sed's was responsible for more
	than 90% of the user CPU time.

	There are several different ways to improve makewhatis
	to handle gzipped *roff source more efficiently:
	1.  gzcat the source to a /tmp file with the correct name
	    (mail.1.gz --> /tmp/mail.1 -- use sed to do
	    this conversion), then run getNAME on that file,
	    and pipe the result through the sed '\\-' cleanup script.
	2.  gzcat the file to stdin of getNAME, and modify
	    getNAME so that it can read from stdin, with
	    the section number explicitly specified via a ``use
	    stdin, and use this section number'' switch.  A minor
	    variation:  ``use stdin, and act as if you are using
	    this file name'' -- this avoids an extra sed per man
	    page to grab the section number.
	3.  Modify getNAME to use zlib to deal with gzip'd files
	    natively.  Unfortunately, the current construction
	    of getNAME, which assumes an freopen'd stdin for
	    everything, makes this difficult -- a gznewman(),
	    gzoldman(), etc. would all need to be written.
	4.  Modify getNAME to use popen() when it sees a .gz or
	    .Z file.  Do the appropriate magic to grab the
	    section number, etc.
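
	The file-name handling that options 1 and 2 rely on could
	look roughly like the sketch below.  strip_z and section_of
	are hypothetical helper names, not part of makewhatis or
	getNAME, and the section is assumed to be the single digit
	that follows the last dot once the .gz or .Z suffix is gone.

```shell
#!/bin/sh
# strip_z and section_of are hypothetical helpers for illustration only.

# mail.1.gz -> mail.1, compress.1.Z -> compress.1
strip_z() {
    basename "$1" | sed -e 's/\.gz$//' -e 's/\.Z$//'
}

# mail.1.gz -> 1 (the digit after the last dot of the stripped name)
section_of() {
    strip_z "$1" | sed -e 's/.*\.\([1-9]\)[^.]*$/\1/'
}

strip_z /usr/share/man/man1/mail.1.gz
section_of /usr/share/man/man8/boot.8.Z
```

	Anchoring the suffix patterns with `$` and escaping the dot
	keeps the substitution from touching a "gz" that happens to
	appear elsewhere in the name.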

	Option 3 would require the most extensive changes, but
	would also provide the best performance:  the fewest forks
	(xargs could bundle a whole bunch of files into one
	invocation) and the least CPU time (each file would be
	gunzip'd only as far as needed, which is usually just the
	first dozen lines or so).
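
	The "only up to what is needed" idea can be approximated
	in plain shell, without zlib, by letting the reader quit
	early.  This is only a sketch:  name_line is a hypothetical
	helper, it handles only old-style pages whose name line
	directly follows `.SH NAME`, and the demo page contents are
	made up.

```shell
#!/bin/sh
# name_line is a hypothetical helper: print the line that follows
# ".SH NAME" and stop reading, so the rest of the page is never parsed.
name_line() {
    gzip -dc "$1" | sed -n '/^\.SH NAME/{n;p;q;}'
}

# Demo on a throwaway page (contents invented for illustration).
demo_dir=$(mktemp -d)
printf '.TH MAIL 1\n.SH NAME\nmail \\- send and receive mail\n.SH SYNOPSIS\n.B mail\n' |
    gzip -c > "$demo_dir/mail.1.gz"
demo_out=$(name_line "$demo_dir/mail.1.gz")
rm -r "$demo_dir"
echo "$demo_out"
```

	This still costs a gzip and a sed process per page, so it
	addresses the CPU side of the argument but not the fork
	count.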

	The first option can be done simply by changing the
	makewhatis script.  I implemented the first option, and
	the running time decreased to:
	151.453u 358.774s 8:56.29 95.1% 0+0k 3869+8427io 23pf+0w

	However, all that gzcat'ing of every man page to /tmp
	definitely keeps the disk busy, and dealing with all
	the gzipped *roff source still took 6 of those ~9
	minutes of elapsed time.

	One minor point:  the output from the current version and
	the version using the patch below is slightly different,
	due to whitespace differences (e.g.
	``boot (8) - bootstrapping'' versus
	``boot (8) -  bootstrapping'').  These could be fixed via
	yet another sed script at the very end.  There are also a
	few other oddities that I'm sure the makewhatis gurus
	could track down more quickly than I can!
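
	One possible form for that final whitespace cleanup is
	sketched below.  squeeze_dash is a hypothetical name, and
	the sketch assumes the first " - " on each line is the
	separator between the name and the description.

```shell
#!/bin/sh
# squeeze_dash is a hypothetical filter: collapse any run of spaces
# after the first " -" separator down to a single space.
squeeze_dash() {
    sed -e 's/ -  */ - /'
}

echo 'boot (8) -  bootstrapping' | squeeze_dash
echo 'boot (8) - bootstrapping' | squeeze_dash
```

	Both input forms come out with a single space after the
	dash, so the old and new outputs would compare equal on
	this point.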
	
>How-To-Repeat:
	
>Fix:
	As mentioned above, this patch doesn't completely solve
	the problem -- there are still whitespace issues and some
	weirdness, and it is not very efficient (just more
	efficient than the current method).  But it's a start.
--- makewhatis  Mon Jun  1 07:49:45 1998
+++ makewhatis.mine     Wed Jun 24 01:04:00 1998
@@ -37,9 +37,12 @@
 
 egrep '\.[1-9].(gz|Z)$' $LIST | while read file
 do
-       gzip -fdc $file | nroff -man | \
-       sed -n -f $MKWHATIS;
-done >> $TMP
+       newname=/tmp/`basename $file | sed -e 's/\.gz$//' -e 's/\.Z$//'`
+       gzip -fdc $file  > $newname
+       /usr/libexec/getNAME $newname
+       rm $newname
+       #gzip -fdc $file | nroff -man | sed -n -f $MKWHATIS;
+done | sed -e 's/ [a-zA-Z0-9]* \\-/ -/' >> $TMP
 
 egrep '\.0$' $LIST | while read file
 do
	
>Audit-Trail:
>Unformatted: