Subject: squid now reveals a new kernel problem.
To: NetBSD Kernel Technical Discussion List <tech-kern@NetBSD.ORG>
From: Greg A. Woods <firstname.lastname@example.org>
Date: 10/28/1999 12:15:48
This is NetBSD-1.3.3 with the sys_accept() patch I posted, as well as
other official patches applied. It's running Squid-2.2STABLE5 as a
transparent cache using IP Filter's NAT to gain transparency. I also
have patches to the NAT code installed, but I don't remember whether
they came from the 1.3.x branch or from a newer version of IPF, or
whether they were necessary for transparent caching at all.
At 6:00am we have cron call 'squid -k rotate', which sends a signal to
the primary squid process telling it to rotate its log files. It also
seems to rewrite its cache.state databases.
Ever since trying either the original patch suggested by Darren or my
patch using ffree() instead of closef(), the kernel has been reporting
varying numbers of messages (about 80 yesterday, then only about 9
before the crash) like the following just as squid does its log rotation:
Oct 28 06:00:13 sunset /netbsd: Data modified on freelist: word 3 of object 0xf0d0e4c0 size 56 previous type file (0xdeadbef0 != 0xdeadbeef)
The value of 'type' is any one of 'temp', 'file', 'mbuf', 'VM mapent',
'pcb', or 'LFS segment'; 'size' varies, seemingly in sync with 'type';
the 'word' is always '3'; and the value of the data is almost always
'0xdeadbef0', or rarely '0xdeadbef1'.
This suggests to me that whatever is doing the tromping always sets
that last byte to either F0 or, rarely, F1, but those values don't
mean anything significant to me.
(The only thing I thought was odd was that 'LFS segment' appeared when
this system doesn't even have LFS compiled in, but I now see that
kern/vfs_cluster.c also uses 'M_SEGMENT' to identify memory it
allocates.)
Then eventually (after two 'rotates' in this case) the system panicked:
Oct 28 10:31:31 sunset savecore: reboot after panic: closef: count < 0
(The time shown above is much later than 6:04, when the panic actually
happened, because nobody was able to get to the machine until then...
something about losing the instructions for accessing the remote....)
Note that in the 24 hours between the first "rotate" when the first
kernel malloc() warnings appeared and the second "rotate" when the crash
occurred this machine served over
At this point I'm thinking of doing something drastic, such as just
upgrading the machine to 1.4.1 (perhaps with a -release kernel). I may
even be able to do this without taking the machine out of service for
long because one of its twins is about to be relieved of its current
duties and be turned into a sibling cache anyway. However I really
would like to know what's wrong here, especially so that I can be sure
that an upgrade will really fix it.
Meanwhile I'm going to try ktracing the "rotate" on a test machine,
perhaps peer at some squid code, and of course move the production
"rotate" from 6:00 to sometime when staff are available to watch it and
help the machine recover should it need such help!
Greg A. Woods
+1 416 218-0098 VE3TCP <email@example.com> <robohack!woods>
Planix, Inc. <firstname.lastname@example.org>; Secrets of the Weird <email@example.com>