current-users: Re: Problems with NetBSD 1.4.1+ on Sparc 5

Subject: Re: Problems with NetBSD 1.4.1+ on Sparc 5
To: Charles M. Hannum <root@ihack.net>
From: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
List: current-users
Date: 12/28/1999 16:19:05
	Hello.  I've been having a long-standing problem with NetBSD on the
Sparc architecture, which should be archived in the port-sparc archives.
However, my search for the cure has led me to a problem that might affect
all ports.  Thus, I am forwarding my description of the problem to the
NetBSD community in the hope that someone has seen the problem
elsewhere, and to give non-Sparc users a heads up that there
might be a problem in the machine-independent portion of the code.
For those who are interested, I will happily provide the previous mail,
which describes my problem in more detail, including tracebacks and the
like.

	OK.  I've narrowed the problem further, but my limited understanding of
what runs concurrently in the kernel is slowing my progress.  I believe
there is a race condition in the kernel whenever a new inode is created in
an FFS filesystem on Sparc machines.  I do not know whether this problem
extends to architectures other than the Sparc, but on 1.4.1 and 1.4.2, if
many inodes are being created and destroyed simultaneously on an FFS
filesystem, as might happen on a news server, there is some sort of race:
when ffs_valloc() calls ffs_hashalloc() with the allocator pointed at
ffs_nodealloccg(), a memory exception occurs just at the point where
ffs_nodealloccg() would call skpc() for the first time.  If inodes are
created on a quiescent system, all works fine.  The failure isn't entirely
predictable, but I have a system now where it occurs quite consistently.
Also, because it only happens when ffs_nodealloccg() is used to allocate a
new inode, the problem doesn't occur immediately on a new filesystem.  As
inodes are allocated and freed, and the free inodes become fragmented
across the cylinder groups, or at least as the likelihood of the preferred
inode being free drops, the conditions causing this panic become more and
more prevalent.
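
	To make the allocation step above concrete, here is a minimal user-space
sketch of the kind of bitmap scan that ffs_nodealloccg() hands to skpc().
It is not the kernel source: model_skpc() and model_find_free_inode() are
hypothetical names, the bitmap is a plain byte array rather than a cylinder
group buffer, and the rotor handling is simplified.  It only shows that the
scan walks memory through a pointer and a length, so a bad value in either
would plausibly fault at this point.

```c
#include <assert.h>
#include <stddef.h>

#define NBBY 8	/* bits per byte, as in the BSD headers */

/*
 * Simplified stand-in for skpc() (assumption: models its contract only).
 * Skip leading bytes equal to `mask` and return how many bytes remain,
 * counting the first mismatching byte.  Returns 0 if every byte matched.
 */
static size_t
model_skpc(unsigned char mask, size_t size, const unsigned char *cp)
{
	const unsigned char *end = cp + size;

	while (cp < end && *cp == mask)
		cp++;
	return (size_t)(end - cp);
}

/*
 * Hypothetical helper: find a free inode in a used-inode bitmap
 * (bit set = in use), scanning from the byte that contains `rotor`
 * and wrapping around once.  Returns the inode index, or -1 if full.
 */
static long
model_find_free_inode(const unsigned char *map, size_t ninodes, size_t rotor)
{
	size_t nbytes = ninodes / NBBY;
	size_t start, len, loc, byte;
	int bit;

	assert(ninodes % NBBY == 0 && rotor < ninodes);

	start = rotor / NBBY;
	len = nbytes - start;
	loc = model_skpc(0xff, len, &map[start]);
	if (loc == 0) {
		/* Nothing free from the rotor onward; rescan the bytes before it. */
		len = start;
		start = 0;
		loc = model_skpc(0xff, len, &map[0]);
		if (loc == 0)
			return -1;
	}
	byte = start + len - loc;	/* first byte with a clear bit */
	for (bit = 0; bit < NBBY; bit++)
		if ((map[byte] & (1u << bit)) == 0)
			return (long)(byte * NBBY + bit);
	return -1;	/* not reached: map[byte] is known not to be 0xff */
}
```

	With a four-byte map of 0xff, 0xff, 0xfb, 0xff (32 inodes), the only free
inode is number 18, and the sketch finds it whether the rotor points before
or after it.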
	Could someone explain to me how the locking mechanism works for inode
allocation/deallocation in the FFS filesystem?  I don't see many sleeps in
the FFS or UFS code, but I don't know what steps are taken to keep the
bottom half of the kernel from corrupting the top half's structures.
	If someone could suggest a document to read that might help with this, or
if someone could suggest a fix, that would be great.
I'm nearly at my wit's end on this machine, and I think this has been my
problem since 1.3, though I couldn't pinpoint it this clearly before.

Any help would be greatly appreciated.
-thanks
-Brian