Subject: Re: Please audit pool use in your code!
To: Thor Lancelot Simon <tls@rek.tjls.com>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: tech-kern
Date: 07/19/2006 19:38:08
On Wed, Jul 19, 2006 at 01:22:59PM -0400, Thor Lancelot Simon wrote:
> Last week I sent a message to this list about UFS_DIRHASH and kernel memory
> corruption.  Since then it has become clear that we have more serious issues,
> at least on the 3.0 branch; removing UFS_DIRHASH has made our systems run
> for significantly longer without crashing, but when they do crash, we see
> the same basic symptom: pool-allocated objects are overwritten with bogus
> data, often leading to a panic when a bad pointer is followed out of such
> a structure.
> 
> One likely cause of this problem is allocation from a pool in interrupt
> context.  Any such allocation *or free* (pool_put/pool_get) *must* be
> protected with spl such that no other code allocating from that pool can
> be entered while the allocation is in progress (e.g. by the same interrupt
> occurring or by another interrupt leading a different code path to allocate
> from that pool).
> 
> A quick grep through src/sys/net for PR_NOWAIT (which is a pretty strong
> hint that the code in question may be reached from an interrupt) found some
> problems in the SACK code, which Kentaro fixed.  However, there is a huge
> amount of code in the kernel which allocates from pools, and some of it
> does so "maybe" from interrupt context (e.g. setting the flags from a
> "waitok" argument to the calling function, so that my grep would not
> have found it).
> 
> I ask all developers to _please_ look at any code in which they have
> used the pool allocator and double-check that any uses of pool_put/pool_get
> which could be reached from interrupt context are bracketed by the correct
> spl/splx calls to block such interrupts.
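
For reference, the bracketing asked for above could be sketched as follows.
This is a userspace mock-up, not kernel code: the mock_* functions and
the item_alloc()/item_free() wrappers are hypothetical stand-ins for
splnet(9)/splx(9) and pool_get(9)/pool_put(9), and the choice of the network
level is an assumption (the right level depends on which interrupts can
reach the pool).

```c
#include <assert.h>
#include <stddef.h>

/* Simulated interrupt priority levels (hypothetical, for illustration). */
#define IPL_NONE 0
#define IPL_NET  1
#define NITEMS   4

static int cur_ipl = IPL_NONE;

/* Mock of splnet(): raise the level, return the previous one. */
static int mock_splnet(void)
{
	int s = cur_ipl;
	if (cur_ipl < IPL_NET)
		cur_ipl = IPL_NET;
	return s;
}

/* Mock of splx(): restore a previously saved level. */
static void mock_splx(int s)
{
	cur_ipl = s;
}

/* A trivial free list standing in for a pool. */
struct item { struct item *next; };
static struct item storage[NITEMS];
static struct item *freelist;

static void mock_pool_init(void)
{
	freelist = NULL;
	for (int i = 0; i < NITEMS; i++) {
		storage[i].next = freelist;
		freelist = &storage[i];
	}
}

/* Raw get/put: the asserts catch callers that forgot to block interrupts. */
static struct item *mock_pool_get(void)
{
	assert(cur_ipl >= IPL_NET);
	struct item *it = freelist;
	if (it != NULL)
		freelist = it->next;
	return it;
}

static void mock_pool_put(struct item *it)
{
	assert(cur_ipl >= IPL_NET);
	it->next = freelist;
	freelist = it;
}

/* The pattern itself: every get and every put is bracketed by spl/splx. */
struct item *item_alloc(void)
{
	int s = mock_splnet();
	struct item *it = mock_pool_get();
	mock_splx(s);
	return it;
}

void item_free(struct item *it)
{
	int s = mock_splnet();
	mock_pool_put(it);
	mock_splx(s);
}
```

After mock_pool_init(), callers would use only item_alloc() and item_free(),
so the free list is never touched with the interrupt unblocked, and the
saved level is restored on every path.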

Do you know which pool gets corrupted? As pools work on full pages, and
items are usually much smaller than a page, it's likely that the item which
gets corrupted is of the same kind as the one that caused the corruption.
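
To illustrate the reasoning, here is a small sketch of the address
arithmetic. It assumes a simplified layout that real pools need not follow:
4096-byte pages, 64-byte items packed from the start of the page, and no
in-page header. All names and sizes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u	/* assumption, not taken from the report */
#define ITEM_SIZE_ASSUMED 64u	/* assumption, not taken from the report */

/* Base address of the page containing addr. */
static uintptr_t page_base(uintptr_t addr)
{
	return addr & ~(uintptr_t)(PAGE_SIZE_ASSUMED - 1);
}

/* Index of the item containing addr within its page. */
static unsigned item_index(uintptr_t addr)
{
	return (unsigned)((addr - page_base(addr)) / ITEM_SIZE_ASSUMED);
}

/*
 * Returns 1 if a write of len bytes (len >= 1) starting at addr runs past
 * the item it started in but stays within the same page, i.e. it damages
 * another item of the same pool.
 */
static int overrun_hits_same_pool(uintptr_t addr, size_t len)
{
	uintptr_t last = addr + len - 1;

	return page_base(last) == page_base(addr) &&
	    item_index(last) != item_index(addr);
}
```

Under these assumptions, a 16-byte overrun of an item in the middle of a
page lands in its neighbour from the same pool; only an overrun of the last
item on a page can reach memory belonging to something else.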

Also, can you give details on the hardware used for the cluster, in case
the issue is in a hardware driver?

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference