netbsd-bugs: Re: kern/33076: reproducable pool free list corruption

Subject: Re: kern/33076: reproducable pool free list corruption
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Martin Husemann <martin@duskware.de>
List: netbsd-bugs
Date: 03/16/2006 14:05:06

The following reply was made to PR kern/33076; it has been noted by GNATS.

From: Martin Husemann <martin@duskware.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/33076: reproducable pool free list corruption
Date: Thu, 16 Mar 2006 15:04:25 +0100

 Ok, with help from Frank van der Linden and Chuck Silvers I have examined
 this a bit more.

 The original problem happened with an SMP kernel, and core dumps there are
 quite fragile, so I never managed to get one.

 One suspicion was that pool operations on mbpool would happen without proper
 IPL - so I added a panic in pool_put and pool_get that would trigger if the
 pool was mbpool and current protection level < IPL_VM. This did not fire.
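
 A minimal userland sketch of the kind of check described above. This is not
 the actual patch to subr_pool.c; every name here (sim_pool, sim_mbpool,
 sim_cur_ipl, SIM_IPL_VM) is a hypothetical stand-in for the kernel's real
 state, and the predicate returns false where the kernel would panic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel state; these are NOT the NetBSD definitions. */
enum sim_ipl { SIM_IPL_NONE = 0, SIM_IPL_VM = 1 };

struct sim_pool {
    int unused;                     /* placeholder member, contents irrelevant here */
};

struct sim_pool sim_mbpool;         /* plays the role of the mbuf pool */
struct sim_pool sim_otherpool;      /* any other pool; never checked */
enum sim_ipl sim_cur_ipl = SIM_IPL_NONE;   /* simulated current protection level */

/*
 * Returns true if the access is acceptable, false in the case where the
 * debugging check would fire: the pool is the mbuf pool and the current
 * level is below the simulated IPL_VM.
 */
bool sim_pool_ipl_ok(const struct sim_pool *pp)
{
    assert(pp != NULL);
    if (pp != &sim_mbpool)
        return true;
    return sim_cur_ipl >= SIM_IPL_VM;
}
```

 In this sketch the same predicate would be evaluated on entry to both the
 get and the put path.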

 During testing, a second variant of the corruption occurred, in the form of a
 kernel page fault inside pool_prime_page (called from pool_get). The pointer
 dereferenced was 0xffffffffffff. So I added options QUEUEDEBUG, and this
 catches the same corruption slightly earlier at the LIST_INSERT_HEAD.
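
 To illustrate why a list consistency check can catch this kind of damage
 before the bad pointer is dereferenced, here is a small userland sketch. It
 is an assumption about the general technique, not a copy of the QUEUEDEBUG
 code in sys/queue.h; the types and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical list node: forward pointer plus a back pointer to the
 * pointer that references this node. */
struct node {
    struct node *next;
    struct node **prevp;
};

struct list_head {
    struct node *first;
};

/* The first element, if any, must point back at the head's "first" slot. */
bool head_is_consistent(struct list_head *h)
{
    return h->first == NULL || h->first->prevp == &h->first;
}

/*
 * Insert n at the head of h, but only after validating the head.
 * Returns false (where a debugging kernel would panic) if the list
 * already looks corrupted; the list is left untouched in that case.
 */
bool checked_insert_head(struct list_head *h, struct node *n)
{
    assert(h != NULL && n != NULL);
    if (!head_is_consistent(h))
        return false;
    n->next = h->first;
    if (h->first != NULL)
        h->first->prevp = &n->next;
    h->first = n;
    n->prevp = &h->first;
    return true;
}
```

 Such a check only reports that the list is already damaged at insert time;
 it says nothing about which earlier write did the damage.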

 Still, it does not show where the corruption occurs.

 Finally, I repeated the experiment with a uniprocessor kernel and had the
 same result - this time, however, I was able to get a crash dump (on the
 second try, so I'm not sure how correct the backtrace in it will be).

 I've uploaded all relevant pieces to

   ftp.netbsd.org:/pub/NetBSD/misc/martin/crash

 There is the kernel config file (MARTINS.UP), the kernel core and netbsd.gdb,
 as well as the small patch to subr_pool.c that I used to verify the
 pool_put/pool_get protection level.

 If I had to guess, I would say something in the network stack is writing
 a 0xffffffffffffffff somewhere out of bounds.

 Martin