Subject: LFS vnlock: lfs_cleanerd in realloc: pointer to wrong page
To: None <tech-kern@NetBSD.org>
From: Edgar Fuß <efnbl06@bn2.maus.net>
List: tech-kern
Date: 05/09/2007 17:57:59
Any LFS experts round here?

I've made some progress analyzing the vnlock lockups I experienced on LFS.
The initial culprit seems to be lfs_cleanerd, which ceases to clean (more on
that below).
This causes another process to wait for lfs_reserve with a vnlock held on
the LFS's root inode (just a few lines below an XXX comment that this is
probably a bad idea).
Then, lots of further processes lock up in vnlock.

I was mostly able to recover from this situation by:
-- killing the process waiting for lfs_reserve
-- killing the cleaner (with ABRT, so I have a core dump)
-- manually starting a new cleaner
However, even after unmounting the LFS partition and fsck'ing it (with no
errors), the cleaner now complains:
 sumsum magic number bad: read 33393536 expected <something else>
 data checksum bad: ...

I ktruss'ed the cleaner that had ceased to work, and that revealed messages
 in realloc: pointer to wrong page
that showed up nowhere else.
I have a core dump of that cleaner, so I'm able to investigate it.

Apart from the obvious question (what's wrong with the cleaner):
-- do I have to worry about the bad magic number and checksum?
-- what about the #if 0'ed release of vnlocks prior to sleeping on lfs_reserve?

However, when I discovered the problem this morning I never thought I could
get away without rebooting the machine. I also never thought I was going to
investigate kernel locking issues some day.