Re: FS corruption because of bufio_cache pool depletion?

On Tue, Jan 26, 2010 at 03:32:23PM +0100, Manuel Bouyer wrote:
> Can you give more details on the corruption ?
> Was it only directory entries that were corrupted, or did you notice
> corruptions in the data block too ?

I was seeing corruption in data blocks too.  That's what I meant when I
mentioned the corrupt CVS/Root files.  Fsck complained about directories that
were corrupted right at the start of the data block.  I don't think I saved
the error messages, but "." and ".." were corrupt or missing.

I have a netbsd-3/Xen 2 based server that runs on the same hardware and we
have seen FS corruption in a particular domU on that system that seems to be
related to the file system running out of space.  That's what the co-admin
running that domU tells me anyway.  But I haven't seen the damage or the
error messages in the domU personally.

> > raid1: IO failed after 5 retries.
> > cgd1: error 5
> > xbd IO domain 1: error 5
> It seems raidframe doesn't do anything special for memory failure.

Greg tells me that raidframe does retry several times.  And the above error
indicates that it retried 5 times.

Note that I only got the above message exactly once.  But the pool stats
indicated several hundred allocation failures.

I am contemplating collecting stack traces when getiobuf can't get a buf from
the pool and maybe checking that it does always get a buf when it is called
with waitok==true.
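
As a rough illustration of that kind of check, here is a user-space sketch in
C.  Everything in it is an assumption made for the example: mock_pool_get(),
mock_pool_put() and checked_get() are hypothetical names, the fixed-size array
only stands in for the real pool, and nothing here calls the kernel's
getiobuf.  It only shows the bookkeeping: count failures separately for
waiting and non-waiting requests, and log the caller each time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a small buffer pool; not the kernel's pool. */
#define MOCK_POOL_SIZE 4

struct mock_buf {
	bool in_use;
};

static struct mock_buf mock_pool[MOCK_POOL_SIZE];
static unsigned long nowait_failures;	/* failed requests with waitok == false */
static unsigned long waitok_failures;	/* failed requests with waitok == true */

/* Return a free buffer, or NULL when the mock pool is depleted. */
static struct mock_buf *
mock_pool_get(void)
{
	for (size_t i = 0; i < MOCK_POOL_SIZE; i++) {
		if (!mock_pool[i].in_use) {
			mock_pool[i].in_use = true;
			return &mock_pool[i];
		}
	}
	return NULL;
}

/* Give a buffer back to the mock pool. */
static void
mock_pool_put(struct mock_buf *bp)
{
	bp->in_use = false;
}

/*
 * Instrumented wrapper (hypothetical).  A real waitok request would sleep
 * until a buffer is freed; the mock cannot sleep, so it just records the
 * failure.  A non-zero waitok_failures is the condition worth catching.
 */
static struct mock_buf *
checked_get(bool waitok, const char *caller)
{
	struct mock_buf *bp = mock_pool_get();

	if (bp == NULL) {
		if (waitok)
			waitok_failures++;
		else
			nowait_failures++;
		fprintf(stderr, "checked_get: pool empty, waitok=%d, caller=%s\n",
		    (int)waitok, caller);
	}
	return bp;
}
```

In a kernel the fprintf() would be replaced by whatever stack trace facility
is available, and the caller string by the trace itself.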

I wonder if the b_iodone issues you are investigating have an impact on this.


