tech-kern archive


Re: WAPBL fix for deallocation exhaustion + slow file removal



> Thanks for taking a shot at this!  But I think it needs a little more
> time for review -- certainly I can't digest it in the 24 hours you're
> giving.

Sure.

> From a quick glance at the patch, I see one bug immediately in
> vfs_wapbl.c that must have been introduced in a recent change:
> pool_get(PR_WAITOK) is forbidden while holding a mutex, but
> wapbl_register_deallocation does just that.

I'll recheck this, thank you. I ran it under rump only; I think I had
it compiled with LOCKDEBUG, so it should trigger the same assertions
as a LOCKDEBUG kernel. But apparently not.
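For illustration, here is a minimal userland sketch of the usual fix
pattern for that class of bug: do the potentially sleeping allocation
(pool_get(PR_WAITOK) in the kernel, malloc here) before taking the
mutex, not while holding it. All names here (register_dealloc, wl_mtx,
the limit of 16) are made up for the sketch, not the actual wapbl code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a wapbl deallocation record. */
struct dealloc_rec {
	long	daddr;
	int	len;
};

static pthread_mutex_t wl_mtx = PTHREAD_MUTEX_INITIALIZER;
static struct dealloc_rec *registered[16];
static int nregistered;

/*
 * Allocate outside the lock (may sleep), then take the mutex only
 * for the short critical section that publishes the record.
 */
int
register_dealloc(long daddr, int len)
{
	struct dealloc_rec *rec;

	rec = malloc(sizeof(*rec));	/* blocking alloc: lock NOT held */
	if (rec == NULL)
		return -1;
	rec->daddr = daddr;
	rec->len = len;

	pthread_mutex_lock(&wl_mtx);
	if (nregistered >= 16) {	/* transaction full */
		pthread_mutex_unlock(&wl_mtx);
		free(rec);
		return -1;
	}
	registered[nregistered++] = rec;
	pthread_mutex_unlock(&wl_mtx);
	return 0;
}
```

The cost is that the allocation may turn out to be unneeded (the
full-transaction case), in which case it is simply freed again.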

> What happens if wapbl_register_deallocation fails in the middle of
> ffs_truncate, and then the power goes out before the caller retries
> the truncation?  It looks like we may end up committing a transaction
> to disk that partially zeros a file -- and it is not immediately clear
> to me who will be responsible for freeing the remaining blocks when
> the system comes back up.

The indirect disk blocks also go via the journal, so if power fails
while the journal is not yet committed, it works just the same as
usual, i.e. no change is visible on the filesystem. If the truncate
fails once, the partial truncate would be committed; if power then
fails before the truncate is completely finished, the log replay would
replay just that first part, creating the hole.
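To make the replay semantics concrete, here is a toy model (not the
wapbl replay code; struct names and the 8-block filesystem are
invented): replay applies only committed transactions, so an
uncommitted truncate leaves the filesystem untouched, while a
committed partial truncate is replayed as-is, producing the hole:

```c
#include <assert.h>

#define NBLKS 8

/* Toy disk: 1 = block allocated, 0 = hole. */
struct fs {
	int	blk[NBLKS];
};

/* One journal transaction: blocks to free, plus a commit flag. */
struct txn {
	int	freeblk[NBLKS];
	int	nfree;
	int	committed;
};

/*
 * Log replay: an uncommitted transaction is ignored entirely
 * (power failed before commit), a committed one is applied fully.
 */
static void
replay(struct fs *fs, const struct txn *t)
{
	int i;

	if (!t->committed)
		return;		/* no change visible on the filesystem */
	for (i = 0; i < t->nfree; i++)
		fs->blk[t->freeblk[i]] = 0;
}
```

So in either outcome the on-disk state is a state the filesystem
actually went through; the partial truncate is merely incomplete, not
inconsistent.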

> I don't understand why the possible failure in the fragment extension
> code should not be worrying.  `Only 7 times during a file's lifetime'
> says to me that if I check out and delete the NetBSD CVS repository
> once, there are about two million times this can happen.  If we can't
> restart the transaction safely, we need to find another way to do it.

The fragment deallocation is only registered when a file is extended
from a size less than a full block, and only when there is no free
block available right next to the fragment. This can't happen more
than once per data write operation, hence once per
wapbl_begin()/wapbl_end() block.

The two million file creates are all in different vnode calls, each
call checking the limits in wapbl_start() and flushing if necessary.
It would be extremely difficult to trigger a transaction overflow
there.

The deletions won't cause any problem, since on that path we use the
!force path, so they can't overflow the transaction.
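A sketch of what I mean by the !force behavior (again invented names
and an invented cap of 4, not the real wapbl interface): a !force
registration that would exceed the per-transaction dealloc limit just
gets an error back, so the caller can end the transaction, flush, and
retry, and the limit is never exceeded on that path:

```c
#include <assert.h>
#include <errno.h>

#define DEALLOC_LIMIT 4		/* hypothetical per-transaction cap */

struct txn {
	int	ndealloc;
};

/*
 * !force callers (e.g. remove/unlink) see EAGAIN when the
 * transaction is full and retry after a flush; a force caller is
 * expected to have guaranteed room beforehand.
 */
static int
register_dealloc(struct txn *t, int force)
{
	if (t->ndealloc >= DEALLOC_LIMIT && !force)
		return EAGAIN;	/* caller flushes and retries */
	t->ndealloc++;
	return 0;
}
```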

> Can you say anything about alternatives you might have considered that
> don't entail adding undo/restart logic to every user of UFS_TRUNCATE
> and why they were unfit?

Right now every caller of UFS_TRUNCATE() does it implicitly, via
ufs_truncate(). With WAPBL that function currently tries to avoid a
big transaction by truncating on indirect block boundaries, but that
fails to account for bigger directories and is of course slow.

There is an alternative patch in kern/49175, which merely changes
ufs_truncate() to check the number of real blocks and adjust the
truncate size so the transaction fits within that limit. That patch
however locks the transaction exclusively in order to prevent parallel
operations from exhausting the dealloc limit, and I don't like how it
moves logic about allocated blocks away from ffs_alloc/ffs_inode.
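The core idea of that alternative can be sketched like this (my
simplified reading of it, with made-up names and constants, and
ignoring indirect blocks and fragments entirely): clamp each truncate
step so at most DEALLOC_LIMIT blocks are freed per transaction, and
loop until the target size is reached:

```c
#include <assert.h>

#define BLKSIZE		4096
#define DEALLOC_LIMIT	128	/* hypothetical per-transaction cap */

/*
 * Return the size to truncate to in this transaction: the final
 * target if it fits, otherwise an intermediate size freeing exactly
 * DEALLOC_LIMIT blocks. The caller loops until cursize == newsize.
 */
static long
truncate_step(long cursize, long newsize)
{
	long curblks = (cursize + BLKSIZE - 1) / BLKSIZE;
	long newblks = (newsize + BLKSIZE - 1) / BLKSIZE;

	if (curblks - newblks <= DEALLOC_LIMIT)
		return newsize;				/* fits in one txn */
	return (curblks - DEALLOC_LIMIT) * BLKSIZE;	/* partial step */
}
```

The real patch has to count actual allocated blocks (holes, indirect
blocks), which is exactly the ffs_alloc/ffs_inode knowledge I'd rather
not duplicate in ufs_truncate().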

I considered another alternative with dealloc pre-registration, which
also avoids the exclusive lock. But that would still require partial
truncate support similar to my patch in order to avoid stalls when
there are several parallel requests, so basically it doesn't bring any
extra value. Pre-registration also requires some accounting, hence
extra code, so it looked simpler to go without.

Jaromir

