NetBSD-Bugs archive


kern/59885: zfs: unlink/rm is slow to delete last link because it always zil_commits



>Number:         59885
>Category:       kern
>Synopsis:       zfs: unlink/rm is slow to delete last link because it always zil_commits
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jan 03 21:50:00 +0000 2026
>Originator:     Taylor R Campbell
>Release:        current, 11, 10, 9, ...
>Organization:
The NetZFS Slowdowndation, Inc.
>Environment:
>Description:

	We have a local change to zfs to commit the zil whenever a zfs
	vnode is reclaimed, which usually happens when deleting the
	last link to a file:

	/*
	 * Operation zfs_znode.c::zfs_zget_cleaner() depends on this
	 * zil_commit() as a barrier to guarantee the znode cannot
	 * get freed before its log entries are resolved.
	 */
	if (zfsvfs->z_log)
		zil_commit(zfsvfs->z_log, zp->z_id);

	This is required because the logic to write data for the file
	to the log (zfs_get_data), queued up by past writes to the
	file, relies on acquiring a reference to the vnode as indexed
	by the zfs object id (equivalent of inode number) and using the
	struct znode before it is freed.

	That logic might run after the vnode has gone through
	VOP_RECLAIM, but NetBSD's vnode life cycle treats reclamation
	as final and forbids acquiring new references or even looking
	up the vnode by its object id via vcache(9), and
	zfs_netbsd_reclaim unconditionally frees the struct znode with
	zfs_znode_free immediately afterward (all paths out of
	zfs_zinactive go, either directly or via zfs_rmnode and
	sometimes then via zfs_znode_delete, through zfs_znode_free):

	if (zp->z_sa_hdl == NULL)
		zfs_znode_free(zp);
	else
		zfs_zinactive(zp);

	Committing the zil first avoids this trouble.  But committing
	the zil is costly (requires writing all pending transactions to
	disk and flushing the disk cache and updating root pointers and
	so on), much costlier than just logging a file operation like
	unlink.

	And because removing the last link to a file causes its vnode
	to be reclaimed synchronously, essentially every rm(1) or
	equivalent (in the absence of multiple hard links to a file)
	triggers this logic, making it very slow.

>How-To-Repeat:

	rm -rf /large/directory/tree
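
	To put a number on it, something like the following rough
	sketch can time the unlink loop alone.  The name time_unlinks
	and the f%d file naming are made up for illustration; it only
	measures whatever file system the given directory lives on, so
	it says something about zfs only when pointed at a zfs dataset.

```c
#define _POSIX_C_SOURCE 200809L

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: create nfiles empty files under dir, then
 * unlink them all.  Returns the seconds spent in the unlink loop
 * only, or -1.0 on any error.
 */
static double
time_unlinks(const char *dir, int nfiles)
{
	char path[1024];
	struct timespec t0, t1;
	int i, fd;

	/* Each file has a single link, so unlink removes the last one. */
	for (i = 0; i < nfiles; i++) {
		snprintf(path, sizeof(path), "%s/f%d", dir, i);
		fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0600);
		if (fd == -1)
			return -1.0;
		close(fd);
	}

	if (clock_gettime(CLOCK_MONOTONIC, &t0) == -1)
		return -1.0;
	for (i = 0; i < nfiles; i++) {
		snprintf(path, sizeof(path), "%s/f%d", dir, i);
		if (unlink(path) == -1)
			return -1.0;
	}
	if (clock_gettime(CLOCK_MONOTONIC, &t1) == -1)
		return -1.0;

	return (double)(t1.tv_sec - t0.tv_sec) +
	    (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

	Dividing the result by nfiles gives a per-unlink cost that can
	be compared between a zfs directory and, say, a tmpfs one.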

>Fix:

	It was a huge improvement to the reliability and
	maintainability of NetBSD's vnode life cycle that we began to
	forbid reviving vnodes from the dead about a decade ago, so
	reversing that decision is a non-starter.

	But perhaps we can add a reference count to the znode itself,
	when it is pending in a log transaction for zfs_get_data later,
	so that it is only freed after both zfs_netbsd_reclaim _and_
	zfs_get_data are done with it.  Note that the only use that
	zfs_get_data makes of the _vnode_ is to release a reference
	(which we have made into a no-op because it is taking that
	`reference' only during reclamation when acquiring new vnode
	references is forbidden).
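
	A minimal userland sketch of the idea, not actual zfs code:
	every toy_* name here is hypothetical, struct toy_znode stands
	in for struct znode, and the real thing would have to fit in
	with the existing znode locking.  The point is only that the
	structure is freed by whichever of reclaim and get_data
	finishes last, with no zil_commit needed as a barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct znode; not the real structure. */
struct toy_znode {
	atomic_uint	tz_refcnt;	/* 1 for the vnode + 1 per pending log record */
	bool		*tz_freed;	/* set on free so the caller can observe it */
};

static struct toy_znode *
toy_znode_alloc(bool *freed)
{
	struct toy_znode *zp = malloc(sizeof(*zp));

	if (zp == NULL)
		return NULL;
	atomic_init(&zp->tz_refcnt, 1);	/* the vnode's reference */
	zp->tz_freed = freed;
	*freed = false;
	return zp;
}

/* Drop one reference; the last one out frees the structure. */
static void
toy_znode_rele(struct toy_znode *zp)
{
	if (atomic_fetch_sub(&zp->tz_refcnt, 1) == 1) {
		*zp->tz_freed = true;
		free(zp);
	}
}

/* Logging a write takes a hold on behalf of the later get_data. */
static void
toy_log_write(struct toy_znode *zp)
{
	atomic_fetch_add(&zp->tz_refcnt, 1);
}

/* Stand-in for reclaim: drops the vnode's reference, no commit. */
static void
toy_reclaim(struct toy_znode *zp)
{
	toy_znode_rele(zp);
}

/* Stand-in for get_data: uses the znode, then drops the log's hold. */
static void
toy_get_data(struct toy_znode *zp)
{
	assert(!*zp->tz_freed);
	toy_znode_rele(zp);
}

/* Returns 0 if the znode survives reclaim and is freed by get_data. */
static int
toy_demo(void)
{
	bool freed;
	struct toy_znode *zp = toy_znode_alloc(&freed);

	if (zp == NULL)
		return -1;
	toy_log_write(zp);
	toy_reclaim(zp);
	if (freed)
		return 1;	/* freed too early */
	toy_get_data(zp);
	return freed ? 0 : 2;
}
```

	The same code frees the znode in toy_reclaim instead when
	toy_get_data happens to run first, which is the usual case
	when nothing is pending in the log.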

	This will also require making sure the object id is not
	recycled too early -- possibly by some combination of
	ZFS_OBJ_HOLD_ENTER and ZFS_TEARDOWN_INACTIVE_ENTER/EXIT_READ,
	not sure.


