tech-perform: RE: namei caching of newly created files?

Subject: RE: namei caching of newly created files?
To: Luke Mewburn <tech-perform@NetBSD.org>
From: Gordon Waidhofer <gww@traakan.com>
List: tech-perform
Date: 01/19/2005 21:06:40
> -----Original Message-----
> From: tech-kern-owner@NetBSD.org [mailto:tech-kern-owner@NetBSD.org]On
> Behalf Of Luke Mewburn
> Sent: Wednesday, January 19, 2005 7:35 PM
> To: tech-perform@NetBSD.org; tech-kern@NetBSD.org
> Subject: Re: namei caching of newly created files?
> 
> 
> On Wed, Jan 19, 2005 at 05:47:31PM -0800, Bill Studenmund wrote:
>   | Becasue we don't build in-kernel lists of files in the directory, we have 
>   | to do a full search (O(n)) of the directory to see if the name is or isn't 
>   | there already.
> 
> FreeBSD implemented UFS_DIRHASH a while ago:
> 	
> http://docs.freebsd.org/cgi/getmsg.cgi?fetch=685987+0+archive/2001/cvs-all/2001071
> 5.cvs-all
> 
> NetApp did the same thing for their WAFL file system over a decade ago:
> 	http://www.netapp.com/tech_library/3006.html
> 
> 
> We should just lift the working UFS_DIRHASH implementation from FreeBSD.
> Three years ago I did a work-in-progress port of UFS_DIRHASH from FreeBSD
> but I didn't finish it.  I have the code around somewhere.
> 



With all due respect to whom it may concern......YECH!!

We (Traakan) also did dir hashing circa 1994, but no paper.
I believe our software release predates NetApp's
(coincidence?). I'm reasonably sure it was a common customer's
requirement that drove both designs.

I think I'll go ahead and disclose this since it is
now 10 year old stuff.

Here's a better way to do it than the incore side table.
It is not backward compatible with existing large directories
but is for the normal, one block directories. A case can easily
be made for why it need not be backward compatible.

The code impact is negligible.

95+% of directories are a single block. The rest range
from a little big to REALLY REALLY REALLY big. This algorithm
is ellegance itself for the 95+% case because it results
in no real costs and is also backward compatible. It is a
HUGE win on big directories without a lot of incore side
table weirdness. It is an expanding hash so small stays
small, a little big stays a little big, and bite the bullet
for really really big.


int	buckets[] = { 1, 8, 64, 256, 256, 256, 256, .... };

lookup (dir d, char name[], result *ret)
{
	int		key = hash(name);
	int		base = 0;
	int		i;
	int		blockno;
	int		class;

	for (i = 0; buckets[i]; i++) {
		class = key % buckets[i];
		blockno = base + class;
		rc = lookup_in_block (d, blockno, name, ret);
		if (rc > 0) {
			return 0;
		}
		base += buckets[i];
	}
	return ENOENT;
}

lookup_in_block (dir d, daddr_t blockno, char name[], result *ret)
{
	// if blockno of d is empty (a hole), return 0
	// read logical block blockno in directory d
	// sequential search for name[], note any empty slot
	// if found, fill in ret
}

A superblock option bit or distinct magic number
should control whether or not directories work this
way.

Regards,
	-gww