tech-kern: Re: hashinit.9 [Re: 1. uiopeek? 2. hashinit/hashdone?]

Subject: Re: hashinit.9 [Re: 1. uiopeek? 2. hashinit/hashdone?]
To: None <tech-kern@NetBSD.org>
From: Chapman Flack <nblists@anastigmatix.net>
List: tech-kern
Date: 06/05/2006 11:15:42
brian@surge.insomnia.org wrote:
> Without the man page, you have to infer the contract. This inference 
> basically comes down to this:
> 
> /*
>  * General routine to allocate a hash table.
>  * Allocate enough memory to hold at least `elements' list-head pointers.
>  * Return a pointer to the allocated space and set *hashmask to a pattern
>  * suitable for masking a value to use as an index into the returned
>  * array.
>  */
> ...
> Note the things this does not say:
> 
> 1: It does not say the implementation is powers of two (it is)
> 
> 2: It does not say that hashmask can be used to represent the full range 
> of indexes.

Ok, let's begin then with only the inferences you have already
allowed.

"set *hashmask to a pattern suitable for masking a value to use as an
index into the returned array."

==> for value v, otherwise unspecified, v&*hashmask may be used as an
    index into the array.
    
    (No inference yet, just clarification and use of definitions)
    
    Question: the comment does not state in so many words that
    v&*hashmask may be used as a /valid/ index into the array. Does that
    give us any other defensible way to construe it?

Going on. It has said nothing at all about this value v or where it
could come from. We must, however, consider the case where v is obtained
by putting some kind of a key through some kind of good hash function.
Notice I am not saying we /must/ assume that v /has to be produced/ that
way, but only that it is one use case that the contract for a hash table
cannot have meant to rule out. This permits an adversary style of
reasoning to be used for the next question:

    Question: what must be true of *hashmask to ensure that
    v&*hashmask is never an index out of bounds?

    Next Q (first form): what must be true of *hashmask to ensure that
    there is some distribution of values v for which the hashing scheme
    uses the allocated space without waste?

Here you may perhaps object that I am assuming no waste is tolerable.
Could there not be some possible implementation that would trade a bit
of wasted space for some other engineering benefit? So let's be generous
and say we're willing to contemplate schemes that waste any amount up
to, but not including, half of the total allocated array.

   Q (restated): what must be true of *hashmask to ensure that there is
   some distribution of values v for which less than half of the
   allocated array space is wasted?

I'm being careful not to require you to ensure the property for
/any old distribution/ of v, because you can't. I could write you
a hash function that would waste most of your array no matter how
you constructed *hashmask. So the question is only, how can you
construct *hashmask to avoid ruining a good distribution by wasting
space anyway?

   Q: given what must be true of *hashmask, what is true of the array
   size?

If your last answer involves an inequality rather than a strict
equality, explain any purpose you think could be served (without
altering the interface) by choosing any but the bounding value.

I think you'll get the picture.

This thread does make a valuable contribution by illustrating exactly
why it's important to document such things. You simply can't count on
every person to draw every necessary inference every time. Sometimes
even the most experienced people are in a hurry or have other things on
their minds. Necessary inferences that are a few deductive steps removed
from grepping the bare text will sometimes be missed. And every time
that happens there is some risk of trouble: "Hmm, I don't suppose it
could break anything if I change such-and-such in this
implementation...."

> I was using the word chains to keep the names consistant with your 
> ...
> IE: the variable the source code calls elements. I personally
> prefer the source's naming than the man pages; is there a reason
> they differ?

Yes: terminology. Elements are the things you store in a hash table.
For any hash table, there is a chance that two or more elements will
hash to the same value. There are different ways to deal with such
collisions, and one is chaining: you make each hash slot a chain of
zero or more elements. To get the table the right size, you pass
hashinit the number of chains you want; this != your expected number
of elements, unless your design load factor is 1.

The variable in the code has been called 'elements' ever since the
4.4bsd-alpha version, which is before it contained a built in
assumption that chaining would be used. That version also used a
sizing scheme that allocated some number of slots between the
value you passed and half that value. It's possible and reasonable
that the author of that code, ca. 1992, really meant elements. The
code has changed since then; so far, the name of the variable hasn't.

> Saying 'dont do this' is pretty inexpensive.

It also isn't a proposal of any language that could realistically be
added to the man page to communicate to the reader what it is he's being
advised not to do. If you try your hand at actually wording it,
I think the effort will also clarify for you why hashsize*esize does
not depend on the size of the user's chains, and what it does depend on,
and that will help you estimate the likelihood of hitting that
particular implementation limit, and judge the number of words you
think it is worth in the man page.

-Chap