Subject: Re: the path from nathanw_sa -> newlock2
To: Aaron J. Grier <agrier@poofygoof.com>
From: Perry E. Metzger <perry@piermont.com>
List: current-users
Date: 02/12/2007 16:56:46

"Aaron J. Grier" <agrier@poofygoof.com> writes:
> Q: why weren't the M:N bugs able to be fixed?
> A: the M:N code was complex and difficult to maintain.  nobody except
>    Nathan understood it, and he didn't have time to work on it.

Others understood it to a greater or lesser extent, but no one
seemed to be able to kill all the bugs, and some of them were
quite severe. Nathan no longer seemed to have time to attack the
problem, but I'm not sure that was the only issue. Code like this is
insanely complex, and complexity breeds bugs. It isn't clear that even
an ideal group of developers with lots of time could have made it work
right -- consider that Sun, which had no shortage of people or
resources, ultimately abandoned M:N as unmaintainable.

> Q: if the M:N code was so complicated, why was it ever merged into
>    -current?
> A: at the time, NetBSD had no kernel threading support whatsoever, and
>    it was becoming clear that user-land libraries like pth and proven
>    threads weren't going to cut it.  given the choice between
>    experimental M:N threading code and no threading code, the choice was
>    obvious.

It wasn't obvious at the time that we'd have so much trouble making
the M:N code solid. There was no thought of it being an "interim"
measure or "better than nothing" -- nobody yet understood how hard it
would be to get it working correctly. We believed that it would be
the right solution long term.

> Q: why was M:N switched out completely for 1:1 without any overlap
>    period?
> A: resources were not available to deal with porting the existing M:N
>    code to the new kernel locking primitives due to the complexity of
>    the existing M:N code.  plus the M:N code was buggy.

If it had been simple to carry the code forward, perhaps it would have
been done, but it was not simple or even apparently feasible.

> Q: new locking primitives?
> A: the newlock2 branch started as a project by Andrew Doran to increase
>    the amount of parallel code paths in the kernel to avoid the "big
>    lock" which doesn't scale well.  given that multi-core processors are
>    the current standard for many architectures, this was deemed of great
>    importance by core.

I believe the newlock stuff was started by Jason Thorpe, actually, but
Andrew certainly brought it to fruition.

Perry