tech-smp: Re: NetBSD 1.6 & i386 SMP, ACPI, mly

Subject: Re: NetBSD 1.6 & i386 SMP, ACPI, mly
To: Brett Lymn <blymn@baesystems.com.au>
From: John Franklin <franklin@elfie.org>
List: tech-smp
Date: 01/15/2003 23:23:39
On Thu, Jan 16, 2003 at 02:04:12PM +1030, Brett Lymn wrote:
> On Wed, Jan 15, 2003 at 10:09:05PM +0100, Manuel Bouyer wrote:
> > 
> > I've got good results with the 1.6 branch + patches from
> > i386mp_plus16_stable.
> >
> 
> OK - but does that mean it is production ready?  Don't get me wrong,
> what we have is fantastic and it is deeply appreciated by me (at the
> very least ;-) but we must be wary of overselling what we have lest we
> get labelled with the "piece of crap" token.  I have experienced
> problems with -current SMP on my i386 boxen.  Normally these are
> solved by a "cvs update" and rebuild but whereas I can live with that
> on my home boxes, I doubt if it would be very satisfactory for a
> production environment - at least any production environment that I
> have had experience with would not tolerate it. 

Neither NetBSD nor any other open source OS of which I'm aware does
heavy regression testing or has a serious suite of tests against which
the kernel and userland are tested.  There is the /usr/src/regress
directory, which does have some tests, but it's pretty sparse.

"Production code" for open source OSes means the code maintainers have stopped
adding new functionality into the system and started the bug hunt.  The
community as a whole assists by running the feature frozen code on a
wide variety of systems in a wide variety of environments for a wide
variety of purposes and reports any problems, with patches where
possible.  This generally takes several months of world-wide effort.

All it really means is that "production code" is "statistically stable."
Nobody in the NetBSD community can guarantee the code beyond, "it works
pretty well on my systems, I haven't heard of any major problems."

"Statistically stable" and "works well on my systems" still produce
high quality code for the simple reason that code quality is rated on
exactly those two criteria: how often a failure occurs and how high
the performance is on observed systems.

Code in -current is never "statistically stable" as it constantly sees
new problems.  During the week the MPACPI code was being added, com* and
lpt* interrupts were broken.  Broken interrupts automatically preclude
code from being placed on production servers.  This breakage was
expected, though, as the interrupt code was exactly what the MPACPI work
was affecting.  By the end of the week MPACPI was cleaned up and all was
well again, better even.[1]

Still, that week interrupts were broken.  Another week it might be MP
code or PCI busses, or SCSI adapters.  Next week, for example, you can
expect some problems with the scheduler when the scheduler activations
code is folded into -current, but by the end of next week there will be
some very happy NetBSD hackers as the issues are ironed out (not to
mention a couple of blokes richly deserving of some pints).

If you want -current code that is production ready, you'll need to
statistically verify it yourself.  This means setting up lots of servers
and running a suite of tests on them, while simultaneously watching the
patches on the main trunk and selectively adding them.  This can be a
lot of work, especially since you'll be tempted to fold in new features.
If a normal, sanctioned release cycle takes months, you can expect your
work to take at least as long. 
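
As a rough illustration of what "statistically verify" might look like
in practice, here is a minimal Python sketch.  It assumes you keep a
simple per-host log of observed uptime and failures; the host names,
numbers, and the acceptance threshold are all made up for the example
and are not from any NetBSD tool.

```python
# Hypothetical soak-test log: (host, hours observed, failures seen).
# Every host name and number here is invented for illustration.
observations = [
    ("alpha", 720.0, 0),
    ("beta", 720.0, 1),
    ("gamma", 360.0, 2),
]


def failure_rate(obs):
    """Return failures per 1000 machine-hours across all observed hosts."""
    total_hours = sum(hours for _, hours, _ in obs)
    total_failures = sum(fails for _, _, fails in obs)
    if total_hours == 0:
        raise ValueError("no observation time recorded")
    return 1000.0 * total_failures / total_hours


def is_statistically_stable(obs, threshold_per_1000h):
    """True if the observed failure rate is at or below the chosen threshold."""
    return failure_rate(obs) <= threshold_per_1000h


if __name__ == "__main__":
    rate = failure_rate(observations)
    print(f"{rate:.2f} failures per 1000 machine-hours")
```

The threshold is yours to pick; the point is only that "production
ready" becomes a number you measure on your own hardware rather than a
label someone else hands you.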

All that said, -current is still a remarkably stable system, more so
than some commercial OS releases of yesteryear.  How high a quality
system you need for your production environment, how much time and
resources you're willing or able to devote to it, how hard your
requirements are... these all determine for you if it's "production
code."

Good luck!

jf
[1] See my prior posts on USB & MP on my quirky VIA-based MB.
-- 
John Franklin
franklin@elfie.org
ICBM: 35°43'56"N 78°53'27"W