Keeping Compatibility - yes please!

To: Antti Kantee <pooka%cs.hut.fi@localhost>
Subject: Keeping Compatibility - yes please!
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Thu, 22 Jan 2009 17:22:53 +0700
    Date:        Tue, 20 Jan 2009 16:32:33 +0200
    From:        Antti Kantee <pooka%cs.hut.fi@localhost>
    Message-ID:  <20090120143233.GA27989%cs.hut.fi@localhost>

  | Seriously though, I almost agree with you, but I would propose to drop
  | compat for cases which can't be accommodated with reasonable difficulty.

Sorry, that statement/policy is meaningless without some kind of idea
what "reasonable difficulty" means.    If you mean that we shouldn't
worry about it if it is impossible, then it is kind of hard to disagree.
Kind of, but not completely - the question then becomes more whether the
change that causes this impossible compatibility problem is really worth the
loss, and if it is a change that is needed, is it being done the right way?
And then just maybe, if the compatibility issue isn't too serious, we say "OK".

On the other hand, if "reasonable difficulty" means "I'd have to think a
bit then spend half an hour coding, and that's unreasonable for me", then
I certainly would disagree with your proposed change.

  | Out of our examples, it is not unreasonable for kernfs to support
  | dynamically registering nodes.  This even has nothing per se to do
  | with compat.

Can't comment on that as I have no idea what this relates to, but if
it has nothing to do with compat, then it most likely isn't relevant
anyway.

  | What would break if the routing socket wouldn't support old message types?

Ah - routing ... the routing socket is how the routing daemons communicate
with the kernel.   Whether or not they'd break in a way that is important
depends upon just which messages you're proposing not be supported.

But if I boot a new kernel, then can't connect to it, because my
network packets aren't being processed correctly by the routing system
(perhaps because bgp stops exporting my routes to the internet)
so I can't even get in to shut down cleanly to revert to the previous
kernel, I am not going to be very happy...

  | If it's just a handful of system utilities, I don't think upgrading the
  | whole system is an unreasonable requirement.

It depends what the system utilities do.   If the only thing broken is
some report that a human might like to look at occasionally (like perhaps
the reports that netstat -r doesn't work from an old netstat on a current
kernel), and that's the limit of the breakage, then I can handle that - if
I need the output, I can just compile a new netstat (even run it in a
chroot with a new libc and whatever else it needs in really hard cases,
I have done that kind of thing before).  The whole system wouldn't need
to be upgraded for that, so reverting to the old remains trivial.

But if the system doesn't operate correctly (including everything that's in
use to keep the overall system operating), then the kernel should be
providing enough COMPAT_NN glue to fix it.

  | It's not like the compat
  | is perfect even now, though we might want to claim it to be.

I don't know who claims that, almost nothing is ever perfect.  Mostly,
up to now, NetBSD has been doing a very good job of keeping at least
releases, and usually anything that's existed for a long time even if
not yet released, operating in a compatible enough way that we almost
never have any significant issues just running new kernels on old userlands.
That's the way it should be.  Minor annoyances can be tolerable (I've had
ifconfig refuse to work, requiring use of a new version for some features),
as long as the basics are OK.

  | (yes, I value compat.  I'm currently running 5.0_BETA kernel on a
  | 4.99.30'ish userland.  But maybe it's a better idea to support rollback
  | of userland utilities than try to pile a ton of cornercase compat into
  | the kernel)

No, it isn't.

Doing anything like that, any real relaxation of the compatibility demand
would just be NetBSD shooting itself in the foot.

Someone (I forget who) claimed in an earlier message on this issue that
failing to update everything (ie: just updating the kernel and no more,
just like you said you're doing) is "lazy".   That's wrong.   Failing to
provide compat code is lazy (either lazy in just not doing it, or lazy in
not finding a forward path that allows it to be done).  Failing to
(always) upgrade everything is just prudent (cautious) - it is the
intelligent way to upgrade any real system (ie: any system that's used
for real work, from a laptop that you read your e-mail on to a server
providing vital database services to your company - as opposed to a toy
system used for testing only that you can trash whenever you like).

Regardless of how much testing a system has had, there's no way to know it
is going to work when put into production other than actually testing it
in production.   Setting up test environments, duplicating everything, ...
all help, but none of them place the same pressure on the system as actually
being used.

With that, any upgrade needs a quick, safe, rollback method.  Your "rollback
of user utilities" might suffice for that, it depends upon how reliable it
turns out to be, but in any case, it has to be provided as a tool with the
old system before it is upgraded - for every possible old system that anyone
might ever want to upgrade to the new one (because it has to be able to run
with the old kernel if the new one fails) - and it needs a failsafe mechanism
that can automatically go back to the old system with all old binaries, even
if the new kernel appears to be running - just in case while "running"
the new kernel has managed to lock out admins from being able to reboot.

All that would be a lot of work.

But even if it looks to be a good idea, because it could be done once,
then forever into the future we'd never have to worry as much about
compat, it is still the wrong approach.

Go back to the routing socket issues again (whatever they are right now).
Then recall that some of the main users of the routing socket are from
pkgsrc (net/quagga, net/xorp, perhaps net/net-snmp, I'm not sure) and it isn't
just system utilities that you need to be able to supply new versions of,
and rollback to older, but pkgsrc additions as well.  Do you really want
to have to investigate every package that exists to see if any will
require recompiling for some proposed incompatibility?   Which is more
work, that or providing compat code??

What's more, can you investigate and provide new versions of my code
(that you don't have the source to) so you can see whether a new version
of that is needed with the new kernel?   What if I have my own private
routing daemon (using the routing socket) that you know nothing about?
How are you going to keep that one working with a rollback scheme?

Last, I'm sure you've seen, and perhaps even generated, messages of the
form "can you try a current kernel and see if the problem still occurs?",
or "I have compiled a new kernel for you with a patch, can you boot it
and see if that works?" when developers are attempting to help users
overcome problems (bugs, or missing features, drivers, etc).

That kind of thing happens reasonably frequently, and is greatly appreciated
by those who get the benefit of it.

But, if in order to follow one of those suggestions the user has to make
any kind of sweeping changes to their system they're just going to say "no".
Then they don't get the help they need, and the developer gets no feedback
as to whether the bug is actually fixed or not.  The latter is perhaps the
more important, we all know how difficult it can sometimes be to find test
cases for some of the weirder bugs, you don't want a potential source of
help from someone who is experiencing a problem to vanish because they refuse
to assist you to test the fix.   Remember here the user doesn't want to
upgrade, they just want the bug fixed in the system they're using.

Eg: suppose I was to report a problem in NetBSD 4 (driver error, filesystem
lockup, scheduling weirdness, memory leak, or whatever), and some developer
(perhaps you) recognises the problem and says "I think I fixed that last
week, can you try a current kernel and see if the problem is still there?"

Right now, with current NetBSD policy, I just say "sure" and either download
or compile a current kernel, copy it to /netbsd.test on the affected system,
schedule a convenient time for a reboot, and test the thing.   A few minutes
of testing and back to the old kernel, with a "yes it is fixed" or "no, the
problem is still there for me, this is what happened..." report for you.

If fixed, we now can decide whether it is possible to put the fix back into
NetBSD 4, if you can do that, you supply me a patch, and I test that one,
then you request a pullup, and everyone is happy.   If the change is too
intrusive for that, then I get to decide whether the problem is serious
enough that I should consider upgrading my system to NetBSD 5 or even current
in order to get the fix, along with dealing with all of the issues that
any upgrade causes production systems - but at least I know there is a
fix available, and my problem is not just being ignored - once again I'm happy
(or fairly happy.)  If the problem isn't fixed then we all (especially you)
know that the problem still exists and needs more work, which is useful itself.

On the other hand, if in order to test the new kernel, I'm going to need
some undefined number of other changes to my system in order for it to work
well enough to test, then I'm probably just going to say "sorry, I
can't run a current kernel, put your change in NetBSD 4, and I'll test that".
At that point, if you're anything like me, you're going to say "Unless I
know it is going to fix the problem, that's too much work".

In the other case that happens, where you (now) supply a pre-built kernel
for me to test (perhaps because the fix is very new code, that you're not
ready to distribute quite yet, or isn't a fix but just a system that contains
lots of debug/diagnostics to get more information) right now that's no great
concern, not hard for you (you're compiling kernels to test your mods locally
anyway) and not for me - but if you have to build a complete release (perhaps
including all of pkgsrc), and I have to download that and install (much) of
it (and then uninstall it all again as soon as the test is over), once again
I'm just going to say "no".

NetBSD's current compatibility policy is important, we need to keep it,
even strengthen it wherever possible (mostly by gradually redesigning those
interfaces that turn out to cause problems to make them more stable).
Weakening it would be a disaster.

kre