Subject: Re: Config ...
To: Chris G. Demetriou <cgd@netbsd.org>
From: Stefan Grefen <grefen@hprc.tandem.com>
List: tech-kern
Date: 08/21/1998 23:04:41
In message <87ogtewcoq.fsf@netbsd1.cygnus.com>  Chris G. Demetriou wrote:
> Stefan Grefen <grefen@hprc.tandem.com> writes:

[...]

> > 
> >     Today I would end up with smc99 and wd180, after some
> >     iterations, which is completely bogus. With devices really gone
> >     I would have to redo all the network configuration every time.
> 
> Actually, if the detach code were actually finished, it'd end up
> repeatedly attaching and detaching the same device numbers.  With
> multiple slots, you could get into the situation where you always use
> more and more unit numbers, but that is a bug, and easy to fix.

Yes, it can be fixed, but that forces you to drop every kind of state
across a removal.

> 
> Re: network configuration: "that's what a daemon to sense that you've
> inserted a card and run a script based on that is for."  Also,
> realistically, in the 'modern world' (all i really care about,
> w/pcmcia), you want to be using DHCP on your network interfaces, to
> configure them and find you a good IP address.

I run DHCP on my laptop, and roaming works OK (even between Frankfurt and
Cupertino; I only had to reboot to get cron & friends into the correct
timezone :-)) as long as the disconnection time is longer than the DHCP
check interval. But for just removing the card for a short time, that is
overkill.  If the interface would just go dormant, the TCP connections
could survive.

> 
> 
> >   2) SCSI:
> >     If you rescan the SCSI bus and the user has switched the SCSI IDs of
> >     two devices, what do you do?
> >     Assume it's two tape drives?
> >     I know you shouldn't do it, but Joe Blow will do it and scream if
> >     his data goes south.
> 
> You detach the first instance, attach the second.

The problem is that after the switch the same device ends up on a different
physical drive. 

> 
> >     More common: you turn off one of your disks/tape drives.
> >     Are you going to renumber the remaining ones?
> 

> Already-attached devices should stay attached, and not be detached
> unless the kernel happens to sense that they've gone away.  (I.e. if
> you don't tell the system to freeze the SCSI bus -- however you might
> do that -- and disconnect stuff temporarily, it might go away.  But
> that's your fault because you didn't freeze the SCSI bus.)

I agree you can't renumber them, but where do you attach the next device?
Let's assume a drive on a different SCSI ID gets turned on. It gets the
device number of the detached device. Now you turn the first one on again.
It ends up on a different device.
This happens (I have some disks I turn on only when I need them, mostly
for noise reasons).

This is dangerous for the data, and not intuitive for a naive user either.
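
To make the failure concrete, here is a toy model of the scenario above
(Python, purely illustrative). It assumes the kernel hands out the lowest
free unit number on every attach; UnitTable and its methods are names I
made up for this sketch, not anything in the tree.

```python
# Toy model: the lowest free unit number wins on every attach.
# UnitTable, attach and detach are hypothetical names.

class UnitTable:
    def __init__(self):
        self.units = {}                 # unit number -> hardware label

    def attach(self, hw):
        unit = 0
        while unit in self.units:       # find the lowest free unit
            unit += 1
        self.units[unit] = hw
        return unit

    def detach(self, hw):
        for unit, owner in list(self.units.items()):
            if owner == hw:
                del self.units[unit]    # the unit number becomes reusable


table = UnitTable()
first = table.attach("disk at SCSI ID 2")   # first disk gets unit 0
table.detach("disk at SCSI ID 2")           # powered off, unit 0 is freed
other = table.attach("disk at SCSI ID 4")   # another disk takes unit 0
again = table.attach("disk at SCSI ID 2")   # first disk is back as unit 1
```

Anything that remembered "unit 0" for the first disk (an fstab entry, a
backup script) now talks to the other disk.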

> 
> > The problem is that there may be device information kept in places where
> > you don't expect it (maybe implicitly in some daemon process). You can't
> > delete it all when the kernel wants to unconfigure the drive. (It would
> > be a major effort, I fear, doing it at the kernel level only.)
> > I tried that a long time ago (for my old pcmcia code).
> 
> In the case of things like ethernet, you already have that problem
> when, e.g. you want to switch networks.  The solution is _not_ to come
> up with a horrible kludge, it's to _fix_ the programs so that they can
> cope with a dynamic environment.
> 
> No, not an easy task.  Yes, the right solution.
> 

It is the right solution, but we will reach it only in an infinite amount
of time. The device ID could at least prevent damage in the meantime.

> 
> > > If you detach a device, it should go away.  Completely gone.  Not
> > > 'flagged inactive', not 'kinda-sorta there.' Gone.  detached.  no
> > > longer a valid device.
> > 
> > But you can't delete all 'secondary' knowledge of the device. That gets
> > nasty if you do a bus rescan.
> 
> Rescan shouldn't reassign unit numbers to children.  It should simply
> say "are there any children here which aren't attached, and if so,
> let's attach them."  There's no point in renumbering everything just
> to handle a new device.

No, but if you really delete a device, its minor/instance number will be
reused, and that's where the problems start. The 'secondary' knowledge may
still be out there and take the new device for the old one.

> 
> > The kernel calls a function to take down the device, calls a bus specific
> > function to reclaim io-space/interrupts.
> > Then it switches all entry points to error returns.
> > This can be hidden in specfs and the generic network interface code.
> 
> ... and any other place that might need access to the device, of
> course.  tty layer, if the device is open, etc.  I.e. you kludge a lot
> of things in a lot of places, to avoid the complexity of doing detach
> right.

My assumption is that I can't completely detach (not with today's
applications). I also don't think that it will get better, so I try to make
the OS robust against pilot errors and bad software.
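
For reference, the "switch all entry points to error returns" step I
described could look roughly like this (Python, illustration only). The
Device class, take_down() and the choice of ENXIO are assumptions made for
the sketch; a real kernel would do this in specfs and the network
interface code.

```python
import errno

class Device:
    """Hypothetical device with a single entry point."""
    def __init__(self, name, read_fn):
        self.name = name
        self.read = read_fn             # live entry point

def dead_read(*args):
    # Stub installed once the hardware is gone.
    raise OSError(errno.ENXIO, "device removed")

def take_down(dev):
    # I/O space and interrupts are assumed to be reclaimed by the bus
    # code already; here we only redirect the entry point.
    dev.read = dead_read

dev = Device("net0", lambda n: b"\0" * n)
data = dev.read(4)                      # works while the card is present
take_down(dev)
try:
    dev.read(4)
    err = None
except OSError as e:
    err = e.errno                       # callers see an error, not a crash
```

A process that still holds the device open keeps a valid handle and simply
gets errors until the hardware returns or the user declares it gone.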

> 
> > There should be an option where the user can say this device is really
> > gone.
> 
> And the user who doesn't know about this and who goes from card to
> card to card (which should work fine) gets stuck with an increasing
> number of devices and kernel bloat?  "no thank you."

You mean somebody inserting 10 different D-Link ethernet cards??
Yes, he'll get an increasing number of devices. But I think you can find
100 people removing/inserting the same D-Link before you find somebody
doing it with different cards.

> 
> The logical behaviour of devices is that when you detach them, they
> are gone.  As far as your computer's concerned, when you've taken your
> ethernet out, you may well have put it straight into a shredder.  As
> far as I can tell, as far as a naive user is concerned, that is true
> as well.  They just want whatever card they happen to stick in to work
> (via dynamic user-land adjustment of configuration parameters,
> e.g. start up dhcp on the network interface) and when they take it out
> it's _gone_.
> 

I think a naive user would prefer that if he just removes the card for
a short moment, his telnet session stays up.

I think you can justify both ways with the 'naive user proof'
(the 'naive user proof' just proves that the dwim (do what I mean) function
hasn't been implemented).

The naive user expects the device to be gone if he puts the card in the
shredder, and to stay if he wants to put it back in.


[...]

> > 	1) LKM is not in a state to really allow that (only for the basic
> > 	    driver)
> 
> Uh, bug, not feature.

Yes, any objections if I start to fix that? The problem is that if you
want to do demand-loading you need a kernel ld.so.

> 
> > 	2) If we go down that path, we should introduce 'virtual devices'
> > 	    like eth[0-n]  etc. in which case the real hardware id of
> > 	    the device is never exported to a real 'user'-process.
> 
> Why?  I would argue that any user-land process that wants to talk to a
> hardware interface, if properly written (if there are APIs available)
> _should_ be able to cope in some sane way with the interface going
> away.
> 
> Most aren't properly written, sure.  So, have your dynamic-event demon
> kill them and restart them if you want that behaviour, or...
> 
> What do virtual devices buy you?

Just convenience. You isolate the knowledge about the hardware in one
place.

> 
> > > 	* method for direct-config bus device drivers to say to a daemon
> > > 	  "I have this device here, that i've not a clue about.  What can
> > > 	  you do for me."
> > 
> > I would make that passive, e.g. have a process ask which devices are in
> > which state (up-and-running, unprobed, probe-failed). The daemon would
> > come in too late anyway.
> 
> "syslog starts up relatively late in the game, but it manages to get
> kernel messages anyway..."

By reading the msgbuffer. dmesg can do that trick 3 days after booting.

The daemon would start after the probe has completed.
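
Something like the following is what I mean by "passive" (Python,
illustration only). The kernel records a state per location during
autoconfiguration and keeps it; a process asks for it whenever it likes,
much as dmesg reads the message buffer. The State names come from my list
above; record(), query() and the location strings are invented for the
sketch.

```python
from enum import Enum

class State(Enum):
    UP_AND_RUNNING = "up-and-running"
    UNPROBED = "unprobed"
    PROBE_FAILED = "probe-failed"

# Filled in during autoconfiguration and kept afterwards.
device_states = {}

def record(location, state):
    """Kernel side: remember what happened at this location."""
    device_states[location] = state

def query(state):
    """Daemon side: list the locations currently in the given state."""
    return sorted(loc for loc, s in device_states.items() if s is state)

record("pcmcia0 slot 0", State.UP_AND_RUNNING)
record("pcmcia0 slot 1", State.PROBE_FAILED)
record("isa0 port 0x300", State.UNPROBED)

# The daemon can run this days after boot and still get an answer.
needs_driver = query(State.PROBE_FAILED)
```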

> 
> It's not as if there's much state to worry about anyway.  i mean, the
> way I see this working is:
> 
> 	configuration happens, some devices maybe don't get matched.
> 
> 	daemon says "rescan."
> 
> 	bus code passes back the information "I have devices X, Y, and
> 	Z which I can't cope with, here's information about them, what
> 	can you do for me?"
> 
> 	daemon loads some more kernel drivers
> 
> 	daemon says "rescan."

Why rescan after booting? Having a way to get this information after a
boot, without a kernel debugger, would be a win anyway.

> 
> etc.  In my world, there's no "sort-of attached," there's only "is" or
> "isn't."  "Isn't" only happens because:
> 
> 	* the kernel wasn't told to attach that type of device there
> 	  ("locator-related" issues)

We should have a flag in config, 'no-auto probe', so conflicting devices
can be configured, but the user has to decide whether he probes them
or not.

> 
> The biggest sticking point in my mind is the issue of "how do you
> decide that a device is really gone," and what meaning does "hardware
> not there but still attached" have.
> 
> As noted, I think it's ... nonobvious to have attached device stick
> around after the hardware is removed.  "If somebody borrows your
> blender, do you queue jobs for your blender?  Do you try to wash it?"
> 8-) It also leads to the situation where a user "just doing the naive

Do you assume that anything that takes its place is a blender too?

> thing" (in an abnormal, but not too weird way) can get into a bad
> situation and have to dig through the manuals or reboot to find out
> what's going on.

This can happen with both methods. Annoying a naive user is fairly simple.

> 
> The "still attached" issue is more irksome.  You can't assume that
> settings can be kept over hardware removal; the hardware may not allow
> that.  (e.g. power on the tape drive, it rewinds the tape, your state
> about that tape is now hosed, and the kernel can't know that.)  And

This normally results in a SCSI reset. Most tapes that power up with a tape
inserted refuse write commands until they see a positioning command.

> there are other pathological cases, where e.g. you're plugging a
> PCMCIA card into a DOS box because you want to change some EEPROM
> setting that you can't change from NetBSD.  It's really the same card
> that you took out, but it may be detected differently.  You'll want
> your existing dynamic configuration software solution (e.g. network
> restart scripts) to cope with it, but it might be attached as a
> different interface, etc.
> 
> In a nutshell, keeping devices "still attached" when hardware has been
> removed is non-intuitive, and it adds a bunch of Weird (wrong) semantics.

I may agree about the semantics, but both ways of handling it
can be non-intuitive, depending on expectations.

How about the following:
    * on a removal the device goes away completely
    * the kernel keeps a list of Device-IDs (as proposed) and the device
	each one was attached to
    * if the same HW-device comes back, it goes to the same device
    * if a different HW-device comes, it goes to a different device

That keeps the "device is gone" semantics.
It is a safeguard against bad programs.
This should be intuitive for most people.
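
A minimal sketch of that bookkeeping (Python, illustration only, not a
proposal for actual kernel code). The Bindings class, the "eth" driver
name and the use of a MAC-like string as the Device-ID are all assumptions
made for the example; forget() stands in for the "this device is really
gone" option I mentioned earlier.

```python
class Bindings:
    """Hypothetical table mapping Device-IDs to unit numbers."""
    def __init__(self):
        self.remembered = {}    # (driver, device id) -> unit number
        self.attached = {}      # (driver, unit) -> device id

    def attach(self, driver, dev_id):
        key = (driver, dev_id)
        if key in self.remembered:
            unit = self.remembered[key]         # same hardware, same unit
        else:
            used = {u for (d, _), u in self.remembered.items() if d == driver}
            unit = 0
            while unit in used:                 # never reuse a remembered unit
                unit += 1
            self.remembered[key] = unit
        self.attached[(driver, unit)] = dev_id
        return unit

    def detach(self, driver, unit):
        # The device goes away completely; only the binding survives.
        self.attached.pop((driver, unit), None)

    def forget(self, driver, dev_id):
        # The user says the hardware is really gone.
        self.remembered.pop((driver, dev_id), None)


b = Bindings()
a = b.attach("eth", "00:00:5e:00:53:01")        # first card: unit 0
b.detach("eth", a)
other = b.attach("eth", "00:00:5e:00:53:02")    # different card: unit 1
back = b.attach("eth", "00:00:5e:00:53:01")     # first card again: unit 0
```

The cost is one small table entry per distinct piece of hardware ever
seen, which the "really gone" option can reclaim.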

Stefan

--
Stefan Grefen                                Tandem Computers Europe Inc.
grefen@hprc.tandem.com                       High Performance Research Center
 --- Hacking's just another word for nothing left to kludge. ---