Subject: Re: 1.3 broken
To: Chris G. Demetriou <cgd@pa.dec.com>
From: Jukka Marin <jmarin@pyy.jmp.fi>
List: port-i386
Date: 01/31/1998 23:43:00
On Sat, Jan 31, 1998 at 01:23:24PM -0800, Chris G. Demetriou wrote:
> > 1) reboot command doesn't reboot the system.  I see this on two different
> >    pc's.  The machines usually hang after syslogd is killed, but num lock
> >    key still works (num lock led blinks if I press num lock), so the machine
> >    isn't completely dead.
> 
> put DDB into a kernel, break into DDB when the system gets into this
> state, and see what the system is doing.

I will try this soon.  Thanks for the hint.

> I've got N machines (3 x86 at home, 1 x86 at work, and a bunch of
> non-x86 systems) which don't show symptoms like this, so my first
> guess would be that it's related to your local configuration.

Well, 1.2 never did this, but 1.3 seems to do it more than 50% of the time.

> Are you using any of the 'weird' file systems (any of the code under
> miscfs other than that in deadfs, genfs, or specfs)?

No.

> Are you mounting
> file systems over NFS in such a way that after the 'shutdown' has
> killed many processes you'd no longer be able to talk to your servers?

I do mount things like /home using NFS, but / and /usr are always local.
Well, the NFS server has the same reboot problem and it doesn't mount
any NFS disks from other machines, so I don't think it's an NFS thing after
all..

> > 2) The IDE driver is definitely broken.  On system A, it doesn't detect
> >    the primary hard disk on the first IDE port if I have an ATAPI CD-ROM
> >    attached to the same port.  On system B, I have an IDE disk on the primary
> >    IDE interface and an ATAPI drive on the second one (as a slave).  Most
> >    of the time, the kernel sees some imaginary wdc1 drive at the secondary
> >    controller.  Then the kernel notices that the drive isn't working
> >    properly and keeps retrying and the machine never comes up.
> 
> I don't really know what's going on here. (This sounds completely
> different from the interrupt-related problems which have been
> reported.)

Yeah, it's different from those..

> If you can get one (I don't know that any are built and waiting to be
> downloaded, and I can't easily create one), you might try a -current
> boot floppy to see if it does any better.  The wdc probe code works a
> bit better in -current.

Phew, when I finally get these machines working (and upgrade three more ;)
I'm not sure if I want to start all over again.. ;)  But yeah, I could
try a -current kernel on one machine.

> I assume that all of your drives are correctly jumpered for master or
> slave, as appropriate?

I believe so - at least they worked just fine under 1.2.

> > 3) The com driver is unreliable.  My PPP connection to the world died
> >    suddenly and pppd reported "serial link appears to be disconnected",
> >    but the pppd process never exited like it should.  Instead, the serial
> >    port locked up just as if interrupts were no longer generated.
> >    I verified this with kermit.
> 
> I've not had these problems on my one system which uses PPP, but that
> could just be an anomaly.  (works bloody well for me, though...  the
> machine in question's been up for a while, and has only gone down when
> I screwed up the packet filter rules so badly that I had to reboot it.)

I believe that when I was using 4 com ports that shared an interrupt
the ports were interacting and losing interrupts or something like that.
I had to configure four "normal" com ports to get the TCOM card ports
going while my tcom driver was broken.

I haven't had this problem since I moved my PPP line to a com port on
the motherboard, with its own irq.

Could it be that irq sharing between com ports isn't working reliably?
That's how it seems to me, at least.
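
To make the suspicion concrete, here is a rough sketch of one way a shared
line can lose interrupts.  ISA interrupts are edge-triggered, so if a second
port raises the line while the first is still being serviced, no new edge
arrives; a shared handler therefore has to keep sweeping every port until one
full pass finds nothing to do.  This is only a user-space model, not code from
the com driver: struct fake_uart, fake_uart_service() and shared_intr() are
names I made up, and I have not checked whether the real driver behaves this
way.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one port: number of events waiting for service. */
struct fake_uart {
    int pending;
};

/* Service at most one pending event; return true if something was done. */
static bool fake_uart_service(struct fake_uart *u)
{
    if (u->pending == 0)
        return false;
    u->pending--;
    return true;
}

/*
 * Shared handler (hypothetical): sweep all ports repeatedly and return
 * only after a complete pass in which no port needed service.  Stopping
 * after the first pass would leave events behind with no new edge to
 * announce them.  Returns the total number of events serviced.
 */
static int shared_intr(struct fake_uart *ports, size_t nports)
{
    int handled = 0;
    bool again;

    do {
        again = false;
        for (size_t i = 0; i < nports; i++) {
            if (fake_uart_service(&ports[i])) {
                handled++;
                again = true;
            }
        }
    } while (again);

    return handled;
}
```

If the handler (or the card's own interrupt logic) returns before such a clean
pass, the symptoms would look a lot like what I saw.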

> This worked under 1.2?  What 'new features' does 1.3 trigger on your
> machine (look at dmesg output).  Anything like isapnp?

I haven't tried sharing interrupts using the regular com driver before,
so I don't know if this would work under 1.2.

> "interrupts no longer generated" sounds like an interrupt conflict.

I think I have heard that sharing irq's between serial ports should work..
Should it? :)

> > 4) Why has the keymap changed so that I can no longer make the del/bs
> >    keys work properly in mg?  I'm using pcvt.
> 
> I dunno; i use pccons.  sorry.  8-)

This problem went away - it might have been in the 1.3beta kernel that
I was using.  I'll try upgrading the kernel on the other system as well
and see if the problem goes away there too..  (If not, my wife will
probably kill me.. ;-)

> > 5) I can't seem to make the tcom serial driver work - it doesn't receive
> >    any interrupts from the hardware.  I would appreciate it if someone
> >    more knowledgeable could help me out.  My only Internet connection
> >    depends on this driver.
> 
> "What is tcom?"  (Custom driver?)

tcom is my own driver for the Taiwanese TCOM serial cards (4 and 8 port).

> Obviously, this is a driver recompile, but assuming all else works and
> the driver seems to work otherwise, this could be an interrupt problem
> as well...

I used the ast driver when creating the tcom driver (the cards are
pretty much alike).  After posting this message I ran diff on tcom.c
and boca.c and noticed that boca.c was setting ca.ca_noien to 0
(ast.c sets it to 1) and "noien" sounds like "NO Interrupt ENable"
to me, so I changed it to 0 in tcom.c - and it worked.
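
For anyone hitting the same thing, this is how I understand the flag, as a
small stand-alone model.  It assumes "noien" decides whether the UART's OUT2
bit gets set in the modem control register; on PC-style hardware OUT2 gates
the port's interrupt line.  The struct and function names are invented for
illustration and are not the kernel's; only the 16550 bit values are standard.

```c
#include <stdint.h>

/* Standard 16550 modem control register bits. */
#define MCR_DTR  0x01
#define MCR_RTS  0x02
#define MCR_OUT2 0x08   /* on PC-style hardware this gates the IRQ line */

/* Hypothetical stand-in for the attach arguments a multiport front end
 * hands to the per-port serial driver. */
struct fake_attach_args {
    int noien;          /* nonzero: leave the interrupt-enable bit clear */
};

/* Compute the MCR value the port would start with (assumed behaviour). */
static uint8_t initial_mcr(const struct fake_attach_args *ca)
{
    uint8_t mcr = MCR_DTR | MCR_RTS;

    if (!ca->noien)
        mcr |= MCR_OUT2;    /* let the UART drive its interrupt line */
    return mcr;
}
```

With noien set to 1 the model never sets OUT2, so a card that relies on that
bit stays silent, which matches what I saw before the change.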

> What was the last version of the kernel that it worked with?  What
> modifications had to be made to make it compile with 1.3?

I started all over again in 1.3 because the driver is pretty simple
and the bus access stuff had changed so much in 1.3.

Anyway, the driver seems to work now.  I haven't loaded it too much yet,
but it didn't receive any interrupts before and now it does, so I'm
happy for now..

Well, amanda ceased working.  Don't know if something in dump has
changed and broken amanda... :-I

Thanks to everyone who helped me.  If someone needs the tcom driver,
just drop me a note..

  -jm