Subject: Re: kernel crash and scsi-disk/hang
To: None <andrew@wipux2.wifo.uni-mannheim.de>
From: Charles M. Hannum <mycroft@ai.mit.edu>
List: current-users
Date: 01/22/1995 07:57:12
   probe(ncr0:0:0): 225ns (4 Mb/sec) offset 8.
   probe(ncr0:1:0): 200ns (5 Mb/sec) offset 8.
   sd1(ncr0:1:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
   st0(ncr0:5:0): 200ns (5 Mb/sec) offset 8.

Running both fast and slow devices on the same SCSI bus is a well-known
potential source of lossage.  Also, the SCSI-2 spec says:

  IMPLEMENTORS NOTE:  Use of single-ended drivers and receivers with the fast 
  synchronous data transfer option is not recommended.

and the cable requirements are much more strict for fast SCSI.

I'd suggest you try changing SCSI_NCR_MAX_SYNC (in
/sys/arch/i386/pci/ncr.c) to 5000 or 4000 and see if it helps.
(Regardless, this doesn't excuse the kernel crashing, but I expect
that's a bug in the error handling in the NCR driver.)
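
For reference, here is a rough sketch of how the suggested values relate to
the periods and rates printed in the dmesg lines above.  It assumes that
SCSI_NCR_MAX_SYNC is a frequency in kHz (check the comment next to the
define in ncr.c before relying on this) and that the bus is 8 bits wide, so
one transfer moves one byte.  The function names are made up for
illustration and are not part of the driver.

```python
def sync_period_ns(max_sync_khz):
    """Synchronous transfer period in ns for a frequency given in kHz.

    Assumption: 0 means asynchronous transfers, so there is no period.
    """
    if max_sync_khz == 0:
        return None
    return 1_000_000 // max_sync_khz


def rate_mb_per_sec(period_ns):
    """Whole megabytes per second on an 8-bit bus (one byte per period).

    Integer division is an assumption; it matches the dmesg lines,
    where 225ns is reported as 4 Mb/sec.
    """
    return 1000 // period_ns


for khz in (10000, 5000, 4000):
    period = sync_period_ns(khz)
    print(f"{khz} kHz -> {period}ns ({rate_mb_per_sec(period)} Mb/sec)")
```

Under these assumptions, 5000 corresponds to the 200ns (5 Mb/sec) that sd1
and st0 negotiate at probe time, and 4000 to 250ns, which is slower than
anything sd0 currently negotiates.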


[The following is a copy of the original message, for Wolfgang's
benefit.  Similar bugs have been reported recently.]

From: Andrew Wheadon <andrew@wipux2.wifo.uni-mannheim.de>
Date: Sun, 22 Jan 1995 11:57:32 +0100 (MET)
Content-Type: text
Content-Length: 7704      

I'm having quite a few kernel crashes recently (about 2-3 per week)
and the accounting always seems to be corrupted when people
are using irc or screen-3.6, so I presume it's crashing when they
are running these programs for some time. I find it a bit frustrating
that when I do pop in to see what has happened the kernel has 
usually rebooted instead of staying in the Debugger to tell me
where it happened. (I have DDB and DIAGNOSTIC options in my configfile)

Anyway, it sometimes seems to reboot while the slower of my two scsi-disks
(sd0) is being accessed; at least the light is still lit on both the
disk and the controller, while the other peripherals (sd1 and st0) are off.

This causes it to fail to find any scsi-peripherals and thus
ask for a boot-disk and then just wait until I come into work
and turn sd0 off and on again; after this it finds
them nicely on both a hard and a soft-reboot.

I can imagine that sd0 is being accessed too fast; at least
our Novell-server has the same problem when the speed is set
too high on its Adaptec.  But I don't know how to set the ncr
to different speeds for different devices, and it seems to do
it itself anyway.

SO:	How do I make the kernel stop in the debugger (forever)
	instead of rebooting?
	How do I set the speed for different devices on the
	scsi-bus?
	How do I cause the scsi-bus to be reset properly on
	reboot?
	How do I find out which program is crashing my system,
	or better still stop programs from being able to crash
	my system, without disabling user-access to the machine?

I get no coredump when the machine crashes anyway.

Here is dmesg:

NetBSD 1.0A (WIPUX) #0: Fri Jan 20 09:43:25 MET 1995
    toor@wipux2:/src/src/sys/arch/i386/compile/WIPUX
CPU: Pentium (GenuineIntel 586-class CPU)
real mem  = 33161216
avail mem = 29646848
using 430 buffers containing 1761280 bytes of memory
isa0 (root)
npx0 at isa0 port 0xf0-0xff: using exception 16
WARNING: Pentium FDIV bug detected!
vt0 at isa0 port 0x60-0x6f irq 1: et4000, 80/132 col, color, 8 scr, mf2-kbd, [R3.00]
com0 at isa0 port 0x3f8-0x3ff irq 4: ns82450 or ns16450, no fifo
com1 at isa0 port 0x2f8-0x2ff irq 3: ns82450 or ns16450, no fifo
lpt0 at isa0 port 0x378-0x37f: polled
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
fd0 at fdc0 drive 0: 1.44MB 80 cyl, 2 head, 18 sec
ed0 at isa0 port 0x300-0x31f iomem 0xdc000-0xdffff irq 10: address 00:00:c0:c0:43:a0, type SMC8216/SMC8216C (16-bit) aui
root device eisa not configured
pci0 (root): configuration mode 2
pci0 bus 0 device 0: identifier 04a38086 class 06000011 not configured
pci0 bus 0 device 2: identifier 04828086 class 00000003 not configured
ncr0 at pci0 bus 0 device 5
pci_map_mem: memory mapped at c0000000-c0000fff
pci_map_int: pin A mapped to line 11
ncr0: restart (scsi reset).
ncr0 scanning for targets 0..6 ($Revision: 1.10 $)
scsibus0 at ncr0
probe(ncr0:0:0): 225ns (4 Mb/sec) offset 8.
ncr0 targ 0 lun 0: <FUJITSU, M2266S-512, 0020> SCSI2 0/direct fixed
sd0 at scsibus0: 1029MB, 1658 cyl, 15 head, 84 sec, 512 bytes/sec
probe(ncr0:1:0): 200ns (5 Mb/sec) offset 8.
ncr0 targ 1 lun 0: <FUJITSU, M2694S-512, 0124> SCSI2 0/direct fixed
sd1 at scsibus0sd1(ncr0:1:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
: 1033MB, 1819 cyl, 15 head, 77 sec, 512 bytes/sec
ncr0 targ 5 lun 0: <HP, HP35470A, 1109> SCSI2 1/sequential removable
st0 at scsibus0: st0(ncr0:5:0): 200ns (5 Mb/sec) offset 8.
drive empty
biomask 840 netmask 41a ttymask 1a
changing root device to sd0a
sd1(ncr0:1:0): sd1(ncr0:1:0):
	^- these two messages appear while fsck is running.

[Config file omitted since it doesn't contain anything `interesting'.]