Subject: Instability in -current w.r.t. warm boots?
To: None <port-i386@NetBSD.ORG>
From: Greg Earle <earle@isolar.Tujunga.CA.US>
List: port-i386
Date: 06/06/1996 13:45:33
I'm running a -current 1.2_ALPHA (as of a week ago) on a Pentium 120 with an
Adaptec 2940.

It appears that the machine can no longer reliably warm boot anymore.  The
machine has 2 2.1 Gb SCSI drives, one used for DOS/Windoze and the other for
NetBSD.  I have a crontab entry to reboot the system (default boot is to DOS)
at 7 AM each morning, and after everyone's gone home at night, I reboot it
manually into NetBSD for the overnight (so the daily & sup et al. runs occur).

Anyway, the symptom is that the kernel probes the 2 internal disk drives but
when it goes to probe the outboard Exabyte 8200 and Pioneer DRM-640 jukebox,
it hangs out to dry, complaining it can't talk to the Adaptec board.  The red
disk activity (i.e., hooked to the Adaptec) LED on the front of the computer
stays on all the time.  Eventually it gives up and dies with an "integer
divide fault trap".

Perhaps some of the bootup messages will help explain it better.

Here's the SCSI segment of a normal bootup, which pretty much always works as
expected if I power-cycle the PC and leave it off for a few seconds:

NetBSD 1.2_ALPHA (SCIPIO) #0: Fri May 31 06:14:39 PDT 1996
    root@scipio.jpl.nasa.gov:/usr/src/sys/arch/i386/compile/SCIPIO
CPU: Pentium (GenuineIntel 586-class CPU)
real mem  = 16384000
avail mem = 13774848
using 225 buffers containing 921600 bytes of memory
[...]
ahc0 at pci0 dev 5 function 0
ahc0: BurstLen = 4DWDs, Latency Timer = 64PCLKS
ahc0: interrupting at irq 11
ahc0: Reading SEEPROM...done.
ahc0: aic7870 Single Channel, SCSI Id=7, 16 SCBs
ahc0: Reseting Channel A
ahc0: Downloading Sequencer Program...Done
scsibus0 at ahc0
ahc0: target 0 synchronous at 10.0MHz, offset = 0xf
sd0 at scsibus0 targ 0 lun 0: <CONNER, CFP2107S  2.14GB, 2B4B> SCSI2 0/direct fi
xed
sd0: 2048MB, 3999 cyl, 10 head, 104 sec, 512 bytes/sec
ahc0: target 1 synchronous at 10.0MHz, offset = 0xf
sd1 at scsibus0 targ 1 lun 0: <SEAGATE, ST32430N, 0300> SCSI2 0/direct fixed
sd1: 2049MB, 3992 cyl, 9 head, 116 sec, 512 bytes/sec
ahc0:A:5: refuses syncronous negotiation.  Using asyncronous transfers
probe(ahc0:5:0): Target Busy
probe(ahc0:5:0): Target Busy
st0 at scsibus0 targ 5 lun 0: <EXABYTE, EXB-8200, 253X> SCSI1 1/sequential remov
able
st0: st0(ahc0:5:0): Target Busy
st0(ahc0:5:0): Target Busy
st0(ahc0:5:0): Target Busy
drive empty
ahc0: target 6 synchronous at 5.0MHz, offset = 0xf
cd0 at scsibus0 targ 6 lun 0: <PIONEER, CD-ROM DRM-600, 2403> SCSI1 5/cdrom remo
vable
cd1 at scsibus0 targ 6 lun 1: <PIONEER, CD-ROM DRM-600, 2403> SCSI1 5/cdrom remo
vable
cd2 at scsibus0 targ 6 lun 2: <PIONEER, CD-ROM DRM-600, 2403> SCSI1 5/cdrom remo
vable
cd3 at scsibus0 targ 6 lun 3: <PIONEER, CD-ROM DRM-600, 2403> SCSI1 5/cdrom remo
vable
cd4 at scsibus0 targ 6 lun 4: <PIONEER, CD-ROM DRM-600, 2403> SCSI1 5/cdrom remo
vable
cd5 at scsibus0 targ 6 lun 5: <PIONEER, CD-ROM DRM-600, 2403> SCSI1 5/cdrom remo
vable

Other than the somewhat annoying "Target Busy" messages, all is well.

Now, here's what (usually) happens if I try a warm (re)boot:

NetBSD 1.2_ALPHA (SCIPIO) #0: Fri May 31 06:14:39 PDT 1996
    root@scipio.jpl.nasa.gov:/usr/src/sys/arch/i386/compile/SCIPIO
CPU: Pentium (GenuineIntel 586-class CPU)
real mem  = 16384000
avail mem = 13774848
using 225 buffers containing 921600 bytes of memory
[...]
ahc0 at pci0 dev 5 function 0
ahc0: BurstLen = 4DWDs, Latency Timer = 64PCLKS
ahc0: interrupting at irq 11
ahc0: Reading SEEPROM...done.
ahc0: aic7870 Single Channel, SCSI Id=7, 16 SCBs
ahc0: Reseting Channel A
ahc0: Downloading Sequencer Program...Done
scsibus0 at ahc0
ahc0: target 0 synchronous at 10.0MHz, offset = 0xf
sd0 at scsibus0 targ 0 lun 0: <CONNER, CFP2107S  2.14GB, 2B4B> SCSI2 0/direct fi
xed
sd0: 2048MB, 3999 cyl, 10 head, 104 sec, 512 bytes/sec
ahc0: target 1 synchronous at 10.0MHz, offset = 0xf
sd1 at scsibus0 targ 1 lun 0: <SEAGATE, ST32430N, 0300> SCSI2 0/direct fixed
sd1: 2049MB, 3992 cyl, 9 head, 116 sec, 512 bytes/sec
ahc0: board not responding	<===== /* Everything fine up to here */
cmd fail
probe(ahc0:5:0): Timed out in message out phase, SCSISIGI == 0xf6
probe(ahc0:5:0): asserted ATN - device reset in message buffer
ahc0: board not responding
cmd fail
sd2 at scsibus0 targ 5 lun 0: <, , > SCSI0 0/direct fixed
sd2: ahc0: board not responding
ahc0: board not responding
cmd fail
ahc0: board not responding
cmd fail
kernel: integer divide fault trap, code=0
Stopped at      _sdattach+0xd9: divl    %esi,%eax
db> 

Notice the difference after the sd1 probe has completed.  The drive activity
LED goes on and stays stuck on, as mentioned.  At first I thought that DOS
was doing something weird and leaving things in a bad state, but after a cold
boot (worked OK), a warm boot back into NetBSD (worked OK again), and then
*another* warm boot back into NetBSD (hung as per above), I'm convinced it's
neither DOS's fault nor the Exabyte's fault (before this most recent kernel,
the Exabyte has never given a problem during probe/attach, modulo the "Device
Busy" bit).

What I can't tell is whether the hardware (the Adaptec, or ???) is really going
flaky, or whether it's something in the recent SCSI code (what with all the
ahc* jumbling that's been going on) that's causing this to happen.  Any
comments, Justin?  Anyone?

Just for pedantry here's the trace from the fault trap:

db> trace
_sdattach(f86e9000, f86f4600, f8224ca4, f86eccb0, f8224ca4) at _sdattach + 0xd9
_config_attach(...) at _config_attach + 0x10c
_scsi_probedev(...) at _scsi_probdev + 0x1a4
_scsi_probe_bus(...) at _scsi_probe_bus + 0x9e
_scsibusattach(...) at _scsibusattach + 0x32
_config_attach(...) at _config_attach + 0x10c
_config_found_sm(...) at ...
_ahc_attach(...) at ...
_ahc_pci_attach(...) at ...
_config_attach(...) at ...
_config_found_sm(...) at ...
_pciattach(...) at ...
_config_attach(...) at ...
_config_found_sm(...) at ...
_mainbus_attach(...) at ...
_config_attach(...) at ...
_config_rootfound(...) at ...
_configure(...) at _configure + 0x14
_cpu_startup(...) at _cpu_startup + 0x282
_main() at _main() + 0x40

Thanks for any insights.

	- Greg

P.S. Please respond/cc: to me directly as I'm not on "port-i386" ...