netbsd-help: where should I look first? (esiop error messages)

Subject: where should I look first? (esiop error messages)
To: None <netbsd-help@netbsd.org>
From: Jeff Rizzo <riz@boogers.sf.ca.us>
List: netbsd-help
Date: 02/27/2004 07:42:56
I've got a dual-processor i386 system running a -current that's about
a week old. I just did a cvs update on /usr/src and noticed that I got
some corrupted files.  Poking around a little more, I found a
bunch of SCSI errors in dmesg (and consequently in /var/log/messages).

This system ran stably for about a week; the first indication
of problems is in /var/log/messages from the /etc/daily run yesterday
morning:

Feb 25 18:00:10 slim syslogd: restart
Feb 26 03:17:36 slim /netbsd: sd0(esiop0:0:0:0): command timeout
Feb 26 03:17:36 slim /netbsd: esiop0: scsi bus reset
Feb 26 03:17:36 slim /netbsd: sd0(esiop0:0:0:0): command with tag id 0 reset
Feb 26 03:17:36 slim /netbsd: sd0(esiop0:0:0:0): command with tag id 1 reset
Feb 26 03:17:36 slim /netbsd: sd0(esiop0:0:0:0): command with tag id 2 reset
Feb 26 03:17:36 slim /netbsd: sd0(esiop0:0:0:0): command with tag id 3 reset
[snip ... lots of similar tag ids]
Feb 26 03:17:38 slim /netbsd: sd0(esiop0:0:0:0): command with tag id 125 reset
Feb 26 03:17:38 slim /netbsd: sd0(esiop0:0:0:0): command with tag id 126 reset
Feb 26 03:17:38 slim /netbsd: sd0(esiop0:0:0:0): command with tag id 127 reset
Feb 26 03:17:38 slim /netbsd: sd0: async, 8-bit transfers, tagged queueing
Feb 26 03:17:38 slim /netbsd: sd1: async, 8-bit transfers, tagged queueing
Feb 26 03:17:38 slim /netbsd: sd0: sync (25.00ns offset 31), 16-bit (80.000MB/s) transfers, tagged queueing
Feb 26 03:17:38 slim /netbsd: sd1: sync (25.00ns offset 31), 16-bit (80.000MB/s) transfers, tagged queueing
Feb 26 16:00:00 slim syslogd: restart

...then more errors later yesterday:

Feb 26 18:00:11 slim syslogd: restart
Feb 26 21:40:10 slim /netbsd: sd0(esiop0:0:0:0): request sense for a request sense ?
Feb 26 21:41:10 slim /netbsd: we read 32 bytes of sense anyway:
Feb 26 21:41:10 slim /netbsd:     SENSE KEY:  No Additional Sense
Feb 26 21:41:10 slim /netbsd: sd0(esiop0:0:0:0): request sense failed with error 22
Feb 26 21:41:10 slim /netbsd: sd0(esiop0:0:0:0): generic HBA error
Feb 26 21:41:10 slim /netbsd: sd0(esiop0:0:0:0): command timeout
Feb 26 21:41:10 slim /netbsd: esiop0: scsi bus reset
Feb 26 21:41:10 slim /netbsd: sd0(esiop0:0:0:0): command with tag id 0 reset
Feb 26 21:41:10 slim /netbsd: sd0(esiop0:0:0:0): command with tag id 1 reset
Feb 26 21:41:10 slim /netbsd: sd0(esiop0:0:0:0): command with tag id 2 reset
Feb 26 21:41:10 slim /netbsd: sd0(esiop0:0:0:0): command with tag id 3 reset
Feb 26 21:41:10 slim /netbsd: sd0(esiop0:0:0:0): command with tag id 4 reset
Feb 26 21:41:10 slim /netbsd: sd0: async, 8-bit transfers, tagged queueing
Feb 26 21:41:10 slim /netbsd: sd1: async, 8-bit transfers, tagged queueing
Feb 26 21:41:10 slim /netbsd: sd0: sync (25.00ns offset 31), 16-bit (80.000MB/s) transfers, tagged queueing
Feb 27 03:15:03 slim /netbsd: sd1: sync (25.00ns offset 31), 16-bit (80.000MB/s) transfers, tagged queueing


... and then several pages of command timeouts, bus resets, and generic HBA
errors this morning, when I did the CVS update that led me to discover this.
Obviously, there's a problem.  Is it more likely to be the disk, the 
controller (which is on the motherboard, unfortunately), or cabling?
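
To see whether the errors are getting more frequent, the kernel messages
quoted above could be tallied per day and per category. This is only a
rough sketch, assuming the syslog line format shown in this message; the
category names and the summarize() helper are made up for illustration and
are not part of any NetBSD tool.

```python
import re
import sys
from collections import Counter

# Categories are hypothetical labels; the patterns are copied from the
# kernel messages quoted in this post.
PATTERNS = [
    ("command timeout", re.compile(r"command timeout")),
    ("bus reset", re.compile(r"scsi bus reset")),
    ("tag reset", re.compile(r"command with tag id \d+ reset")),
    ("generic HBA error", re.compile(r"generic HBA error")),
    ("renegotiation", re.compile(r"sd\d+: (async|sync)")),
]

# Matches "Mon DD HH:MM:SS host /netbsd: message" as seen in the excerpts.
LINE_RE = re.compile(r"(\w{3} +\d+) \d\d:\d\d:\d\d \S+ /netbsd: (.*)")


def summarize(lines):
    """Count kernel SCSI messages per (day, category)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a kernel message (e.g. "syslogd: restart")
        day, msg = m.groups()
        for name, pattern in PATTERNS:
            if pattern.search(msg):
                counts[(day, name)] += 1
                break
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is given on the command line, e.g. /var/log/messages.
    with open(sys.argv[1], errors="replace") as fh:
        for (day, name), n in sorted(summarize(fh).items()):
            print(f"{day}  {name:20s} {n}")
```

A count that climbs from day to day would not by itself say whether the
disk, the controller, or the cabling is at fault, but it gives a baseline
to compare against after swapping a cable or terminator.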


Clues appreciated...

+j
-- 
Jeff Rizzo                                         http://boogers.sf.ca.us/~riz