Subject: Re: netbsd-1-6 branch vs. recent esp(4) fixes....
From: Brian Buhrow
Date: 10/24/2002 10:57:04
	Hello Greg.  This is a bit off-topic, but whenever I've seen
external drives start dropping on and off-line under heavy load, it has
usually been because the power supply in the external enclosure is failing.
Apparently, the seeks and read/write activity can drive the electrical
load high enough that a marginal supply lets the voltages sag to levels
at which the drive resets itself.  Have you verified that you're not about to
lose a power supply in this enclosure?
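
If it helps to check whether the dropouts really track load, here is a
rough Python sketch that counts esp(4) timeout messages per device in a
saved log.  The regular expression is modeled only on the single "timed out"
line quoted below, the reading of the sd1(esp0:0:1:0) prefix as
adapter:channel:target:lun is my assumption, and count_timeouts is a
made-up helper name, not anything that ships with NetBSD.

```python
import re
from collections import Counter

# Pattern modeled on the one sample line in this thread (an assumption,
# not an exhaustive description of what the driver can print):
#   sd1(esp0:0:1:0): esp0: timed out [ecb 0x... ]
TIMEOUT_RE = re.compile(
    r"(?P<dev>\w+)\((?P<adapter>\w+):(?P<chan>\d+):(?P<target>\d+):(?P<lun>\d+)\):"
    r"\s+\w+: timed out"
)


def count_timeouts(lines):
    """Return a Counter mapping (device, target) to the number of timeout lines."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            counts[(m.group("dev"), int(m.group("target")))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "sd1(esp0:0:1:0): esp0: timed out [ecb 0xf093e388 (flags 0x1, "
        "dleft 800, stat 0)], <state 1, nexus 0x0, phase(l 10, c 100, p 3), "
        "resid 800, msg(q 0,o 0) >",
        "some unrelated kernel message",
    ]
    for (dev, target), n in count_timeouts(sample).items():
        print(dev, "target", target, "timeouts:", n)
```

Feed it the lines from wherever your kernel messages get logged and compare
the counts against the times you were untarring or linking.  If the timeouts
only ever show up for the drive in the external box, that would at least be
consistent with an enclosure problem rather than a driver one.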
On Oct 24,  4:55pm, Greg A. Woods wrote:
} Subject: Re: netbsd-1-6 branch vs. recent esp(4) fixes....
} [ On Thursday, October 24, 2002 at 13:13:45 (+0200), Martin Husemann wrote: ]
} > Subject: Re: netbsd-1-6 branch vs. recent esp(4) fixes....
} >
} > On Wed, Oct 23, 2002 at 06:58:41PM -0400, Greg A. Woods wrote:
} > 
} > > Has there been enough experience yet to know if the more recent fixes to
} > > esp(4) (i.e. sys/dev/ic/ncr53c9x.c et al) are well enough tested to be
} > > pulled up to the netbsd-1-6 branch yet?
} > 
} > As a data point:
} > 
} > I've been using my U2 with tagged queuing enabled (and one single not
} > yet committed patch from Andrey) for about a week now, beating on it
} > pretty hard. This setup always lost when tagged queuing was enabled before,
} > but now it did survive.
} Yes, I'm fairly happy with the tagged queuing in the esp(4) driver too
} on a sparc-20 clone with a pair of drives which both support tagged
} queuing.
} It's just that I have this less than perfectly reliable ST32430N (with
} version 0510 firmware) in an external box which I use for the local
} /var/obj and /var/packages-obj and it keeps taking itself offline when
} under really heavy load (untarring big archives with big files, linking
} really big programs such as the kernel, installing really big packages
} such as mozilla, etc.):
} sd1(esp0:0:1:0): esp0: timed out [ecb 0xf093e388 (flags 0x1, dleft 800, stat 0)], <state 1, nexus 0x0, phase(l 10, c 100, p 3), resid 800, msg(q 0,o 0) >
} and of course then without the bus reset fixes it won't ever be coerced
} back online from the driver's point of view even though it'll re-probe
} just fine from OFW.
} Unfortunately when a drive goes offline like this with a process holding
} files open on both it and the other drive then you can't do anything at
} all with either drive (e.g. unmount filesystems, etc.) so you just have
} to drop to OFW and reboot and hope for the best.  That's why I'm hoping
} the bus reset fixes let me bring the drive back online relatively
} cleanly.
} (I'm also hoping the bus reset fixes will allow me to hot-swap an SCA
} drive in an SS5/SS10/SS20 so that I can build a more reliable server
} using RAIDframe on the root drives.  Currently doing that has dire
} consequences for the driver just like having a drive go offline.)
} (maybe the real problem is that the ST32430N does have a tagged queuing
} bug, but if so then it's not a complete failure -- just a rare one.... ;-)
Greg A. Woods
>-- End of excerpt from Greg A. Woods