Subject: Re: wd drive in go-slow mode
To: Paul Ripke <stix@stix.homeunix.net>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: netbsd-users
Date: 04/27/2004 21:20:56
--T4sUOijqQbZv57TR
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Tue, Apr 27, 2004 at 04:39:51PM +1000, Paul Ripke wrote:
> I have a box with a three-disk RAIDframe RAID5 set. Two of the disks
> are exactly the same make/model, run off separate busses on the same
> Silicon Image controller. I have recently noticed that after a
> reported error:
> 
> Apr 15 18:24:48 stix-pc /netbsd: cmdide0:1:0: lost interrupt
> Apr 15 18:24:48 stix-pc /netbsd:        type: ata tc_bcount: 16384 tc_skip: 0
> Apr 15 18:24:48 stix-pc /netbsd: cmdide0:1:0: bus-master DMA error: missing interrupt, status=0x21
> Apr 15 18:24:48 stix-pc /netbsd: cmdide0:1:0: device timeout, c_bcount=16384, c_skip0
> Apr 15 18:24:48 stix-pc /netbsd: wd4a: device timeout writing fsbn 18427840 of 18427840-18427871 (wd4 bn 18427840; cn 18281 tn 9 sn 25), retrying
> Apr 15 18:24:51 stix-pc /netbsd: wd4: soft error (corrected)
> 
> the drive seems to be in go-slow mode:
> 
> ksh$ dd if=/dev/rwd3d of=/dev/null bs=64k count=16k
> 16384+0 records in
> 16384+0 records out
> 1073741824 bytes transferred in 30.988 secs (34650246 bytes/sec)
> ksh$ dd if=/dev/rwd4d of=/dev/null bs=64k count=16k
> 16384+0 records in
> 16384+0 records out
> 1073741824 bytes transferred in 129.220 secs (8309408 bytes/sec)
> 
> This seems to affect random I/O even more.

Hum, maybe this drive has a problem. If the transfer mode had been changed,
a message would have been printed before the "soft error (corrected)".

BTW, you may want to try the attached program, which tests the speed of the
bus by looping on data already in the drive's cache.

Usage: ./tst /dev/rwd4d 10000
(or more if 10000 is too low to give accurate results).

> 
> There are no recent changes in the relevant source files between
> the running kernel (1.6ZI) and 2.0_BETA. Just wondering if anyone
> has seen this or something similar - and what to do about it. As
> yet, I have only rebooted the box. Shortly after reboot, it
> generated a similar error again, with the same consequences.
> I'll shortly be pulling the top off, and checking the cabling.
> Might even switch wd3/4 (joys of RAIDframe :) ) and see what happens.
> BTW: is there a way to interrogate the current drive transfer mode?

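The mode in use is printed when the drive attaches, so check dmesg. To force a
particular mode, set the flags in the kernel config file. See wd(4).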

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference
--

--T4sUOijqQbZv57TR
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="tst.c"

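#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>

/*
 * Read the first 64 KB of the device over and over. After the first
 * pass the data should come from the drive's cache, so the rate
 * reflects the bus rather than the media.
 */
int
main(int argc, char **argv)
{
	static char buf[64*1024];
	int fd, i, count;
	struct timeval tv0, tv1;
	double t;	/* elapsed time, in microseconds */

	if (argc != 3 || (count = atoi(argv[2])) <= 0) {
		fprintf(stderr, "usage: %s device count\n", argv[0]);
		exit(1);
	}
	fd = open(argv[1], O_RDONLY, 0);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	if (gettimeofday(&tv0, NULL) < 0) {
		perror("gettimeofday");
		exit(1);
	}
	for (i = 0; i < count; i++) {
		if (read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("read");
			exit(1);
		}
		/* Rewind so the next read hits the same blocks. */
		if (lseek(fd, 0, SEEK_SET) < 0) {
			perror("seek");
			exit(1);
		}
	}
	if (gettimeofday(&tv1, NULL) < 0) {
		perror("gettimeofday");
		exit(1);
	}
	/* Use a double so long runs do not overflow a 32-bit int. */
	t = (double)(tv1.tv_sec - tv0.tv_sec) * 1000000;
	t = t + tv1.tv_usec - tv0.tv_usec;
	/* 64 KB per read, so 64 * i / 1024 is the total in MB. */
	printf("%.0f us, %f MB/s\n", t,
	    ((double)64 * (double)i / 1024) / (t / 1000000));
	exit(0);
}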

--T4sUOijqQbZv57TR--