netbsd-users: Compaq "Smart" Array controllers

Subject: Compaq "Smart" Array controllers
To: None <netbsd-users@netbsd.org>
From: None <netbsd-ml@amalgam.dyndns.org>
List: netbsd-users
Date: 06/02/2003 13:01:34
My apologies for the length, but I wanted to be thorough in my report.

A report of this problem was originally posted to current-users, but
I have since discovered that it is not limited to one machine, and that
it happens on both the 1.6.1 release and current.

I have been getting the following errors under even moderately heavy 
loads on two servers I built recently:

ld0f: error writing fsbn 90048 of 90048-90063 (ld0 bn 5597760; cn 1388 tn 21 sn 21)
ld0f: error writing fsbn 90048 of 90048-90063 (ld0 bn 5597760; cn 1388 tn 21 sn 21)
ld0: dk_busy < 0
panic: disk_unbusy
stopped in pid 227 (tar) at    cpu_debugger+0x4:      leave
stopped in pid 227 (tar) at    cpu_debugger+0x5:      ret
stopped in pid 227 (tar) at    panic+0xad:    jmp   panic+0x118
stopped in pid 227 (tar) at    panic+0x118:   addl   $-0x8,%esp
stopped in pid 227 (tar) at    panic+0x11b:   pushl  $0

Obviously, the addresses and block numbers change each time, but this is
pretty much the signature of the crash.
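
For anyone who wants to cross-check the numbers in the error line, here
is a minimal sketch in Python. It is not part of the kernel or the cac
driver. The geometry of 63 sectors per track and 64 tracks per cylinder
is an assumption inferred from the numbers in the message, not read from
the disklabel, and the names parse_diskerr and chs_to_block are
hypothetical helpers made up for this illustration.

```python
import re

# Assumed geometry, inferred from the message itself (NOT from a disklabel).
SECTORS_PER_TRACK = 63
TRACKS_PER_CYLINDER = 64

# Matches lines of the form shown in the report above.
DISKERR = re.compile(
    r"(?P<part>\w+): error (?P<op>reading|writing) fsbn (?P<fsbn>\d+)"
    r"(?: of (?P<first>\d+)-(?P<last>\d+))?"
    r" \((?P<disk>\w+) bn (?P<bn>\d+); cn (?P<cn>\d+) tn (?P<tn>\d+) sn (?P<sn>\d+)\)"
)


def parse_diskerr(line):
    """Split a disk error line into its fields; numeric fields become ints."""
    m = DISKERR.search(line)
    if m is None:
        raise ValueError("not a disk error line")
    return {
        key: int(val) if val is not None and val.isdigit() else val
        for key, val in m.groupdict().items()
    }


def chs_to_block(cn, tn, sn, spt=SECTORS_PER_TRACK, tpc=TRACKS_PER_CYLINDER):
    """Convert cylinder/track/sector to an absolute block number."""
    return (cn * tpc + tn) * spt + sn


line = ("ld0f: error writing fsbn 90048 of 90048-90063 "
        "(ld0 bn 5597760; cn 1388 tn 21 sn 21)")
err = parse_diskerr(line)

# Under the assumed geometry the cn/tn/sn triple reproduces the bn value.
print(chs_to_block(err["cn"], err["tn"], err["sn"]) == err["bn"])  # True

# If fsbn is relative to the partition, this difference would be the
# start offset of partition f on ld0.
print(err["bn"] - err["fsbn"])  # 5507712
```

Under these assumptions the fields agree with each other, so the message
itself looks well formed and the failing transfer is the 16-block range
90048-90063 on ld0f.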

The other panic I have seen under the same conditions is:
panic: biodone already


Hardware
Server 1:
Compaq Proliant 1850R (PIII 600)  128MB RAM
Compaq Smart Array 3200
4 x 9.1 GB Ultra2 SCSI HD (Tried RAID 0+1 and also RAID 5)

Server 2:
Compaq Proliant 1600 (PII 450)  128MB RAM
Compaq Smart Array 2/SL
5 x 4.3 GB Ultra2 SCSI HD (Both RAID 0+1 and RAID 5 have been tried)

I have tried these servers with:
	- 1.6.1 and current
	- Array acceleration enabled and disabled

But, seemingly regardless of what I try, under any moderate disk activity
the above errors pop up, and the server folds.
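
As a rough illustration of what "moderate disk activity" means here, the
following Python sketch imitates the mix visible in the ps listing at the
end of this message: two tar/gzip jobs running at the same time as an rm.
It is not the set of commands actually run on these servers. The function
names, file counts and sizes are all made up for the example, and whether
it triggers the panic on any given machine is untested.

```python
import os
import shutil
import tarfile
import tempfile
import threading


def make_tree(root, files=50, size=64 * 1024):
    """Create a directory filled with random files to read or delete."""
    os.makedirs(root)
    for i in range(files):
        with open(os.path.join(root, f"f{i:04d}"), "wb") as fh:
            fh.write(os.urandom(size))


def archive(src, dest):
    """Write a gzip-compressed tar of src, roughly like tar piped to gzip."""
    with tarfile.open(dest, "w:gz") as tf:
        tf.add(src, arcname=os.path.basename(src))


def run_load(workdir, archivers=2, files=50, size=64 * 1024):
    """Run several archive jobs and one recursive delete concurrently."""
    src = os.path.join(workdir, "src")
    doomed = os.path.join(workdir, "doomed")
    make_tree(src, files, size)
    make_tree(doomed, files, size)
    outputs = [os.path.join(workdir, f"out{i}.tar.gz") for i in range(archivers)]
    threads = [threading.Thread(target=archive, args=(src, out)) for out in outputs]
    threads.append(threading.Thread(target=shutil.rmtree, args=(doomed,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs


if __name__ == "__main__":
    # Point the temporary directory at a filesystem on the array under test.
    with tempfile.TemporaryDirectory() as workdir:
        print(run_load(workdir))
```

To put real pressure on the array, the file count and size would need to
be raised well beyond these defaults so the data does not fit in the
buffer cache.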

I have one other server, a Proliant 2500 + Smart 2/DH, that has not
had any problems since installation last week, so I do not think it
is my installation approach[2], but I am open to any suggestions.  

So far, my searching has only turned up a similar problem [1] with
a Mylex RAID card, but the cause of that problem is supposed to be
in the mlx.c driver, not the cac driver I am using.

Can anyone offer any enlightenment?  Is this my mistake, or a bug?

Trace and ps output are attached below.


Michael


[1] http://mail-index.netbsd.org/current-users/2003/05/03/0003.html

[2] To be completely fair, this server does not see much disk activity,
so it could conceivably have the same problem and simply not have been
loaded heavily enough to show it yet.

Trace and PS output after crash:

panic: biodone already
Stopped at    cpu_Debugger+0x4:		leave
db> trace
cpu_Debugger(c4aa9488,6,ca9a2e40,c017b4f6,c3aa9488) at cpu_Debugger+0x4
panic(c0546622,c0a26200,c0a262b0,c0793ddc,c3aa9488) at panic+0xb8
biodone(c3aa9488,2000,100000,c0793ddc,c0a26200) at biodone+0x35
lddone(c3aa9488,c0793e08,c01b2c9c,c3aa9488) at lddone+0x05
ld_cac_done(c0a26200,c3aa9488,0,c01b2b2a,c09dda00) at ld_cac_done+0xc5
cac_ccb_done(c09dda00,ca9a2e40,c0793e68,0,c0a23e40) at cac_ccb_done+0x9f
cac_intr(c09dda00,0,c0790010,30,c0100010) at cac_intr+0x2a
Xintr_legacy10() at Xintr_legacy10+0xa8
--- interrupt ---
mpidle(c06d9560,0,c0793f6c,0,80000000) at mpidle
ltsleep(c06d93a0,4,c054de46,0,0) at ltsleep+0x207
uvm_scheduler(c078f010,78f000,798000,0,0) at uvm_scheduler+0x75
main(0,0,0,0,0) at main+0x69e
db> ps
PID    PPID   PGRP   UID  S  FLAGS    LWPS  COMMAND     WAIT
447    446    446    0    2  0x4002   1     gzip        pipdwt
446    364    446    0    2  0x4002   1     tar         biowait
380    409    409    0    2  0x4002   1     gzip        pipdwt
409    342    409    0    2  0x4002   1     tar         biowait
405    368    405    0    2  0x4002   1     rm          biowait
375    1      375    0    2  0x4002   1     getty       ttyin
364    1      364    0    2  0x4003   1     csh         pause
342    1      342    0    2  0x4003   1     csh         pause
368    1      368    0    2  0x4003   1     csh         pause
344    1      344    0    2  0        1     cron        nanosic
334    1      334    0    2  0        1     inetd       kqread
175    1      175    0    2  0        1     syslogd     biowait
125    1      125    0    2  0        1     dhclient    select
10     0      0      0    2  0x20200  1     aiodoned    aiodone
9      0      0      0    2  0x20200  1     ioflush
8      0      0      0    2  0x20200  1     reaper      reaper
7      0      0      0    2  0x20200  1     pagedaemon  pgdaemo
6      0      0      0    2  0x20200  1     ifs_writer  ifswrit
5      0      0      0    2  0x20200  1     pms0        pmsrese
4      0      0      0    2  0x20200  1     atapibus0   sccomp
3      0      0      0    2  0x20200  1     scsibus1    sccomp
1      0      1      0    2  0x4000   1     init        wait
0      -1     0      0    2  0x20200  1     swapper     schedule
db>