I have a box running xen3/amd64 with two 750G SATA drives in RAID-1 via raidframe. It's running 4.99.72 and had been completely stable, but it crashed Thursday and again last night. I see a number of errors that look like the first block below; every single one of them is on wd1, and they are mostly writes with some reads.

Jul 3 02:05:29 foo /netbsd: piixide1:0:0: lost interrupt
Jul 3 02:05:29 foo /netbsd: type: ata tc_bcount: 2048 tc_skip: 0
Jul 3 02:05:29 foo /netbsd: piixide1:0:0: bus-master DMA error: missing interrupt, status=0x21
Jul 3 02:05:29 foo /netbsd: piixide1:0:0: device timeout, c_bcount=2048, c_skip0
Jul 3 02:05:29 foo /netbsd: wd1a: device timeout reading fsbn 1089123756 of 1089123756-1089123759 (wd1 bn 1089123819; cn 1080479 tn 15 sn 42), retrying
Jul 3 03:35:54 foo /netbsd: piixide1:0:0: lost interrupt
Jul 3 03:35:54 foo /netbsd: type: ata tc_bcount: 16384 tc_skip: 0
Jul 3 03:35:54 foo /netbsd: piixide1:0:0: bus-master DMA error: missing interrupt, status=0x21
Jul 3 03:35:54 foo /netbsd: piixide1:0:0: device timeout, c_bcount=16384, c_skip0
Jul 3 03:35:54 foo /netbsd: wd1a: device timeout writing fsbn 1079719072 of 1079719072-1079719103 (wd1 bn 1079719135; cn 1071149 tn 14 sn 61), retrying
Jul 3 03:35:54 foo /netbsd: wd1: soft error (corrected)
Jul 4 02:05:36 foo /netbsd: piixide1:0:0: lost interrupt
Jul 4 02:05:36 foo /netbsd: type: ata tc_bcount: 16384 tc_skip: 0
Jul 4 02:05:36 foo /netbsd: piixide1:0:0: bus-master DMA error: missing interrupt, status=0x21
Jul 4 02:05:36 foo /netbsd: piixide1:0:0: device timeout, c_bcount=16384, c_skip0
Jul 4 02:05:36 foo /netbsd: wd1a: device timeout writing fsbn 1077487808 of 1077487808-1077487839 (wd1 bn 1077487871; cn 1068936 tn 6 sn 5), retrying
Jul 4 02:05:37 foo /netbsd: wd1: soft error (corrected)
Jul 4 02:49:49 foo ntpd[701]: kernel time sync status change 6001
Jul 4 03:06:52 foo ntpd[701]: kernel time sync status change 2001
Jul 4 03:25:24 foo syslogd: restart
Jul 4 03:25:24 foo /netbsd: panic: kernel diagnostic assertion "(l->l_pflag & LP_INTR) == 0" failed: file "/n0/netbsd-current/src/sys/kern/kern_synch.c", line 189
Jul 4 03:25:24 foo /netbsd: Begin traceback...
Jul 4 03:25:24 foo /netbsd: copyright() at 0xffffffff808ad399
Jul 4 03:25:24 foo /netbsd: fatal page fault in supervisor mode
Jul 4 03:25:24 foo /netbsd: trap type 6 code 0 rip ffffffff804b9354 cs e030 rflags 10246 cr2 a8 cpl 6 rsp ffffa000127137a0
Jul 4 03:25:24 foo /netbsd: panic: trap
Jul 4 03:25:24 foo /netbsd: Faulted in mid-traceback; aborting...
Jul 4 03:25:24 foo /netbsd: dump to dev 18,1 not possible
Jul 4 03:25:24 foo /netbsd: panic: wdc_exec_command: polled command not done
Jul 4 03:25:24 foo /netbsd: Faulted in mid-traceback; aborting...
Jul 4 03:25:24 foo /netbsd: dump to dev 18,1 not possible
Jul 4 03:25:24 foo /netbsd: rebooting...
Jul 4 04:58:15 foo /netbsd: panic: kernel diagnostic assertion "(l->l_pflag & LP_INTR) == 0" failed: file "/n0/netbsd-current/src/sys/kern/kern_synch.c", line 189
Jul 4 04:58:15 foo /netbsd: Begin traceback...
Jul 4 04:58:15 foo /netbsd: copyright() at 0xffffffff808ad399
Jul 4 04:58:15 foo /netbsd: fatal page fault in supervisor mode
Jul 4 04:58:15 foo /netbsd: trap type 6 code 0 rip ffffffff804b9354 cs e030 rflags 10246 cr2 a8 cpl 6 rsp ffffa000127137a0
Jul 4 04:58:15 foo /netbsd: panic: trap
Jul 4 04:58:15 foo /netbsd: Faulted in mid-traceback; aborting...
Jul 4 04:58:15 foo /netbsd: dump to dev 18,1 not possible
Jul 4 04:58:15 foo /netbsd: panic: wdc_exec_command: polled command not done
Jul 4 04:58:15 foo /netbsd: Faulted in mid-traceback; aborting...
Jul 4 04:58:15 foo /netbsd: dump to dev 18,1 not possible
Jul 4 04:58:15 foo /netbsd: rebooting...

I have a new disk on order, but I think there's more wrong. I know I need to upgrade to 5.0. The disk seems to be having problems and taking too long to complete operations. I've never been clear on how long we wait for disks, or whether there's a spec for how long disk operations are allowed to take (assuming a healthy controller and a failing platter, which is what I think I have). If our error path resets the disk and retries, giving up prematurely is probably ok. In the log above, it seems that sometimes there is recovery and sometimes a panic. I don't know whether raidframe is being ungraceful when odd things happen on a component, but it really doesn't seem like raidframe is at fault. I also don't think xen plays a role here. So besides fixing the disk and upgrading, any advice?
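To be concrete about the behavior I'm assuming (or hoping) the error path has, here is a toy sketch of the bounded timeout + reset + retry pattern. This is not the actual wdc/ata code; all names, constants, and the fake "drive" are invented for illustration:

    /*
     * Sketch only: bounded per-attempt timeout, reset the channel on a
     * timeout, retry a few times, then fail the I/O upward so the RAID
     * layer can fail the component rather than the machine panicking.
     * Everything here is made up for illustration.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define CMD_TIMEOUT_MS  10000   /* how long to wait per attempt (invented) */
    #define CMD_MAX_RETRIES 5       /* reset+retry budget (invented) */

    /*
     * Stand-in for issuing a command and waiting up to 'ms' for the
     * interrupt; here it fails the first two attempts to mimic a flaky drive.
     */
    static bool
    issue_and_wait(int unit, int ms, int attempt)
    {
            (void)unit; (void)ms;
            return attempt >= 2;
    }

    /* Stand-in for resetting the channel after a timeout. */
    static void
    reset_channel(int unit)
    {
            printf("wd%d: resetting channel\n", unit);
    }

    static int
    exec_with_recovery(int unit)
    {
            for (int attempt = 0; attempt <= CMD_MAX_RETRIES; attempt++) {
                    if (issue_and_wait(unit, CMD_TIMEOUT_MS, attempt)) {
                            if (attempt > 0)
                                    printf("wd%d: soft error (corrected)\n", unit);
                            return 0;       /* completed, possibly after retries */
                    }
                    printf("wd%d: device timeout, retrying\n", unit);
                    reset_channel(unit);
            }
            /* Out of retries: return an error and let the caller (RAID) cope. */
            return -1;
    }

    int
    main(void)
    {
            return exec_with_recovery(1) == 0 ? 0 : 1;
    }

What I don't know is what the real equivalents of CMD_TIMEOUT_MS and CMD_MAX_RETRIES are in our ata code, and whether the give-up case actually propagates cleanly to raidframe, or whether that's roughly where the panics are coming from.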