I have a box running xen3/amd64 with two 750G SATA drives in RAID-1 via raidframe. It's running 4.99.72 and had been completely stable, but it crashed Thursday and again last night. I see a number of errors that look like the first block below; every single one of them is on wd1, and they are mostly writes with some reads.

Jul 3 02:05:29 foo /netbsd: piixide1:0:0: lost interrupt
Jul 3 02:05:29 foo /netbsd: type: ata tc_bcount: 2048 tc_skip: 0
Jul 3 02:05:29 foo /netbsd: piixide1:0:0: bus-master DMA error: missing interrupt, status=0x21
Jul 3 02:05:29 foo /netbsd: piixide1:0:0: device timeout, c_bcount=2048, c_skip0
Jul 3 02:05:29 foo /netbsd: wd1a: device timeout reading fsbn 1089123756 of 1089123756-1089123759 (wd1 bn 1089123819; cn 1080479 tn 15 sn 42), retrying
Jul 3 03:35:54 foo /netbsd: piixide1:0:0: lost interrupt
Jul 3 03:35:54 foo /netbsd: type: ata tc_bcount: 16384 tc_skip: 0
Jul 3 03:35:54 foo /netbsd: piixide1:0:0: bus-master DMA error: missing interrupt, status=0x21
Jul 3 03:35:54 foo /netbsd: piixide1:0:0: device timeout, c_bcount=16384, c_skip0
Jul 3 03:35:54 foo /netbsd: wd1a: device timeout writing fsbn 1079719072 of 1079719072-1079719103 (wd1 bn 1079719135; cn 1071149 tn 14 sn 61), retrying
Jul 3 03:35:54 foo /netbsd: wd1: soft error (corrected)
Jul 4 02:05:36 foo /netbsd: piixide1:0:0: lost interrupt
Jul 4 02:05:36 foo /netbsd: type: ata tc_bcount: 16384 tc_skip: 0
Jul 4 02:05:36 foo /netbsd: piixide1:0:0: bus-master DMA error: missing interrupt, status=0x21
Jul 4 02:05:36 foo /netbsd: piixide1:0:0: device timeout, c_bcount=16384, c_skip0
Jul 4 02:05:36 foo /netbsd: wd1a: device timeout writing fsbn 1077487808 of 1077487808-1077487839 (wd1 bn 1077487871; cn 1068936 tn 6 sn 5), retrying
Jul 4 02:05:37 foo /netbsd: wd1: soft error (corrected)
Jul 4 02:49:49 foo ntpd[701]: kernel time sync status change 6001
Jul 4 03:06:52 foo ntpd[701]: kernel time sync status change 2001
Jul 4 03:25:24 foo syslogd: restart
Jul 4 03:25:24 foo /netbsd: panic: kernel diagnostic assertion "(l->l_pflag & LP_INTR) == 0" failed: file "/n0/netbsd-current/src/sys/kern/kern_synch.c", line 189
Jul 4 03:25:24 foo /netbsd: Begin traceback...
Jul 4 03:25:24 foo /netbsd: copyright() at 0xffffffff808ad399
Jul 4 03:25:24 foo /netbsd: fatal page fault in supervisor mode
Jul 4 03:25:24 foo /netbsd: trap type 6 code 0 rip ffffffff804b9354 cs e030 rflags 10246 cr2 a8 cpl 6 rsp ffffa000127137a0
Jul 4 03:25:24 foo /netbsd: panic: trap
Jul 4 03:25:24 foo /netbsd: Faulted in mid-traceback; aborting...
Jul 4 03:25:24 foo /netbsd: dump to dev 18,1 not possible
Jul 4 03:25:24 foo /netbsd: panic: wdc_exec_command: polled command not done
Jul 4 03:25:24 foo /netbsd: Faulted in mid-traceback; aborting...
Jul 4 03:25:24 foo /netbsd: dump to dev 18,1 not possible
Jul 4 03:25:24 foo /netbsd: rebooting...
Jul 4 04:58:15 foo /netbsd: panic: kernel diagnostic assertion "(l->l_pflag & LP_INTR) == 0" failed: file "/n0/netbsd-current/src/sys/kern/kern_synch.c", line 189
Jul 4 04:58:15 foo /netbsd: Begin traceback...
Jul 4 04:58:15 foo /netbsd: copyright() at 0xffffffff808ad399
Jul 4 04:58:15 foo /netbsd: fatal page fault in supervisor mode
Jul 4 04:58:15 foo /netbsd: trap type 6 code 0 rip ffffffff804b9354 cs e030 rflags 10246 cr2 a8 cpl 6 rsp ffffa000127137a0
Jul 4 04:58:15 foo /netbsd: panic: trap
Jul 4 04:58:15 foo /netbsd: Faulted in mid-traceback; aborting...
Jul 4 04:58:15 foo /netbsd: dump to dev 18,1 not possible
Jul 4 04:58:15 foo /netbsd: panic: wdc_exec_command: polled command not done
Jul 4 04:58:15 foo /netbsd: Faulted in mid-traceback; aborting...
Jul 4 04:58:15 foo /netbsd: dump to dev 18,1 not possible
Jul 4 04:58:15 foo /netbsd: rebooting...

I have a new disk on order, but I think there's more wrong. I know I need to upgrade to 5.0. The disk seems to be having problems and taking too long to complete operations. I've never been clear on how long we wait for disks, or whether there's a spec for how long disk operations are allowed to take (assuming a healthy controller and a failing platter, which is what I think I have). If our error path resets the disk and retries, giving up prematurely is probably ok. In the log above, it seems that sometimes there is recovery and sometimes a panic. I don't know whether raidframe is being ungraceful when odd things happen on a component, but it really doesn't seem like raidframe is at fault. I also don't think xen plays a role here. So besides fixing the disk and upgrading, any advice?
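To be concrete about the behavior I'm assuming (or hoping) the error path has, here is a toy sketch of the bounded timeout + reset + retry pattern. This is not the actual wdc/ata code; all names, constants, and the fake "drive" are invented for illustration:

    /*
     * Sketch only: bounded per-attempt timeout, reset the channel on a
     * timeout, retry a few times, then fail the I/O upward so the RAID
     * layer can fail the component rather than the machine panicking.
     * Everything here is made up for illustration.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define CMD_TIMEOUT_MS  10000   /* how long to wait per attempt (invented) */
    #define CMD_MAX_RETRIES 5       /* reset+retry budget (invented) */

    /*
     * Stand-in for issuing a command and waiting up to 'ms' for the
     * interrupt; here it fails the first two attempts to mimic a flaky drive.
     */
    static bool
    issue_and_wait(int unit, int ms, int attempt)
    {
            (void)unit; (void)ms;
            return attempt >= 2;
    }

    /* Stand-in for resetting the channel after a timeout. */
    static void
    reset_channel(int unit)
    {
            printf("wd%d: resetting channel\n", unit);
    }

    static int
    exec_with_recovery(int unit)
    {
            for (int attempt = 0; attempt <= CMD_MAX_RETRIES; attempt++) {
                    if (issue_and_wait(unit, CMD_TIMEOUT_MS, attempt)) {
                            if (attempt > 0)
                                    printf("wd%d: soft error (corrected)\n", unit);
                            return 0;       /* completed, possibly after retries */
                    }
                    printf("wd%d: device timeout, retrying\n", unit);
                    reset_channel(unit);
            }
            /* Out of retries: return an error and let the caller (RAID) cope. */
            return -1;
    }

    int
    main(void)
    {
            return exec_with_recovery(1) == 0 ? 0 : 1;
    }

What I don't know is what the real equivalents of CMD_TIMEOUT_MS and CMD_MAX_RETRIES are in our ata code, and whether the give-up case actually propagates cleanly to raidframe, or whether that's roughly where the panics are coming from.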