Subject: bootable RAID-1 array problems
To: None <port-alpha@netbsd.org>
From: Ray Phillips <r.phillips@jkmrc.com>
List: port-alpha
Date: 08/20/2004 12:44:54
Recently I've tried, without success, to set up a RAID-1 array
following the instructions at http://www.netbsd.org/guide/en/chap-rf.html.
My first attempt used a -current system built from CVS sources updated
on 26 July; my second used one built from an update done on 17 August.

The same procedure worked on an i386 machine (using the 26 July
sources), so I wonder whether there's an alpha-specific problem or
(more likely) whether I've done something wrong.

Things seem to go OK until

# raidctl -F component0 raid0

is executed in section 23.8, "The first boot with RAID-1", at which
point the machine (a PWS 500 with 1 GB RAM) panics into the kernel
debugger.  I've tried this with both a pair of identical SCSI disks and
a pair of identical IDE disks.  The console output for the SCSI pair at
the point of the crash was:

RECON: initiating reconstruction on col 0 -> spare at col 2
sd1(isp0:0:2:0):  Check Condition on CDB: 0x08 00 10 40 80 00
     SENSE KEY:  Hardware Error
      ASC/ASCQ:  ASC 0x44 ASCQ 0x9d

raid0: IO Error.  Marking /dev/sd1a as failed.
raid0: Recon read failed!
panic: raidframe error at line 880 file /usr/src/sys/dev/raidframe/rf_reconstruc
Stopped in pid 504.1 (raid_recon) at    netbsd:cpu_Debugger+0x4:    ret    zero,(ra)
db>

and for the IDE pair:

Aug 19 17:11:41 www /netbsd: stray isa irq 14
Warning: truncating spare disk /dev/wd0a to 4127616 blocks
Aug 19 17:12:50 www su: ray to root on /dev/ttyp1
RECON: initiating reconstruction on col 0 -> spare at col 2
wd1a: error reading fsbn 1031488 of 1031488-1031615 (wd1 bn 1031488; cn 1023 tng
wd1: (uncorrectable data error)
wd0a: channel reset writing fsbn 1031232 of 1031232-1031359 (wd0 bn 1031232; cng
wd1a: error reading fsbn 1031488 of 1031488-1031615 (wd1 bn 1031488; cn 1023 tng
wd1: (uncorrectable data error)
wd0a: channel reset writing fsbn 1031232 of 1031232-1031359 (wd0 bn 1031232; cng
wd1a: error reading fsbn 1031488 of 1031488-1031615 (wd1 bn 1031488; cn 1023 tng
wd1: (uncorrectable data error)
wd0a: channel reset writing fsbn 1031232 of 1031232-1031359 (wd0 bn 1031232; cng
wd1a: error reading fsbn 1031488 of 1031488-1031615 (wd1 bn 1031488; cn 1023 tng
wd1: (uncorrectable data error)
wd0a: channel reset writing fsbn 1031232 of 1031232-1031359 (wd0 bn 1031232; cng
wd1a: error reading fsbn 1031581 of 1031488-1031615 (wd1 bn 1031581; cn 1023 tng
wd1: (uncorrectable data error)
wd0a: channel reset writing fsbn 1031232 of 1031232-1031359 (wd0 bn 1031232; cng
wd1a: error reading fsbn 1031581 of 1031488-1031615 (wd1 bn 1031581; cn 1023 tn)

raid0: IO Error.  Marking /dev/wd1a as failed.
raid0: Recon read failed!
panic: raidframe error at line 880 file /usr/src/sys/dev/raidframe/rf_reconstruc
Stopped in pid 445.1 (raid_recon) at    netbsd:cpu_Debugger+0x4:    ret    zero,(ra)
db>

(I'm afraid long lines don't wrap on the console I used, so they're truncated.)
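The Check Condition line in the SCSI log includes the failing command's
CDB.  A minimal sh sketch for decoding it, assuming the bytes form a
standard SCSI READ(6) command (opcode 0x08: a 21-bit LBA in the low five
bits of byte 1 plus bytes 2 and 3, and a transfer length in byte 4, where
0 means 256 blocks).  The function name decode_read6 is hypothetical:

```sh
#!/bin/sh
# decode_read6 B0 B1 B2 B3 B4 B5
# Hypothetical helper: decode a SCSI READ(6) CDB given as six hex
# bytes without the 0x prefix, e.g. "08 00 10 40 80 00".
decode_read6() {
    if [ "$#" -ne 6 ] || [ "$((0x$1))" -ne 8 ]; then
        echo "not a 6-byte READ(6) CDB" >&2
        return 1
    fi
    # 21-bit logical block address: low 5 bits of byte 1, then bytes 2-3
    lba=$(( ((0x$2 & 0x1f) << 16) | (0x$3 << 8) | 0x$4 ))
    # Transfer length in blocks; 0 encodes 256
    len=$((0x$5))
    if [ "$len" -eq 0 ]; then
        len=256
    fi
    echo "lba=$lba blocks=$len"
}

# CDB from the sd1 Check Condition message above
decode_read6 08 00 10 40 80 00
```

For this CDB the sketch prints "lba=4160 blocks=128", i.e. a 128-block
read near the start of sd1.  That only decodes the logged bytes; on its
own it doesn't say whether the drive or the RAID setup is at fault.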

I suppose I was asking for trouble in the second case, since wd1 has
~63 K of bad sectors, but I'm pretty sure they were in the swap
partition, so I thought they wouldn't be relevant.  I've no reason to
think there was a hardware problem with the SCSI setup.
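Where the IDE read errors land can be worked out with a little
arithmetic.  A minimal sh sketch, assuming (a) RAIDframe reserves the
first 64 sectors of each component for its component label, so raid0
sector 0 is component sector 64, and (b) hypothetically, that the IDE
raid0 label places raid0b as the SCSI raid0 label shown below does
(offset 262144, size 1048576).  The IDE labels aren't given here, so (b)
is a guess.  Because the log reports fsbn equal to bn, wd1a is taken to
start at sector 0 of wd1.  The function name where_is is hypothetical:

```sh
#!/bin/sh
# Assumption: RAIDframe keeps its component label in the first 64
# sectors of each component, so raid0 sector 0 = component sector 64.
RF_RESERVED=64
# Hypothetical for the IDE pair: raid0b placed as in the SCSI raid0 label.
SWAP_OFF=262144
SWAP_SIZE=1048576

# where_is BLOCK: map a block on the component (wd1a) to a raid0 sector
where_is() {
    r=$(( $1 - RF_RESERVED ))
    if [ "$r" -lt 0 ]; then
        echo "$1: RAIDframe reserved area"
    elif [ "$r" -ge "$SWAP_OFF" ] && [ "$r" -lt $(( SWAP_OFF + SWAP_SIZE )) ]; then
        echo "$1: raid0 sector $r, inside raid0b (swap)"
    else
        echo "$1: raid0 sector $r, outside raid0b"
    fi
}

# Block numbers taken from the wd1a read errors above
where_is 1031488
where_is 1031581
```

Under those assumptions both failing blocks fall inside raid0's swap
area.  That doesn't make them harmless: RAIDframe sits below the
filesystem and swap, so the rebuild has no way to skip blocks that only
hold swap, and the log shows it reading them anyway ("Recon read
failed!").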

The SCSI disks were probed as (from dmesg):

sd0 at scsibus0 target 0 lun 0: <DEC, RZ1CF-AF (C) DEC, 1614> disk fixed
sd0: async, 8-bit transfers
sd0: 4091 MB, 3708 cyl, 20 head, 113 sec, 512 bytes/sect x 8380080 sectors
sd0: sync (100.00ns offset 8), 8-bit (10.000MB/s) transfers, tagged queueing
sd1 at scsibus0 target 2 lun 0: <DEC, RZ1CF-AF (C) DEC, 1614> disk fixed
sd1: async, 8-bit transfers
sd1: 4091 MB, 3708 cyl, 20 head, 113 sec, 512 bytes/sect x 8380080 sectors
sd1: sync (100.00ns offset 8), 8-bit (10.000MB/s) transfers, tagged queueing
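As a quick consistency check on the probe lines, the reported geometry
multiplies out to the sector count, and the sector count to the "4091 MB"
figure, assuming MB here means 2^20 bytes.  A minimal sh sketch:

```sh
#!/bin/sh
# Geometry and sector count from the sd0/sd1 probe lines above
CYL=3708; HEADS=20; SECS=113; BPS=512; TOTAL=8380080

# cylinders * heads * sectors per track should equal the sector count
chs=$(( CYL * HEADS * SECS ))
# size in MB (assumed 2^20 bytes), truncated to an integer
mib=$(( TOTAL * BPS / 1048576 ))
echo "cyl*head*sec = $chs (reported $TOTAL)"
echo "size = $mib MB"
```

Both disks report identical figures, so sd0 and sd1 should be
interchangeable as RAID-1 components.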

The disklabels I used were:

sd0 and sd1
-----------
#        size    offset     fstype [fsize bsize cpg/sgs]
  a:   8380080         0       RAID
  b:   1048576    262208       swap
  c:   8380080         0     unused      0     0

raid0
-----
#        size    offset     fstype [fsize bsize cpg/sgs]
  a:    262144         0     4.2BSD      0     0     0
  b:   1048576    262144       swap
  c:   8379904         0     unused      0     0     0
  d:   7069184   1310720     4.2BSD      0     0     0
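The label sizes and offsets can be cross-checked with a little
arithmetic.  A minimal sh sketch, assuming RAIDframe reserves 64 sectors
at the start of each component and that raid0 was configured with 128
sectors per stripe unit (sectPerSU).  Neither value is stated in this
message; both are assumptions:

```sh
#!/bin/sh
# From the sd0/sd1 label: partition a (fstype RAID) covers the whole disk.
COMP_SIZE=8380080
# Assumptions, not stated in the message:
RF_RESERVED=64     # sectors RAIDframe reserves for its component label
SECT_PER_SU=128    # sectPerSU from the raid0 configuration file

# raid0's size: component minus reserved area, rounded down
# to a whole number of stripe units.
raid_size=$(( (COMP_SIZE - RF_RESERVED) / SECT_PER_SU * SECT_PER_SU ))
echo "raid0c size: $raid_size (label says 8379904)"

# raid0b starts at raid0 sector 262144, i.e. component sector 262144 + 64,
# which is where sd0b/sd1b must start to cover the same blocks.
swap_on_disk=$(( 262144 + RF_RESERVED ))
echo "sd0b offset: $swap_on_disk (label says 262208)"

# raid0d should run exactly to the end of raid0.
d_end=$(( 1310720 + 7069184 ))
echo "raid0d end:  $d_end (raid0c size $raid_size)"
```

With those assumptions every figure matches the labels above, so the
layout looks self-consistent; in particular sd0b overlays exactly the
blocks of raid0b.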

/etc/fstab
----------
/dev/raid0a     /       ffs     rw      1       1
/dev/raid0b     none    swap    sw      0       0
/dev/raid0d     /usr    ffs     rw      1       2
/dev/sd0b       none    swap    dp      0       0
kernfs          /kern   kernfs  rw
procfs          /proc   procfs  rw,noauto

To install boot blocks I used:

# /usr/sbin/installboot -v /dev/rsd1c /usr/mdec/bootxx_ffs

Can you see anything awry with any of this?


Ray