I have a NetBSD box running 6.0.1 i386. It has four 3TB HDs with two raidframe raid arrays configured.
The first array (raid0) is a RAID 1 mirror for / (currently over wd0 and wd3, using 5GB on each disk); the second (raid1) is a RAID 5 for a data partition over wd0, wd1, wd2 and wd3, using all remaining space and reporting 8TB in total.
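As a rough sanity check of that 8TB figure, here is a minimal sketch of the RAID 5 capacity arithmetic. It assumes "3TB" and "5GB" are decimal units and that every disk reserves 5GB for the root array; neither is confirmed by the actual disklabels, and the names are purely illustrative.

```python
DISKS = 4
DISK_BYTES = 3 * 10**12   # assumption: "3TB" means 3e12 bytes
ROOT_BYTES = 5 * 10**9    # assumption: 5e9 bytes reserved for / on every disk


def raid5_usable_bytes(disks, component_bytes):
    # RAID 5 gives up one component's worth of space to parity.
    return (disks - 1) * component_bytes


usable = raid5_usable_bytes(DISKS, DISK_BYTES - ROOT_BYTES)
print(f"{usable / 10**12:.2f} TB (decimal), {usable / 2**40:.2f} TiB (binary)")
```

Under those assumptions the usable space comes to just under 9 TB in decimal units, or a little over 8 in binary units, which would be consistent with the 8TB reported.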
A week ago the system became unresponsive with many errors like the following in /var/log/messages:
Jun 2 14:50:31 ex-fl-sr-03 /netbsd: wd1d: error reading fsbn 5804545856 of 5804545856-5804545983 (wd1 bn 5804545856; cn 5758478 tn 0 sn 32), retrying
Jun 2 14:50:31 ex-fl-sr-03 /netbsd: wd1: (uncorrectable data error)
Jun 2 14:50:31 ex-fl-sr-03 /netbsd: ahcisata0 port 1: device present, speed: 3.0Gb/s
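To check which drive the kernel is actually complaining about, a small tally like the following can be run over the log. This is only a sketch: the regular expression is my own guess at the message format, based on the three lines quoted above, and the sample text is just those lines.

```python
import re
from collections import Counter

# Sample lines copied from the /var/log/messages excerpt above.
LOG = """\
Jun 2 14:50:31 ex-fl-sr-03 /netbsd: wd1d: error reading fsbn 5804545856 of 5804545856-5804545983 (wd1 bn 5804545856; cn 5758478 tn 0 sn 32), retrying
Jun 2 14:50:31 ex-fl-sr-03 /netbsd: wd1: (uncorrectable data error)
Jun 2 14:50:31 ex-fl-sr-03 /netbsd: ahcisata0 port 1: device present, speed: 3.0Gb/s
"""

# Assumed format: "/netbsd: wdN[partition letter]: ... error ..."
DISK_ERROR = re.compile(r"/netbsd: (wd\d+)[a-z]?: .*error")


def count_disk_errors(text):
    """Count kernel error lines per wd device."""
    counts = Counter()
    for line in text.splitlines():
        match = DISK_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


print(count_disk_errors(LOG))
```

On the full log, any entry other than wd1 would change the picture described below.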
At that point the / mirror (RAID 1) was running on wd0 and wd1, with the component on wd1 listed as failed. I added a pre-prepared partition on wd3 to that mirror and rebuilt it. At present the components on wd0 and wd3 are both reporting as optimal.
The odd part was that raidframe had listed the RAID 5 component on wd0 as failed (even though the errors in /var/log/messages only ever referred to wd1) and the component on wd1 as optimal.
I reseated the drives and rebooted the system, and all the drives seemed OK. As there were no errors reported for wd0, and raidframe seemed happy with the RAID 5 component on wd1, I set the array rebuilding onto wd0.
Today (5 days later - these are 3TB drives) the rebuild failed at 99%. Again there are errors in /var/log/messages about wd1 (see above). Again the RAID 5 component that has failed is the one on wd0 (although in this case it never finished rebuilding). The rebuild failed 17 seconds after these errors started being printed to the log:
Jun 2 14:50:48 ex-fl-sr-03 /netbsd: raid1: Recon read failed: 5
Jun 2 14:50:48 ex-fl-sr-03 /netbsd: raid1: reconstruction failed.
Jun 2 14:50:48 ex-fl-sr-03 /netbsd: ahcisata0 port 1: device present, speed: 3.0Gb/s
My reading of the situation is that raidframe is incorrectly failing the RAID 5 component on wd0 because of read errors on wd1. As there are read errors on the component on wd1 (with no redundancy, since one member of the array has already been failed), I need to get as much of the data off the array as possible and rebuild from scratch, probably after replacing wd1 as a failed drive.
Do you agree?
Any idea why raidframe seems to be failing the wrong member of the RAID 5, thus invalidating the whole array?
Thanks in advance,