Subject: Re: raidframe problems (revisited)
To: Greg Troxel <gdt@ir.bbn.com>
From: Louis Guillaume <lguillaume@berklee.edu>
List: netbsd-users
Date: 05/11/2007 09:30:00
Greg Troxel wrote:

> So I can believe that your system doesn't work with raid, but does
> with single disks.  I'd be suspicious of the power supply and your
> memory.

Ok - I've since replaced both the memory (brand new) and the power
supply, at widely separated times, and neither has improved the situation.

> To test, I'd mount the underlying filesystems ro, and compare.  I have
> wd0a and wd1a each 63-end, type RAID, and raid0 is partitioned / /var
> /usr /home pretty normally.  I then have wd0 efgh matching raid0 efga.
> This is tricky, but if you add the RAID partition start, the RAID
> offset (64), and the within-raid offset, you get the starting sector
> of one of the copies of the filesystem.  (This only works for RAID-1.)
> Then, after putting files in /home, I can mount wd0g and wd1g and sha1
> the files from the underlying disks.  I found a few different.
> Looking at the differences, the bad version would have 3-20 bytes with
> extra bits set, usually 0200.   This feels more like memory problems
> than raidframe bugs.
> 
> If you can pull each half your memory in turn and retest that would be
> interesting.
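
The offset arithmetic quoted above can be sketched as a small shell
function. This is only an illustration: the 63 and 64 come from the
quoted description, while the within-raid offset of 1048576 sectors is a
made-up value, not taken from any real disklabel.

```shell
#!/bin/sh
# Sketch of the RAID-1 offset arithmetic from the quoted text:
#   start of one filesystem copy on the raw disk
#     = RAID partition start + RAID offset + offset within raid0
raid1_fs_start() {
  part_start=$1   # first sector of the RAID partition (63 in the quote)
  raid_offset=$2  # sectors reserved ahead of the RAID data (64 in the quote)
  fs_offset=$3    # start of the filesystem inside raid0 (hypothetical here)
  echo $(( part_start + raid_offset + fs_offset ))
}

# 1048576 is an invented within-raid offset, used only for illustration.
raid1_fs_start 63 64 1048576   # prints 1048703
```

The result is the sector at which a matching partition on wd0 or wd1
would have to start. As the quoted text says, this only holds for RAID-1.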

Ok here's the path I went down:

After installing a new power supply I fscked, rebuilt the raid, fscked
again and it seemed to be ok. But after a day or so subsequent runs of
fsck showed problems. There was only one stick of RAM in the machine
(with 2 slots). So I decided to stop using raid for the time being and
ordered new RAM, 2 sticks, doubling the memory on the system.

After a couple weeks of flawless performance (non-raid) I put the new
RAM in and tried to rebuild the RAID. Same issues came up. I reverted to
non-raid and had several weeks of good performance.

Next I tried the tests with the memory. I built a new RAID set and
copied the data onto it, and noticed the same issues.

Mounted the partitions and ran the script below. Saw mismatches.
Failed wd1a, fscked. No problems.
Restored files from tape.
Reconstructed the RAID.
Ran fsck (no problems).
Ran the script below *WITHOUT EVER HAVING MOUNTED THE FILE SYSTEM* and
saw the mismatches.
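
To check whether a mismatching pair shows the pattern Greg describes (a
few bytes with extra bits set, usually 0200), something along these lines
could be used. It is a rough sketch, not part of the tests above:
bit_diffs is a hypothetical helper name, and it relies on cmp -l printing
the byte offset and the two differing byte values in octal.

```shell
#!/bin/sh
# bit_diffs FILE_A FILE_B: for each differing byte print
#   offset  octal-in-A  octal-in-B  octal-XOR
# The XOR column shows which bits differ (200 means only the top bit).
bit_diffs() {
  cmp -l "$1" "$2" | while read -r off x y; do
    # A leading 0 makes the shell read the cmp values as octal constants.
    printf '%s %s %s %o\n' "$off" "$x" "$y" $(( 0$x ^ 0$y ))
  done
}
```

For example, bit_diffs /tmp/wd0/some/file /tmp/wd1/some/file (paths are
placeholders) prints one line per differing byte.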

Removed one piece of RAM. Repeated the above. Same issues.

Should I spend the time to check the other piece of RAM? (I can't
imagine why it would matter.) Or maybe the other RAM slot?

Full dmesg is below; maybe there are some problems there. Any further
help would be fantastic. Thanks!

Louis



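After mounting the underlying file systems on /tmp/wd[x] I'm using this
script to do the tests. Unfortunately the counters don't add up unless
you use the "real" ksh: shells such as pdksh or bash run the while loop
at the end of a pipeline in a subshell, so the counter updates are lost
when the loop exits.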

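#!/bin/ksh
# Compare every file under /tmp/wd0 with its counterpart under /tmp/wd1.
# Needs a ksh that runs the loop in the current shell; otherwise the
# counters printed at the end stay at 0.
filecnt=0
badcnt=0
find /tmp/wd0 -type f | while IFS= read -r i ; do
  echo -n .
  (( filecnt+=1 ))
  # Same path on the other half of the mirror
  comp="$(printf '%s\n' "$i" | sed 's/wd0/wd1/')"
  # Strip the "SHA1 (file) = " prefix so only the digests are compared
  first=$(sha1 "$i" | sed 's/SHA1 ([^)]*) = //')
  second=$(sha1 "$comp" | sed 's/SHA1 ([^)]*) = //')
  if [ "$first" != "$second" ] ; then
    (( badcnt+=1 ))
    echo Mismatch!
    echo "File \`$i'"
    echo " vs. \`$comp'"
    echo "wd0 has: $first"
    echo "wd1 has: $second"
    echo
  fi
done
echo
echo "TOTAL FILES PROCESSED = $filecnt"
echo "TOTAL BROKEN FILES    = $badcnt"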
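
For shells that run the piped loop in a subshell, one possible workaround
is to feed the loop from a file instead of a pipe, so the counters stay
in the current shell. The following is a sketch under assumptions, not
the script used for the results above: compare_trees is a hypothetical
name, cmp -s is swapped in for the two sha1 runs (so it reports only
whether the files differ, not their digests), and mktemp is assumed to be
available.

```shell
#!/bin/sh
# compare_trees DIR_A DIR_B: compare every regular file under DIR_A with
# the file at the same relative path under DIR_B.
compare_trees() {
  a=$1
  b=$2
  filecnt=0
  badcnt=0
  list=$(mktemp) || return 1
  # Write the file list out first so the loop can read it by redirection
  # rather than from a pipe.
  find "$a" -type f > "$list"
  while IFS= read -r f; do
    filecnt=$(( filecnt + 1 ))
    other="$b${f#"$a"}"      # swap the DIR_A prefix for DIR_B
    # cmp -s stands in for sha1: exit status only, no output
    if ! cmp -s "$f" "$other"; then
      badcnt=$(( badcnt + 1 ))
      echo "Mismatch: $f vs. $other"
    fi
  done < "$list"
  rm -f "$list"
  echo "TOTAL FILES PROCESSED = $filecnt"
  echo "TOTAL BROKEN FILES    = $badcnt"
}
```

It would be called as compare_trees /tmp/wd0 /tmp/wd1.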

# dmesg
NetBSD 3.1_STABLE (GENERIC) #3: Sun Apr 15 15:16:08 EDT 2007
        louis@maat.zabrico.com:/usr/obj/sys/arch/i386/compile/GENERIC
total memory = 223 MB
avail memory = 210 MB
BIOS32 rev. 0 found at 0xfdae0
mainbus0 (root)
cpu0 at mainbus0: (uniprocessor)
cpu0: AMD Athlon (686-class), 1110.96 MHz, id 0x662
cpu0: features c3cbfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features c3cbfbff<PGE,MCA,CMOV,PAT,PSE36,MPC,MMXX,MMX>
cpu0: features c3cbfbff<FXSR,SSE,3DNOW2,3DNOW>
cpu0: "AMD Athlon(tm) Processor"
cpu0: I-cache 64 KB 64B/line 2-way, D-cache 64 KB 64B/line 2-way
cpu0: L2 cache 256 KB 64B/line 16-way
cpu0: ITLB 16 4 KB entries fully associative, 8 4 MB entries fully associative
cpu0: DTLB 32 4 KB entries fully associative, 8 4 MB entries 4-way
cpu0: 8 page colors
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: VIA Technologies product 0x3116 (rev. 0x00)
agp0 at pchb0: aperture at 0xe0000000, size 0x10000000
ppb0 at pci0 dev 1 function 0: VIA Technologies VT8633 (Apollo Pro 266) CPU-AGP Bridge (rev. 0x00)
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled
vga1 at pci1 dev 0 function 0: S3 product 0x8d04 (rev. 0x00)
wsdisplay0 at vga1 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
ahc1 at pci0 dev 10 function 0: Adaptec 2940 Ultra SCSI adapter
ahc1: interrupting at irq 10
ahc1: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
scsibus0 at ahc1: 16 targets, 8 luns per target
satalink0 at pci0 dev 11 function 0
satalink0: Silicon Image SATALink 3512 (rev. 0x01)
satalink0: SATALink BA5 register space disabled
satalink0: bus-master DMA support present
satalink0: primary channel wired to native-PCI mode
satalink0: using irq 12 for native-PCI interrupt
atabus0 at satalink0 channel 0
satalink0: secondary channel wired to native-PCI mode
atabus1 at satalink0 channel 1
pcib0 at pci0 dev 17 function 0
pcib0: VIA Technologies VT8233A PCI-ISA Bridge (rev. 0x00)
viaide0 at pci0 dev 17 function 1
viaide0: VIA Technologies VT8233A ATA133 controller
viaide0: bus-master DMA support present
viaide0: primary channel configured to compatibility mode
viaide0: primary channel interrupting at irq 14
atabus2 at viaide0 channel 0
viaide0: secondary channel configured to compatibility mode
viaide0: secondary channel interrupting at irq 15
atabus3 at viaide0 channel 1
uhci0 at pci0 dev 17 function 2: VIA Technologies VT83C572 USB Controller (rev. 0x23)
uhci0: interrupting at irq 12
usb0 at uhci0: USB revision 1.0
uhub0 at usb0
uhub0: VIA Technologies UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1 at pci0 dev 17 function 3: VIA Technologies VT83C572 USB Controller (rev. 0x23)
uhci1: interrupting at irq 12
usb1 at uhci1: USB revision 1.0
uhub1 at usb1
uhub1: VIA Technologies UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
rtk0 at pci0 dev 19 function 0: Realtek 8139 10/100BaseTX (rev. 0x10)
rtk0: interrupting at irq 10
rtk0: Ethernet address 00:20:ed:47:c2:96
rlphy0 at rtk0 phy 7: Realtek internal PHY
rlphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
isa0 at pcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
sysbeep0 at pcppi0
isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
npx0 at isa0 port 0xf0-0xff: using exception 16
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
isapnp0: no ISA Plug 'n Play devices found
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
Kernelized RAIDframe activated
scsibus0: waiting 2 seconds for devices to settle...
satalink0: port 0: device present, speed: 1.5Gb/s
satalink0: port 1: device present, speed: 1.5Gb/s
wd0 at atabus0 drive 0: <ST3120026AS>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 111 GB, 232581 cyl, 16 head, 63 sec, 512 bytes/sect x 234441648 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd0(satalink0:0:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
wd1 at atabus1 drive 0: <ST3120026AS>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 111 GB, 232581 cyl, 16 head, 63 sec, 512 bytes/sect x 234441648 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1(satalink0:1:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
st0 at scsibus0 target 1 lun 0: <IBM, ULT3580-TD2, 38D0> tape removable
st0: density code 64, variable blocks, write-enabled
st0: sync (50.00ns offset 8), 16-bit (40.000MB/s) transfers
wd2 at atabus2 drive 0: <Maxtor 53073H4>
wd2: drive supports 16-sector PIO transfers, LBA addressing
wd2: 29311 MB, 59554 cyl, 16 head, 63 sec, 512 bytes/sect x 60030432 sectors
wd2: 32-bit data port
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd2(viaide0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
<...cut...>