Subject: Re: Data corruption issues possibly involving cgd(4)
To: Daniel Carosone <dan@geek.com.au>
From: Nino Dehne <ndehne@gmail.com>
List: current-users
Date: 01/18/2007 09:21:25
On Wed, Jan 17, 2007 at 11:58:56PM +0100, Nino Dehne wrote:
> On Thu, Jan 18, 2007 at 07:31:47AM +1100, Daniel Carosone wrote:
> > Nino, are you running a kernel with DIAGNOSTIC and/or DEBUG?  Looking
> > at the cgd panic you found, I'm guessing not, because the path we see
> > to that problem would have involved one or more likely DIAGNOSTIC
> > messages.
> 
> Not yet, but that just went on my list of things to try.

I'm now running the system with those options. I haven't tried to provoke
the cgd panic yet, though, since parity recalculation is a lengthy process.


> 1) Boot DIAGNOSTIC+DEBUG kernel
> 2) Run fsck -f[1]

I ran fsck -fn 10 times in a row, with 4 gzips running concurrently.
Nothing. Output looked like this each time:

** /dev/rcgd0a (NO WRITE)
** File system is already clean
** Last Mounted on /home
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
678270 files, 138286777 used, 8460366 free (8334 frags, 1056504 blocks, 0.0% fragmentation)
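
For reference, the test amounted to roughly the following (paraphrased; the
gzips are just load generators here, the actual invocations may have
differed):

pids=""
i=0
while [ $i -lt 4 ]; do
    gzip -9 < /dev/zero > /dev/null &
    pids="$pids $!"
    i=$((i + 1))
done
i=0
while [ $i -lt 10 ]; do
    fsck -fn /dev/rcgd0a
    i=$((i + 1))
done
kill $pids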

What I did manage was to get two samples of what the corruption looks like.
I copied a ~650M file to /var/tmp, where the corruption has never occurred
so far. I verified that the copy was intact by hashing it 100 times without
a single mismatch. The file is a .rar archive, so I could also check its
integrity directly.
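
The hashing check was just a loop like this (file name is a placeholder); if
every read returns the same data, it prints a single unique line:

i=0
while [ $i -lt 100 ]; do
    sha1 /var/tmp/file.rar
    i=$((i + 1))
done | sort -u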

I then wrote a little script that copied the same file from cgd0a to /var/tmp
over and over, each time under a different name, hashed each copy and aborted
if the hash didn't match the predetermined value.
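
The script boiled down to something like this (paraphrased; file names are
placeholders):

#!/bin/sh
SRC=/home/file.rar          # the ~650M test file on cgd0a
GOOD=$(sha1 /var/tmp/good.rar | awk '{print $NF}')  # known-good hash
n=0
while :; do
    n=$((n + 1))
    cp "$SRC" "/var/tmp/copy.$n"
    SUM=$(sha1 "/var/tmp/copy.$n" | awk '{print $NF}')
    if [ "$SUM" != "$GOOD" ]; then
        echo "hash mismatch on copy $n"
        exit 1              # exit before the rm, keeping the bad copy
    fi
    rm "/var/tmp/copy.$n"
done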

Then I ran cmp -l /var/tmp/<good file> /var/tmp/<bad file>:

503124993 246 310
503124994 132 251
503124995 230 221
503124996 211 351
503124997  51  46
503124998 214 173
503124999 374 122
503125000 144 331
503125001 134 141
503125002 150 336
503125003  46 247
503125004 266 153
503125217 257 211
503125218 303 217
503125219 111  14
503125220  70 227
503125221   2 316
503125222 343 340
503125223 207 372
503125224 350 210
503125229 100  67
503125230  64 145
503125231 262 327
503125232 205 146

Another run of the script produced a second sample. cmp -l:

502883433 167 363
502883434 141 126
502883435  26  11
502883436 311  67
502883437  25 153
502883438 302 103
502883439 145  40
502883440 103  71
502883445 346 174
502883446  45  60
502883447 333 262

I got both samples in fewer than 20 runs of the script.
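
Note that cmp -l prints 1-based byte positions. To map a position to a
512-byte sector offset within the file:

awk 'BEGIN { p = 503124993; printf "sector %d, byte %d\n", int((p-1)/512), (p-1)%512 }'

That puts the first sample at sector 982666, byte 0, i.e. the mismatches
start exactly on a 512-byte boundary; the second sample (position 502883433)
maps to sector 982194, byte 104.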

The file size is exactly 678765312 bytes. For good measure I hashed both the
good and the bad copy 100 times while they were in /var/tmp: no mismatch
against their respective hashes, i.e. both copies read back stably.

As a wild guess, I resolved all IRQ conflicts on the machine. The extra
IDE controller shares an interrupt with one of the USB controllers, so I
disabled USB temporarily.
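
Sharing like that shows up with a simple grep over the boot messages, looking
for the same irq number attached to more than one device:

dmesg | grep irq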

Also, since all disks have a separate scratch partition in addition to the
respective RAID component, I ran the usual hashing loop on the disk that's
connected to the separate controller[1].
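
That loop, roughly (device and partition names are placeholders):

# repeatedly hash a fixed 1G chunk of the raw scratch partition
while :; do
    dd if=/dev/rwd3e bs=1m count=1024 2>/dev/null | sha1
done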

Neither step did anything to resolve the issue.


> 3) Last resort: transfer disks to my desktop machine and try to reproduce
>    the problem

That will have to wait. I'll also try to reproduce the cgd panic while I'm
at it.

Best regards,

ND


[1]:
hptide0 at pci0 dev 9 function 0
hptide0: Triones/Highpoint HPT371 IDE Controller
hptide0: bus-master DMA support present
hptide0: primary channel wired to native-PCI mode
hptide0: using ioapic0 pin 16 (irq 7) for native-PCI interrupt