Subject: Re: Data corruption issues, probably involving ffs2 and >1Tb (SOLVED?)
To: Daniel Carosone <dan@geek.com.au>
From: Nino Dehne <ndehne@gmail.com>
List: current-users
Date: 01/24/2007 19:35:38
Hi there,

First, I feel really stupid and I'm terribly sorry to have caused
such an uproar. It appears that the issue _was_ hardware-based after all.
At least that's how things look currently. Let me explain:

Before messing around further I wanted to try the setup in my desktop
box. So I swapped disks, using a different add-on controller than in
the server and also using different cables.

The issue didn't show up. A bit let down that the new server hardware
might be flaky, and not knowing exactly which part of it was at fault, I
tried running the same setup in the desktop with the add-on controller
from the server (HPT371 single-channel). This brought back the dreaded
no-panic-no-nothing lockups I had already experienced in the server
earlier. Back then, I used both the HPT and an additional SiI0680
cmdide(4) controller so that all disks had their dedicated channel.
Seeing those lockups on the desktop now immediately raised a flag.

It dawned on me that the cause of the lockups earlier might not have been
the cmdide(4) controller I ripped out but instead the hptide(4) one. The
cmdide(4) had other issues in the desktop box, though (lost interrupts).

I swapped all disks back to the server and replaced the HPT with a Promise
FastTrak100. And what can I say, 200 runs without a single error. I will
watch things closely but I'm confident.

I still don't understand the symptoms fully, though.

On Mon, Jan 22, 2007 at 08:45:19AM +1100, Daniel Carosone wrote:
> > As a wild guess, I resolved all IRQ conflicts on the machine. 
> > [..]
> > Both steps helped nothing to resolve the issue.
> 
> These were unlikely at this point, but thanks for going to the effort
> of eliminating them.

As it turned out, nothing seems to be unlikely. :/ I would never have
expected the controller to be flaky either, especially not when I can do
huge transfers from a raw device without an error. Do you think there
might still be a bug in NetBSD, just in hptide(4) with that specific
controller rather than in the FFS code?

Anyway, thanks a lot for your efforts, everyone, and sorry for the trouble.

Best regards,

ND