Subject: 160gb drive on 1.6
To: None <netbsd-help@netbsd.org>
From: Erik Osheim <erik@plastic-idolatry.com>
List: netbsd-help
Date: 12/03/2003 15:23:36
Hello,

I recently bought an extra hard drive for my NetBSD server (i386,
NetBSD-1.6). The drive was supposed to be 120 GB but got upped to 160 GB
somehow (maybe they were out of stock; not important). Anyway, I remember
hearing earlier that drives of 160 GB and larger weren't (fully?)
supported in NetBSD. However, I hoped that it would just get recognized
as a smaller drive, and that would be that.

Instead, it was recognized as a 149 GB drive, which I think is 160 GB
once you account for industry inflation (drive makers count 1000 MB per
GB rather than 1024, etc). I figured I was lucky and support had been
added.
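
For reference, the arithmetic behind that guess can be sketched as
follows. This is only an illustration; the function name is made up, and
it assumes the drive label uses decimal gigabytes (10^9 bytes) while the
reported size uses binary units (2^30 bytes).

```python
def decimal_gb_to_binary_gb(size_gb):
    """Convert a marketing size in decimal GB (10**9 bytes)
    to binary gigabytes (2**30 bytes)."""
    return size_gb * 10**9 / 2**30


# A drive sold as 160 GB, expressed in binary gigabytes.
print(round(decimal_gb_to_binary_gb(160), 2))
```

Under those assumptions a 160 GB label works out to roughly 149 binary
gigabytes, which is consistent with what was reported for the drive.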

However, I started getting soft errors on reads and writes, and before
long those turned into unrecoverable errors. Worse, they were happening
on my other drives' filesystems too, not just on the 160 GB drive.
Pretty soon, /var was unusable and the machine had to be taken down.

Here is a sample of some of the errors that got introduced (from
/var/log/messages):

Dec  1 01:16:55 cage /netbsd: pciide1:0:0: lost interrupt
Dec  1 01:16:55 cage /netbsd:   type: ata tc_bcount: 14848 tc_skip: 1536
Dec  1 01:16:55 cage /netbsd: wd1e: device timeout reading fsbn 143843139 of 143843136-143843167 (wd1 bn 143843139; cn 142701 tn 8 sn 27)
Dec  1 01:17:17 cage /netbsd: wd1e: error reading fsbn 146153712 of 146153600-146153727 (wd1 bn 146153712; cn 144993 tn 12 sn 12), retrying
Dec  1 01:17:17 cage /netbsd: wd1: (aborted command)
Dec  1 01:17:18 cage /netbsd: wd1: soft error (corrected)
Dec  1 01:18:05 cage /netbsd: wd1e: error reading fsbn 144488336 of 144488320-144488447 (wd1 bn 144488336; cn 143341 tn 9 sn 41), retrying
Dec  1 01:18:05 cage /netbsd: wd1: (uncorrectable data error)
Dec  1 01:18:10 cage /netbsd: wd1e: error reading fsbn 144488336 of 144488320-144488447 (wd1 bn 144488336; cn 143341 tn 9 sn 41), retrying
Dec  1 01:18:10 cage /netbsd: wd1: (uncorrectable data error)
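
As a sanity check on the log lines above, the cn/tn/sn values can be
converted back to the "wd1 bn" block numbers. This is a rough sketch, not
anything taken from the kernel: the geometry of 16 heads and 63 sectors
per track is an assumption (it is not stated anywhere in the log), as is
the 512-byte sector size, and the function names are made up.

```python
HEADS = 16              # assumed heads per cylinder (not from the log)
SECTORS_PER_TRACK = 63  # assumed sectors per track (not from the log)
SECTOR_BYTES = 512      # assumed sector size


def chs_to_block(cn, tn, sn, heads=HEADS, sectors=SECTORS_PER_TRACK):
    """Convert cylinder/track/sector to a linear block number,
    treating sn as zero-based, as the logged values appear to be."""
    return (cn * heads + tn) * sectors + sn


def block_to_decimal_gb(bn, sector_bytes=SECTOR_BYTES):
    """Byte offset of a block, in decimal GB (10**9 bytes)."""
    return bn * sector_bytes / 10**9


# (cn, tn, sn, logged "wd1 bn") taken from the three failing reads.
samples = [
    (142701, 8, 27, 143843139),
    (144993, 12, 12, 146153712),
    (143341, 9, 41, 144488336),
]

for cn, tn, sn, logged_bn in samples:
    computed = chs_to_block(cn, tn, sn)
    print(logged_bn, computed == logged_bn,
          round(block_to_decimal_gb(logged_bn), 2))
```

With that assumed geometry the computed block numbers agree with the
logged ones for all three reads, so the cn/tn/sn fields carry no extra
information beyond the block number itself.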

Some experimentation showed that the other drives didn't have to be on
the same channel as the 160 GB drive to get fouled up. I was using the
onboard ATA controller (Asus motherboard; I'm not sure what kind of
controller, but I can check if need be). Drives on the PCI ATA controller
*seemed* to be immune to the problems the 160 GB drive introduced (not
entirely sure).

Fortunately, I had most everything backed up. Recovering from the crash
and reinstalling took a while, but things seem to be stable again now
that I have removed the 160 GB drive (i.e. my other 80 GB drives were not
bad, and the controller itself was not bad).

My questions are:

1. Is this a problem specific to the drive I bought (160GB Western Digital
Caviar... I can furnish more specs if necessary, it's not in front of me),
or a problem NetBSD has with all large drives using DMA?

2. If this is a general problem, is this fixed in -current? Is there work
being done to fix it?

3. In the meantime, is there any place where we document
supported/unsupported configurations? I checked the supported devices, and
didn't see any mention of ATA hard drives to avoid, and I couldn't find
any mention of these kinds of errors (lost interrupts, etc)...

I don't know much about how DMA in the kernel works, so any ideas on
what's going on here would be welcome. If this email belongs in a
different list (port-i386, tech-kernel) please let me know.

Thanks,

-- Erik