Subject: Re: IDE UDMA hangs on BP6 (HPT366)
To: Kazushi (Jam) Marukawa <jam@pobox.com>
From: Roger Brooks <R.S.Brooks@liverpool.ac.uk>
List: port-i386
Date: 08/29/2000 15:29:03

On Mon, 28 Aug 2000, Kazushi (Jam) Marukawa wrote:

>Last weekend, I got errors with other configurations.  On
>Aug. 25, I checked out the latest kernel from CVS and switched to
>it.  This kernel reduces the likelihood of a hang.  I can no longer
>reproduce the hang easily using only the tar program.  I have to
>use the machine until it hangs.  That's bad for me.
>
>Anyway, here are the configurations that gave me trouble.

>2. DJNA (IDE3-1) and Max40 (IDE4-1) configuration hangs with the
>   following error messages.  pciide2:0:0 is IDE4-1.  300W PS
>
>  wd1e: DMA error reading fsbn 5516056 of 5516056-5516057 (wd1 bn 7347466; cn 7289 tn 2 sn 28), retrying
>  wd1: soft error (corrected)
>  pciide2:0:0: lost interrupt
>          type: ata tc_bcount: 1024 tc_skip: 0
>  pciide2:0:0: bus-master DMA error: missing interrupt, status=0x21


Actually, this reminds me of a problem which I found when I was playing
with my BP6 earlier this year.  When I first thought I had the HPT366
working, I tested it by using dd to read from wdXd.  I have noticed
that for an unlabelled disk, the 'd' partition in the faked in-kernel
disk label usually seems to be larger than the disk itself (it is like
this in 1.4, and maybe even in 1.3.2).  So if you use dd to read the
whole disk, you get a hard error when you hit the end of what's really
there.
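
For anyone who wants to repeat that read test and note exactly where the
error occurs, here is a minimal sketch of the same idea in Python.  It is
not what I ran (I just used dd); the function name read_until_error and
the 64 KB block size are my own choices for illustration.

```python
import os


def read_until_error(path, block_size=64 * 1024):
    """Read `path` sequentially from the start.

    Returns (bytes_read, error): error is None on a clean end-of-file,
    otherwise the OSError raised by the failing read.
    """
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            try:
                chunk = os.read(fd, block_size)
            except OSError as exc:
                # A hard error from the driver surfaces here (e.g. EIO).
                return offset, exc
            if not chunk:
                # Clean end of file or device.
                return offset, None
            offset += len(chunk)
    finally:
        os.close(fd)
```

Pointing it at the raw whole-disk device (something like /dev/rwd1d, but
the device name is an assumption and depends on your setup) should report
the byte offset at which the hard error is returned.  Of course, if the
machine locks solid, as mine did, the function never returns at all.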

However, what was different with the HPT366 was that I not only got a
hard error, but also a lost interrupt, and the machine locked solid.
I meant to send-pr it, but by the time I'd sorted out other problems
with the HPT366, I had completely forgotten about it.

I wonder if part of the problem is that we've all bought brand-new
UDMA-66 capable disks to go with our HPT366s, and so far none of them
has given any errors?

Suggestion: could someone with an HPT366 which is apparently working
OK attach a known faulty disk which gives reliable errors (!) and see if it
hangs the machine with "lost interrupt"?  I can't (easily) try this,
as my BP6 system is at home, and I haven't been tracking current for
the past several months.  I don't think the faulty disk would necessarily
have to be UDMA-66.
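
If the console output from such a test can be captured (over a serial
console, say, since a hung machine won't write anything to disk), a rough
sketch like the following could tally the relevant messages.  The two
patterns are based only on the messages quoted above, and the name
summarize is hypothetical; other kernels may word things differently.

```python
import re

# Patterns modelled on the quoted messages only; this is an assumption,
# not a complete list of what the wd/pciide drivers can print.
LOST_IRQ = re.compile(r"(pciide\d+:\d+:\d+): lost interrupt")
DMA_ERR = re.compile(r"(wd\d+)[a-p]?: DMA error")


def summarize(log_text):
    """Count lost interrupts per channel and DMA errors per disk."""
    lost = {}
    dma = {}
    for line in log_text.splitlines():
        m = LOST_IRQ.search(line)
        if m:
            lost[m.group(1)] = lost.get(m.group(1), 0) + 1
        m = DMA_ERR.search(line)
        if m:
            dma[m.group(1)] = dma.get(m.group(1), 0) + 1
    return {"lost_interrupt": lost, "dma_error": dma}
```

A disk error which is followed by a "lost interrupt" on the same channel,
and then by nothing further, would be consistent with the hang I saw.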

I realise that this doesn't answer the question of why the errors are
happening in the first place, but the description suggests that there
may be a more serious problem: any error on an HPT366 channel hangs
the machine completely, even though the pciide code should recover
(and possibly downgrade the transfer rate).


Roger

------------------------------------------------------------------------------
Roger Brooks (Systems Programmer),          |  Email: R.S.Brooks@liv.ac.uk
Computing Services Dept,                    |  Tel:   +44 151 794 4441
The University of Liverpool,                |  Fax:   +44 151 794 4442
PO Box 147, Liverpool L69 3BX, UK           | 
------------------------------------------------------------------------------