port-i386: Re: installing/running 1.4D

Subject: Re: installing/running 1.4D
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
From: jiho <root@mail.c-zone.net>
List: port-i386
Date: 10/06/1999 13:36:01
>> 1.  The new boot sectors from FreeBSD, which use extended BIOS calls for 
>> LBA access (and thus let you boot from beyond 8 GB) work great.  I have 
>> 1.4D in a partition that occupies exactly all of the drive beyond 8 GB -- 
>> because I don't have any other OSes that can do that -- and am using the 
>> MBR menu as well.  It all just works.
> 
> I think the boot sectors from NetBSD should work as well.

Ah, I _meant_ the NetBSD boot sectors, which are stated to be from (that is, 
_ported_ from) FreeBSD.  The formal releases of FreeBSD don't use them yet.  
Sorry about the vague statement.

The point was, this new feature in NetBSD/i386 seems to work very well, and 
without it, making full use of the new drive while sharing it with other OSes 
would be more difficult.

>> 3.  The lost "pciide lost interrupt" problem:  In my opinion, this is not 
>> cabling, but the drive doing a thermal recalibration.  Notice everyone says 
>> this happens while doing a very large transfer of some kind.  It happens to 
>> me while extracting huge tarballs -- exactly once in mid-extraction.  This
>> happens on my SCSI drive, but because that drive is so noisy I can tell 
>> it's a recalibration due to heat from the heavy use.  Because SCSI has a
>> protocol, the driver knows not to complain.  This "UltraDMA" is just a 
>> controller connection, basically, so I guess the only way to make this go 
>> away is to use a longer timeout for _all_ transfers.
> 
> Does it just complain about a timeout, or also with a DMA error of some kind
> (I guess "lost interrupt" here). The timeout is already 10s, so I would be
> surprised if it needed more than that to recalibrate.

First of all, what I said about my SCSI drive was misleading.  That drive does 
a thermal recalibration on a timer basis, no matter what is happening, whether 
it really needs to or not.  And because it's noisy, it makes a recognizable 
sound while recalibrating, so I know when it's recalibrating.

As for this UDMA drive, I have no idea how it handles recalibration.  It is a 
much cooler, much quieter drive, though, so quiet that I have trouble hearing 
it.  But I too would be surprised if it needed more than 10 seconds to 
recalibrate, whatever method it uses.

But when this problem occurs, the pause is very brief.  In fact, everything 
rolls by so fast, I hardly have time to notice that something happened and 
read the message from the kernel about it.  Whatever the cause, I would be 
_very_ surprised if the timeout is 10 seconds, actually.

However, when I went looking for the place in the source code where the 
timeout value is set, I couldn't find it.  (The subsystem is complicated, but 
I usually have trouble deciphering other people's code anyway.)

I found some code that matches the message on the screen.  In <dev/ic/wdc.c>, 
wdctimeout() has the "lost interrupt" part; in <dev/scsipi/atapi_base.c>, 
atapi_interpret_sense() has the "soft error (corrected)" part.

Why the scsipi ATAPI code is involved, I don't know.  I have no ATAPI devices.

The timeout() call is made in <dev/ic/wdc.c>, __wdccommand_start(), which 
pulls the value out of the wdc_command structure passed in to it, like so:

  wdc_c->timeout / 1000 * hz

The only thing I found with 10 seconds is in <dev/ic/wdcvar.h>:

  #define WAITTIME  (10 * hz)

but I don't see where that enters into this.

The wdc_c->timeout value is 0 by default.  A couple of functions in 
<dev/ata/ata.c>, which have nothing to do with this, set it:  one to 1 
second, another to 30.

Meanwhile, I have been completely unable to reproduce the problem, yesterday 
or today.  What is the difference?  Well, it's much cooler now than when I had 
the problem.  Then we were in a heat wave, and I was in my shorts.  Now, I am 
wearing a sweater.

I suppose the cable might have been overheating, but with hardware problems 
like that, you tend to see the failure more frequently, sporadically, and 
seemingly at random, in a variety of situations.  This happened to me only as 
rare, isolated incidents, in one very specific situation.

> --
> Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
> --


--Jim Howard  <jiho@mail.c-zone.net>