Subject: Re: sudden instability on 1.6.1
To: None <netbsd-users@netbsd.org>
From: Martin Schmitz <martin-schmitz@web.de>
List: netbsd-users
Date: 07/31/2003 01:32:36
"Steven M. Bellovin" <smb@research.att.com> writes:

> In message <HIr3Dw.JKL@tac.nyc.ny.us>, Christos Zoulas writes:
>>In article <20030728170854.040F37C92@berkshire.research.att.com>,
>>Steve Bellovin <smb@research.att.com> wrote:
>>>A 1.6.1 machine of mine has suddenly started crashing, for no apparent 
>>>reason.  For the last crash, I deliberately left it not running X, so 
>>>I could see any messages:
>>>
>>>/tmp: got error 5 while accessing file system
>>>panic: softdep_deallocate_dependencies: unrecovered I/O error
>>
>>Hmm, 5 = EIO, I see a few places in sd.c where EIO is returned but
>>does not make a lot of sense to me. I'd add some printf's and see
>>which one is causing it.
>>
>
> I will add some printfs.  For now, I made /tmp an MFS file system and 
> turned off softdep on the other partition on the drive.  When I did 
> that, I got the following in /var/log/messages when I tried listing the 
> root directory of that file system:
>
> Jul 30 14:30:56 sigaba /netbsd: sd1(ahc1:0:1:0): SCB 14 - timed out while idle, SEQADDR == 0xa
> Jul 30 14:30:57 sigaba /netbsd: SCSIRATE == 0x0
> Jul 30 14:30:57 sigaba /netbsd: sd1(ahc1:0:1:0): Queuing a BDR SCB
> Jul 30 14:30:57 sigaba /netbsd: sd1(ahc1:0:1:0): no longer in timeout, status = 0
>
> 'ls' said 'Input/output error'.
>
> It's clear that I have a hardware problem, though whether it's the 
> drive, the controller, or the cable is still unclear to me.  There are
> some NetBSD issues, too, such as why I didn't see any kernel error 
> messages when I had softdep enabled, or why the system panicked when 
> the softdep layer received EIO from the driver.  That latter is 
> unacceptable, I think; it's reminiscent of 6th Edition Unix and 
> earlier, where you were told to buy error-free disk packs because the 
> drivers couldn't recover....

It seems to me that there are some major bugs in the ahc driver. At
least, someone was reporting similar problems with the FreeBSD
implementation, and the two drivers probably share the same codebase.

,----
| ~> dmesg | grep ahc 
| ahc0 at pci0 dev 18 function 0
| ahc0: interrupting at irq 11
| ahc0: Using left over BIOS settings
| ahc0: aic7860 Single Channel A, SCSI Id=7, 3/255 SCBs
| scsibus0 at ahc0: 8 targets, 8 luns per target
`----

My system also crashes on random occasions without any error
messages. I'm quite sure that it's *not* a hardware problem.

Using softdeps finally left the filesystem in an unrecoverable state, so
I now mount those partitions without it - and I can turn off the power
without getting severe errors on the next boot.
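
As a rough sketch only (the device name and mount point below are
hypothetical, not taken from this machine), turning softdep off is a
matter of dropping the option from the partition's /etc/fstab entry:

,----
| # before: soft dependencies enabled
| # /dev/sd0e  /home  ffs  rw,softdep  1 2
| # after: plain FFS mount
| /dev/sd0e  /home  ffs  rw  1 2
`----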

Martin

P.S.: Sometimes I can make heavy use of this system, compiling lots of
things etc., for weeks without crashes - and sometimes crashes happen
twice a day. The crash is always a total freeze without any messages.