Subject: Re: sudden instability on 1.6.1
To: None <netbsd-users@netbsd.org>
From: Martin Schmitz <martin-schmitz@web.de>
List: netbsd-users
Date: 07/31/2003 01:32:36
"Steven M. Bellovin" <smb@research.att.com> writes:
> In message <HIr3Dw.JKL@tac.nyc.ny.us>, Christos Zoulas writes:
>>In article <20030728170854.040F37C92@berkshire.research.att.com>,
>>Steve Bellovin <smb@research.att.com> wrote:
>>>A 1.6.1 machine of mine has suddenly started crashing, for no apparent
>>>reason. For the last crash, I deliberately left it not running X, so
>>>I could see any messages:
>>>
>>>/tmp: got error 5 while accessing file system
>>>panic: softdep_deallocate_dependencies: unrecovered I/O error
>>
>>Hmm, 5 = EIO, I see a few places in sd.c where EIO is returned but
>>does not make a lot of sense to me. I'd add some printf's and see
>>which one is causing it.
>>
>
> I will add some printfs. For now, I made /tmp an MFS file system and
> turned off softdep on the other partition on the drive. When I did
> that, I got the following in /var/log/messages when I tried listing the
> root directory of that file system:
>
> Jul 30 14:30:56 sigaba /netbsd: sd1(ahc1:0:1:0): SCB 14 - timed out while idle, SEQADDR == 0xa
> Jul 30 14:30:57 sigaba /netbsd: SCSIRATE == 0x0
> Jul 30 14:30:57 sigaba /netbsd: sd1(ahc1:0:1:0): Queuing a BDR SCB
> Jul 30 14:30:57 sigaba /netbsd: sd1(ahc1:0:1:0): no longer in timeout, status = 0
>
> 'ls' said 'Input/output error'.
>
> It's clear that I have a hardware problem, though whether it's the
> drive, the controller, or the cable is still unclear to me. There are
> some NetBSD issues, too, such as why I didn't see any kernel error
> messages when I had softdep enabled, or why the system panicked when
> the softdep layer received EIO from the driver. That latter is
> unacceptable, I think; it's reminiscent of 6th Edition Unix and
> earlier, where you were told to buy error-free disk packs because the
> drivers couldn't recover....
It seems to me that there are some major bugs in the ahc driver.
Someone was reporting similar problems with the FreeBSD implementation,
and the two drivers probably share the same codebase.
,----
| ~> dmesg | grep ahc
| ahc0 at pci0 dev 18 function 0
| ahc0: interrupting at irq 11
| ahc0: Using left over BIOS settings
| ahc0: aic7860 Single Channel A, SCSI Id=7, 3/255 SCBs
| scsibus0 at ahc0: 8 targets, 8 luns per target
`----
My system also crashes at random, without any error messages. I'm quite
sure that it's *not* a hardware problem.
Using softdeps finally left the filesystem in an unrecoverable state, so I
now mount those partitions without softdep - and I can turn off the power
without getting severe errors on the next boot.
Martin
P.S.: Sometimes I can make heavy use of this system, compiling lots of
things etc., for weeks without a crash - and sometimes it crashes
twice a day. The crash is always a total freeze without any messages.