Subject: Re: sudden instability on 1.6.1
To: None <netbsd-users@NetBSD.org>
From: Steven M. Bellovin <smb@research.att.com>
List: netbsd-users
Date: 07/30/2003 14:45:59
In message <HIr3Dw.JKL@tac.nyc.ny.us>, Christos Zoulas writes:
>In article <20030728170854.040F37C92@berkshire.research.att.com>,
>Steve Bellovin <smb@research.att.com> wrote:
>>A 1.6.1 machine of mine has suddenly started crashing, for no apparent 
>>reason.  For the last crash, I deliberately left it not running X, so 
>>I could see any messages:
>>
>>/tmp: got error 5 while accessing file system
>>panic: softdep_deallocate_dependencies: unrecovered I/O error
>
>Hmm, 5 = EIO. I see a few places in sd.c where EIO is returned, but
>they do not make a lot of sense to me. I'd add some printf's and see
>which one is causing it.
>
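
As a rough illustration of the printf approach, here is a minimal userland sketch, not actual sd.c code. The macro RETURN_EIO and the function fake_start_io are hypothetical names made up for this example; the idea is only that each EIO exit reports its own file and line, so the offending return can be identified from the output.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: log which return site produced EIO, then return it. */
#define RETURN_EIO(why)                                              \
    do {                                                             \
        fprintf(stderr, "%s:%d: returning EIO (%s)\n",               \
                __FILE__, __LINE__, (why));                          \
        return EIO;                                                  \
    } while (0)

/*
 * Stand-in for a driver routine with several EIO exits.
 * The conditions are invented; a real driver checks other things.
 */
int fake_start_io(int media_present, int xfer_ok)
{
    if (!media_present)
        RETURN_EIO("no media");
    if (!xfer_ok)
        RETURN_EIO("transfer failed");
    return 0;
}
```

In a kernel the fprintf would presumably be a kernel printf instead, and the messages would land on the console or in /var/log/messages.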

I will add some printfs.  For now, I made /tmp an MFS file system and 
turned off softdep on the other partition on the drive.  When I did 
that, I got the following in /var/log/messages when I tried listing the 
root directory of that file system:

Jul 30 14:30:56 sigaba /netbsd: sd1(ahc1:0:1:0): SCB 14 - timed out while idle, SEQADDR == 0xa
Jul 30 14:30:57 sigaba /netbsd: SCSIRATE == 0x0
Jul 30 14:30:57 sigaba /netbsd: sd1(ahc1:0:1:0): Queuing a BDR SCB
Jul 30 14:30:57 sigaba /netbsd: sd1(ahc1:0:1:0): no longer in timeout, status = 0

'ls' said 'Input/output error'.
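
For reference, the "error 5" in the panic and the message from 'ls' are the same errno. A small sketch for checking that mapping on a given system follows; the function names describe_errno and is_io_error are made up for this example, and the exact message text returned by strerror() can vary between C libraries.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Print the numeric value and the message text for an errno code. */
void describe_errno(int code)
{
    printf("errno %d: %s\n", code, strerror(code));
}

/* Return nonzero if the code is the I/O error seen in the panic message. */
int is_io_error(int code)
{
    return code == EIO;
}
```

Calling describe_errno(5) should print the I/O error text on systems where EIO is 5, which includes NetBSD.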

It's clear that I have a hardware problem, though whether it's the 
drive, the controller, or the cable is still unclear to me.  There are
some NetBSD issues, too, such as why I didn't see any kernel error 
messages when I had softdep enabled, or why the system panicked when 
the softdep layer received EIO from the driver.  The latter is 
unacceptable, I think; it's reminiscent of 6th Edition Unix and 
earlier, where you were told to buy error-free disk packs because the 
drivers couldn't recover....


		--Steve Bellovin, http://www.research.att.com/~smb