port-cobalt: Re: Qube2 crashes every night

Subject: Re: Qube2 crashes every night
To: NetBSD Cobalt list <port-cobalt@netbsd.org>
From: Andy Ruhl <acruhl@gmail.com>
List: port-cobalt
Date: 11/08/2005 05:20:13
On 11/7/05, Andy Ruhl <acruhl@gmail.com> wrote:
> On 11/5/05, Pete Rushmere <pete@rushmere.org> wrote:
> > Andy,
> >
> >          Same time every night? There's a cron job that runs at 3 every
> > morning and does some disk clean up stuff... amongst other things.
> >
> > Kind Regards,
> > Pete.
> >
> >
> > At 12:58 05/11/2005, Andy Ruhl wrote:
> > >I recently built release-3 for my Qube2. It had a bad disk (I think?)
> > >so I replaced it with an IBM 60 gig drive.
> > >
> > >Ever since I did that, the Qube crashes every night. I'm guessing
> > >while doing updatedb.
> > >
> > >Here's the end of the dmesg:
> > >
> > >dev = 0x600 bno = 26929766 bsize = 16384, size = 16384, fs = /
> > >panic: blkfree: bad size
> > >
> > >I'll see if I can save the panic and get a backtrace from it.
> > >
> > >On searching on this, it seems to be either a bad disk or a bad controller?
>
> Ok, so maybe I spoke too soon earlier when another person on the list
> bashed NetBSD for not being stable :)
>
> I put in another disk, one that I know works pretty good otherwise.
>
> I got another crash today, when I was doing a build of php5 and I was
> ftp'ing some data off the Qube. Here's some output:
>
> db> bt
> r5k_pdcache_wb_range_32+60 (cb72fe20,cb7303e0,5ea,5ea) ra 801da6bc sz 0
> 801da60c+b0 (cb72fe20,cb7303e0,5ea,5ea) ra 0 sz 0
> User-level: pid 13774.1
>
> And the end of the dmesg:
>
> root on wd0a dumps on wd0b
> root file system type: ffs
> trap: TLB miss (load or instr. fetch) in kernel mode
> status=0x2403, cause=0x8008, epc=0x801d50a4, vaddr=0xcb730000
> pid=13774 cmd=ftpd usp=0x7fffcd10 ksp=0xcb75fb08
> db>
>
> I'll see if I can get the crash to go to the dump device and go from there.
>
> Thanks for any help. I'll start searching on this stuff in the morning.

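For anyone reading the trap line above, here is a rough sketch of how the cause and vaddr values can be decoded. It assumes the standard 32-bit MIPS layout (ExcCode in bits 6..2 of the Cause register, interrupt-pending bits in 15..8, mapped kernel space starting at 0xC0000000). The function names are made up for illustration and are not from any NetBSD tool.

```python
# Sketch only: decode the MIPS Cause register value printed in the trap message.
# Assumes the standard layout: ExcCode in bits 6..2, IP (pending) bits in 15..8.
EXC_CODES = {0: "Int", 1: "Mod", 2: "TLBL", 3: "TLBS", 4: "AdEL", 5: "AdES"}


def decode_cause(cause):
    """Return (exception name, interrupt-pending bitmask) for a Cause value."""
    exc = (cause >> 2) & 0x1F
    pending = (cause >> 8) & 0xFF
    return EXC_CODES.get(exc, "other"), pending


def in_mapped_kernel_space(vaddr):
    """True if a 32-bit address lies at or above 0xC0000000 (TLB-mapped kernel)."""
    return (vaddr & 0xFFFFFFFF) >= 0xC0000000


name, pending = decode_cause(0x8008)
print(name, hex(pending), in_mapped_kernel_space(0xCB730000))
# prints: TLBL 0x80 True
```

ExcCode 2 (TLBL) agrees with the "TLB miss (load or instr. fetch)" text in the dmesg, and the faulting address is in TLB-mapped kernel space, which fits "in kernel mode".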
Copying tech-kern this time.

And again:

Here's the bt output:
db> bt
801c9564+214 (89ffe000,0,bc800000,d) ra 801475ec sz 0
panic+190 (89ffe000,d,bc800000,65) ra 800d3fb8 sz 40
800d38b8+700 (89ffe000,d,bc800000,65) ra 0 sz 0
User-level: pid 6.1

And here's the dmesg:

root on wd0a dumps on wd0b
root file system type: ffs
dev = 0x600, bno = 6179277 bsize = 16384, size = 16384, fs = /
panic: blkfree: bad size
db>
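
To make the "bad size" message a bit more concrete, here is a minimal sketch of the sanity check as I understand it from ffs_blkfree() in sys/ufs/ffs/ffs_alloc.c. This is a paraphrase, and the exact form in release-3 may differ. The fragment size does not appear in the output above, so 2048 (8 fragments per 16384-byte block) is an assumption, and the helper name is hypothetical.

```python
# Sketch only: paraphrase of the size/alignment check that precedes
# "panic: blkfree: bad size". Not the kernel code itself.
def blkfree_bad_size(bno, size, bsize, fsize):
    frag = bsize // fsize        # fragments per block (fs_frag)
    fragoff = size % fsize       # leftover bytes past a whole fragment
    fragnum = bno % frag         # fragment index of bno within its block
    numfrags = size // fsize     # fragments being freed
    return size > bsize or fragoff != 0 or fragnum + numfrags > frag


FSIZE = 2048  # ASSUMPTION: fragment size is not shown in the dmesg

# The two block numbers reported in the two panics.
for bno in (26929766, 6179277):
    print(bno, bno % 8, blkfree_bad_size(bno, 16384, 16384, FSIZE))
# prints:
# 26929766 6 True
# 6179277 5 True
```

Under that assumption, both panics are a request to free a whole block starting at a fragment address that is not block-aligned, since size equals bsize in both messages. That would point at a bad block pointer rather than a bad length, though it does not say where the bad pointer came from.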

This is a generic release-3 kernel. The only thing I've changed is some
of the vm.* sysctls, to see how they affect things, based on another
thread I've been following recently. I also have softdeps set.

Also, I notice this because an ssh session no longer responds to
commands, so I log in via the serial console. I have ddb.fromconsole=0
(default is 1) so it doesn't cause a panic (I had that problem
before).

This may be related to some other open bugs, but I'm not sure. I could
try a -current kernel I suppose.

Thanks.

Andy