Subject: Re: Diagnosing reboots
To: Christopher W. Richardson <cwr@nexthop.com>
From: Onno Ebbinge <onno.ebbinge@gmail.com>
List: netbsd-help
Date: 09/13/2005 10:03:46
Hi Christopher,

Spontaneous or mysterious reboots can sometimes be caused by
bad memory. Could you run the memtest tool from
http://www.memtest.org/ (overnight, if possible)? This memtester
detects more memory problems than the other memtesters I have
used. Besides bit errors, it also detects latency problems; it
is very thorough.

Good luck,
Onno


On 9/13/05, Christopher W. Richardson <cwr@nexthop.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hey folks,
>
> Sorry for such a potentially basic user question, but it happens
> so infrequently that I'm at a loss for basic admin skills.  How
> do I go about diagnosing the reason for a machine rebooting?
>
> I came in to my office this morning to find this on my
> workstation:
>
> cwr@achilles#uptime
>  7:55PM  up 15:54, 2 users, load averages: 0.20, 0.15, 0.10
>
> OK, this morning it was less than 15 hours uptime, but, you get
> the idea.  I have no idea what caused the workstation to reboot.
> The end of the reboot shows:
>
> Sep 12 04:03:31 achilles /netbsd: root file system type: ffs
> Sep 12 04:03:31 achilles savecore: no core dump
>
> and the beginning shows:
>
> Sep 11 21:00:10 achilles syslogd: restart
> Sep 12 04:03:31 achilles syslogd: restart
> Sep 12 04:03:31 achilles /netbsd: NetBSD 2.0.2_STABLE (ACHILLES) #10: Sun Sep  4 13:06:05 EDT 2005
> Sep 12 04:03:31 achilles /netbsd: cwr@achilles:/usr/localhome2/obj/sys/arch/i386/compile/ACHILLES
> Sep 12 04:03:31 achilles /netbsd: total memory = 254 MB
> Sep 12 04:03:31 achilles /netbsd: avail memory = 245 MB
>
> So it appears that it neither dumped core nor logged a reason for
> the reboot.  The authlog shows:
>
> Sep 11 20:27:47 achilles sshd[11619]: Accepted password for cwr from 192.168.10.29 port 3058 ssh2
> Sep 12 04:03:29 achilles sshd[433]: Server listening on :: port 22.
> Sep 12 04:03:29 achilles sshd[433]: Server listening on 0.0.0.0 port 22.
> Sep 12 12:01:45 achilles sshd[785]: Accepted password for cwr from 65.241.132.123 port 1174 ssh2
>
> Which appears to indicate that no one became root anywhere near
> the time of the reboot (and a more thorough search of the log
> confirms that no one had done so within days, other than me).
>
> What's the proper way to go about diagnosing this (oh, and please
> cc me, as I'm not on this list)?
>
> Thanks,
> Chris
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2 (NetBSD)
> Comment: Processed by Mailcrypt 3.5.8 <http://mailcrypt.sourceforge.net/>
>
> iD8DBQFDJhitP65RBOOHTzERApBWAJ0bcVMSMTBLmVMAcjSmzX7bNdMknQCdFMbW
> S9fQnWbRxNLbQ1HSMrkBP+U=
> =C3fC
> -----END PGP SIGNATURE-----
>