Subject: Re: MSI 6501 Dual AMD Athlon MP & 1.6 i386 MP kernel
To: None <tech-smp@NetBSD.ORG>
From: MLH <MLH@goathill.org>
List: tech-smp
Date: 09/18/2002 18:53:12
On 18 Sep 2002 09:10:00 -0500, Frank van der Linden wrote:
> On Wed, Sep 18, 2002 at 12:26:56AM +0000, MLH wrote:
>> 2) running cpuburn (burnK7), cpu1 will lock up every time. Sometimes
>> it takes 2 minutes and sometimes it takes 20.  I'm running a single
>> processor kernel right now to see if I can get cpu0 to lock up. I
>> assume that cpu0 is used by the SP kernel - the lm sensor indicates
>> that it is the same one.
> 
> How would you define 'cpu1 locks up', i.e. how have you verified that
> this is indeed the case? Is it something like 'if two processes
> are started, the 2nd one should run on cpu1 but it never does'?

All processes running on cpu1 cease running. No more processes can
be scheduled on it. None of the processes running on it can be
killed with 'kill -9'. All activity associated with that cpu is
dead. If XFree86 was running on it, it is dead. No mouse activity,
no keyboard, no console switch to allow another login. If cpu0 is
running, I can ssh in and look around. I can find no corefile or
any other evidence that cpu1 left that makes me think that it did
anything other than simply stop running, or 'freeze'.  Top shows
that there is no cpu time being spent on any process assigned to
cpu1 and envstat shows the cpu going back to idle temperature.

> Unfortunately it is hard to tell what the problem is here without
> any debugger output. This problem is also not known to me, and
> I can't reproduce it myself (on my dual Athlon Tyan board, used
> for my desktop).

Ok. Can you give me some suggestions? The cpu that dies simply
stops running. There is no coredump, nothing drops into the debugger.
It simply stops - period.  If cpu1 stops, cpu0 is still available,
but I don't know how to interrogate the status of cpu1 with the
debugger, from cpu0. Standard commands that query a process running
on that cpu either simply return, or hang.  For example, say lpd was
running on cpu1 and it stops. I do $ lpc status and all I get back
is a newline.

If cpu0 stops, the entire machine stops. It doesn't respond to
anything. No ping, no keyboard, no nothing. It's dead. It acts
precisely like a heat-related cpu (temporary) failure that I have
seen many times on Athlons that are under-cooled.

However, I don't think this really is heat-related:

If I boot an sp kernel, the box is as solid as can be. Nothing I do
(with software) can make it fail.  I just ran burnK7 on cpu0 (sp
mode) all night long last night and nothing happened.

If running an mp kernel, burnK7 will 'freeze' cpu1 every time,
often in a matter of a few minutes. If XFree86 is running, simply
leaving the xdm login on overnight almost always results in a
'frozen' cpu1 or cpu0 (I assume cpu0 is frozen, because the box is
essentially dead).

> For the X problem, you could try to set some more conservative
> AGP settings in the BIOS, but I'm grasping at straws there.

Already tried everything there. I'm going to try another gfx card
in it, but how does that address the issue where the only terminal
activity is via ssh and cpuburn still manages to lock up the cpu
under conditions which I'm pretty sure are not temperature related?
(I ordered an infrared thermometer Monday to possibly give me one
more point of reference on this, but earlier we looked at this and
didn't see a problem)

> If you think cpu1 may be dodgy, try to rule out hardware failure
> by replacing it with another one. Or exchange the CPUs on one
> board.

We have 6 identical boxes here. Every one that we have installed
NetBSD on has the same exact problem. I'm beginning to believe that
this is not a hardware problem (also, Solaris86 mp runs great on
these boxes with no problems whatsoever).

Might this be an interrupt-related problem with the mp kernel?
Maybe the gfx card isn't properly supported, which would explain
the X problem, and maybe cpu1 is just missing interrupts?

What happens with your mp kernel when you run burnK7 on it (either
on cpu1 or both cpus)?

Thanks