Subject: Re: I noticed that NetBSD 1.1....
To: The Unseen <dickens@itd.nrl.navy.mil>
From: Steve Weiss <srw@hvcn.org>
List: port-mac68k
Date: 02/01/1996 02:49:31
At 6:31 AM 1/30/96, The Unseen wrote:
>locks up after a few hours of use.  usually when I'm running X but
>I think it getting to be where it random.
>
>
>I was wondering if anyone else had noticed this?
>
>
>My machine:
>        Mac IIx
>        20MB RAM
>        80MB internal disk
>        100MB ZIP external
>        Apple standard video board
>        Asante Ethernet
>
>I get messsages from the ethernet card saying "ae0 timeout, recovered" as
>well as messages from the disk geometry of the ZIP informing me that
>netbsd will use the ficticious(sp?) geometry.  I will be upgrading the
>internal disk to 1G and kill the ZIP.  I also have an Apple Ethernet card
>which I know works with NetBSD 1.0 to try....
>
>any help is welcome....
>
>
>Ian
>
>E-mail: dickens@itd.nrl.navy.mil

This sounds a lot like the problem I am having. I reported it and Allen
Briggs wrote me last week to say that he had seen it before, but had not been
able to reproduce the problem. The good/bad news is that I can reproduce
this problem reliably in less than 5 minutes every time. I think the disk
geometry is a red herring.

The MTTF is a function of the network traffic. My network is very chatty,
with multiple MS Mail servers and Novell file servers talking IPX, and also
a bunch of desktops and Unix machines using IP, including a new
ISDN-connected firewall to the Internet that has added a lot of web traffic
to the IP. To add insult to this injury, there is a chatty VAX cluster on
the segment, too.   :-(

Anyway, my MTTF is about 3 minutes, regardless of activity on the machine.
It does not seem to matter that the network traffic is not aimed at my Mac.
With the network unplugged from the interface there is no failure. The
system works just fine up until the failure. The triggering of the failure
does appear to me to be random.

This has given me the opportunity to try a lot of different things in the
kernel debugger to learn what I could about this. I have been using the
programmer's switch (aka panic button) liberally, and I have discovered
some things that may be of interest.

As a work-around (no, better call this a work-through) try this. When it
locks up, hit the panic button, which drops into the kernel debugger. Step
(using "s") out of _Debugger and then out of _nmihand and then out of
_lev7intr. You will find yourself at the beginning of "_rei". Type "c" to
continue at this point. If you are experiencing what I am, your machine
will spring back to life with a few log messages thrown in. Mine lasts
another 3 minutes; yours may last an hour. Repeat upon hang as necessary
until done.

Not elegant, no. Useless for me, but you may find it handy.

--

The theory on this bug is that the nubus interrupts are not being processed
efficiently enough, and that the network traffic is "wedging" the cpu. I
think this theory is on the right track, but it needs some adjustment. If
it were "wedging" the cpu in this way, why doesn't it come out of the hang
when the network traffic inevitably declines? Trust me, it does not.

I have taken to reading the source for the ae driver, which is in
mac68k/dev/if_ae.c. I believe that the cpu is spinning in the _aeintr
routine, including going into _aestart and back out, but for some reason
never clearing the last packet out of the ring buffer. I'm convinced it is
a software bug in the driver, but don't understand the hardware and its
needs well enough to do much more than be the eyes of someone who would be
able to resolve the problem.
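
To make that suspicion concrete, here is a toy model in Python. It is not
the code in if_ae.c, and I am not claiming the real driver or the hardware
works this way. The names (drain, good_advance, stuck_on_last) and the "bad"
slot are all made up for illustration. It only shows that a service loop
which never consumes one particular packet keeps spinning even when no new
packets arrive, which would fit a hang that outlasts the traffic. It says
nothing about why the panic button breaks the cycle.

```python
def drain(ring, read, write, advance, max_steps=1000):
    """Service packets until `read` catches up with `write`.

    Gives up after max_steps so this toy always terminates.
    Returns (steps_taken, finished).
    """
    steps = 0
    while read != write:
        if steps >= max_steps:
            return steps, False      # still spinning: the "hang"
        read = advance(ring, read)
        steps += 1
    return steps, True


def good_advance(ring, read):
    # Always consumes the packet and moves to the next slot.
    return (read + 1) % len(ring)


def stuck_on_last(ring, read):
    # Hypothetical bug: a slot marked "bad" is never consumed,
    # so the read position never moves past it.
    if ring[read] == "bad":
        return read
    return (read + 1) % len(ring)


# No new traffic arrives here: `write` stays fixed at 3 the whole time.
ring = ["ok", "ok", "bad", "ok"]
print(drain(ring, 0, 3, good_advance))    # finishes after 3 steps
print(drain(ring, 0, 3, stuck_on_last))   # hits the step limit
```

In the toy, the only thing that ends the second call is the artificial step
limit.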

I have repeatedly stepped through _rei to come to the actual spot in the
code where I believe the code was running when I hit the panic button. This
spot is distributed in my samples from _aeintr+3e through _aeintr+158, as
well as appearing in various parts of _aestart as called at _aeintr+114. It
is never from anyplace else. In fact it is only found from _aeintr+3e to
+5a, and from +d0 to +d6, and from +112 to +158 and in various parts of
_aestart. I took over 25 data points.
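
For anyone who wants to collect the same kind of samples, a small Python
sketch for tallying them follows. The three _aeintr ranges are the ones
listed above; the sample values themselves are invented for illustration
and are not my actual data points, and the function names are just
placeholders.

```python
# Offset ranges within _aeintr (hex, inclusive) where samples were seen.
AEINTR_RANGES = [(0x3E, 0x5A), (0xD0, 0xD6), (0x112, 0x158)]


def classify(symbol, offset):
    """Label one sampled location given as (symbol, offset)."""
    if symbol == "_aestart":
        return "_aestart"
    if symbol == "_aeintr":
        for lo, hi in AEINTR_RANGES:
            if lo <= offset <= hi:
                return f"_aeintr+{lo:x}..+{hi:x}"
    return "elsewhere"


def tally(samples):
    """Count how many samples fall under each label."""
    counts = {}
    for symbol, offset in samples:
        label = classify(symbol, offset)
        counts[label] = counts.get(label, 0) + 1
    return counts


# Hypothetical samples, NOT the real data points.
samples = [
    ("_aeintr", 0x3E),
    ("_aeintr", 0xD2),
    ("_aeintr", 0x114),
    ("_aestart", 0x20),
    ("_aeintr", 0x158),
]
print(tally(samples))
```

Any sample that lands under "elsewhere" would be evidence against the
spinning-in-_aeintr idea.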

This is consistent with the theory that it is wedging the cpu, but the
question I still ask is: why is this permanent? And why does the
introduction of a panic button interrupt break out of the cycle? Answering
these questions correctly will lead to a solution, IMO.

---

My machine: IIci, 8MB RAM
            80MB internal SCSI HD: 59MB Root+Usr + 16MB swap + 5MB MacOS
            Internal video
            Apple Ethernet 32k (or maybe 64k) interface board.

Symptoms are the same whether I run a late December kernel, or the
"for.kevin" kernel in Allen's outgoing directory on Puma that is supposed
to have slightly more efficient nubus interrupts.

-srw
_________________________________________________________________________
 Steven R. Weiss   |    srw@hvcn.org      | If you think education is
Computer Scientist | (313) 995-8250 x5632 |  expensive, try ignorance.