Port-macppc archive


Re: lockups on 6.0.2 - progress?



Unfortunately, my experiment was not successful.  Running my test case
with the rtk card/driver for networking, the machine crashed after a few
hours with the same symptoms.  This suggests that it is NOT the gem
driver (or at least not *just* the gem driver).

I am encouraged by the fact that I can make it fail, but I am at a loss
as to where to go from here.

I don't have the expertise to debug the kernel, but I can run test
cases and report back.  I can also provide accounts on my test machine,
which is on the net (charm.icompute.com).

I've been trying to think of a way to run the test case without *any*
network, but I don't know how useful that would be.  I'll run a test tonight
with the wget script running on the same machine and using localhost.
If it crashes the same way, that would tend to rule out the ethernet
drivers.
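
For reference, the loop I have in mind is roughly the following (the file
names here are placeholders, not the actual files from my setup):

    #!/bin/sh
    # Hammer the loopback interface: fetch a small file and a ~1 MB file
    # from a local httpd over and over, as fast as wget will go.
    while true; do
        wget -q -O /dev/null http://localhost/index.html
        wget -q -O /dev/null http://localhost/onemeg.bin
    done

Writing the downloads to /dev/null keeps a multi-day run from filling
the disk.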

-dgl-

>Hi,
>
>I reported problems with gem(4) on macppc as a bug (kern/46083).
>As the system board is now broken, I can no longer test myself
>(or confirm whether it's something related to the driver or an
>already-failing board; at least it ran without problems with
>NetBSD 5.x and Linux, while on -6 gem(4) was unstable for me).
>
>Maybe both problems are related, even though I didn't see much
>output.  Running makemandb over nfs was enough to break the gem(4)
>connection.  If you think it may be related, it might help to
>combine both bug reports in a single PR (or, if you haven't filed
>a PR so far, to add your experiences to kern/46083).
>
>--
>Regards
>Matthias Kretschmer
>
>
>On Fri, May 31, 2013 at 09:03:42PM -0500, Donald Lee wrote:
>> I have been chasing lockups of NetBSD 6.0.1, and recently tried 6.0.2, and
>> have found that it locks up, too.  My problem is that this is intermittent,
>> so the first task is to find a failing test case.
>> 
>> I have a second machine set up that has hung up 3 times, twice with 6.0.2
>> and once with 6.0.1.  The interesting difference is this in the log:
>> 
>> May 29 13:00:00 charm syslogd[151]: restart
>> May 29 21:52:13 charm /netbsd: arp info overwritten for 71.39.101.62 by 20:76:00:10:7f:14
>> May 30 14:44:08 charm /netbsd: gem0: receive error: RX overflow sc->rxptr 75, complete 82
>> May 30 14:44:12 charm /netbsd: gem0: rx_watchdog: not in overflow state: 0x810400
>> May 30 14:44:12 charm /netbsd: gem0: rx_watchdog: wr pointer != saved
>> May 30 14:44:12 charm /netbsd: gem0: rx_watchdog: rd pointer != saved
>> May 30 14:44:12 charm /netbsd: gem0: resetting anyway
>> May 30 15:01:45 charm /netbsd: gem0: receive error: RX overflow sc->rxptr 20, complete 30
>> May 30 15:01:49 charm /netbsd: gem0: rx_watchdog: not in overflow state: 0x810400
>> May 30 15:01:49 charm /netbsd: gem0: rx_watchdog: wr pointer != saved
>> May 30 15:01:49 charm /netbsd: gem0: rx_watchdog: rd pointer != saved
>> May 30 15:01:49 charm /netbsd: gem0: resetting anyway
>> May 30 18:15:30 charm /netbsd: gem0: receive error: RX overflow sc->rxptr 58, complete 70
>> May 30 18:15:34 charm /netbsd: gem0: rx_watchdog: not in overflow state: 0x810400
>> May 30 18:15:34 charm /netbsd: gem0: rx_watchdog: wr pointer != saved
>> May 30 18:15:34 charm /netbsd: gem0: rx_watchdog: rd pointer != saved
>> May 30 18:15:34 charm /netbsd: gem0: resetting anyway
>> May 31 20:51:35 charm syslogd[151]: restart
>> 
>> 
>> I take this as a clue, and I am going to put in a PCI ethernet card (SMC)
>> and see if that behaves differently.
>> 
>> Note that this message (the "watchdog" thing with the reset) is new in 6.0.2,
>> so I'm guessing that someone changed the gem driver - just a guess....
>> 
>> I'll report back.
>> 
>> It takes a day or two or three for the failure to occur.  I originally
>> thought it was a failure that happened under heavy disk load, but it
>> turns out that, at least with the last couple of failures, it happens
>> on an almost idle machine.  The only "load" I have on it is a script
>> that does two wget's in a loop.  One wget is of a small index file,
>> and the other is of a 1 Meg file.  It does the wgets as fast as it
>> can.  It seems to cause the problem in a couple of days.
>> 
>> I have now swapped in the SMC ethernet card.  Let's see if it still fails.
>> If not, then I have a workaround, and we have a possible driver bug to
>> fix.
>> 
>> -dgl-


