Re: lockups on 6.0.2 - progress?

To: port-macppc%NetBSD.org@localhost
Subject: Re: lockups on 6.0.2 - progress?
From: Donald Lee <MacPPC2%c.icompute.com@localhost>
Date: Fri, 7 Jun 2013 14:31:48 -0500
Following up again.....

Running my wget script to localhost instead of to ethernet produces a real
failure.  It looks to me like the 'infamous' tstile problem.  I can break into
the kernel debugger, and do a ps, and lots of processes are waiting on
"tstile".
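
For what it's worth, the same view can probably be captured from userland while the machine is still partly alive. Below is a minimal sketch, not something I have run on charm: it assumes ps(1) accepts the pid, wchan and comm keywords, and the function names, log path and interval are all made up for illustration. If ps itself ends up stuck on tstile the loop will stall too, but the last snapshot written to disk should still show who was waiting.

```shell
#!/bin/sh
# Sketch: log which processes are sleeping on "tstile" so that some
# evidence is on disk if the machine later hangs.

# Filter "PID WCHAN COMMAND" lines on stdin, keeping only the rows whose
# wait channel is exactly "tstile" (the header row is dropped as well).
filter_tstile() {
    awk '$2 == "tstile" { print }'
}

# Take one snapshot: a timestamp plus every tstile waiter, appended to $1.
# Assumption: ps(1) supports -ax and the pid, wchan and comm keywords.
snapshot_tstile() {
    logfile=$1
    {
        date
        ps -ax -o pid,wchan,comm | filter_tstile
    } >> "$logfile"
    sync    # push the log to disk before a possible hang
}

# Example loop (hypothetical path and interval):
# while true; do snapshot_tstile /var/tmp/tstile.log; sleep 10; done
```

A timestamp followed by no process lines means nothing was waiting on tstile at that moment; the first snapshot that lists processes narrows down when the pile-up began.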

It's not the same failure, but it is a failure.  I like to think that they may
be related.  I see some progress in that I think I can demonstrate this
over localhost, which eliminates any bugs/issues related to drivers
or network hardware.

Anyone have suggestions on how I might collect more information that might
chase this to ground?

Thanks,

-dgl-

At 4:20 PM -0500 6/6/13, Donald Lee wrote:
>My environment:
>
>PowerMac G4, 896 MBytes mem.  2 ATA disks on the internal bus/ribbon.
>NetBSD 6.0.2.  Standard kernel.
>Machine name: charm.icompute.com charm
>
>I changed my test case so that the wget script runs on charm.  I disabled
>the network interfaces and set up apache to listen on localhost.
>
>The script looks like this:
>
>---
>#!/bin/ksh
>
>set -e
>
>while true ; do
>        date
>        wget -t 1 -T 8 -q -a logfile -O index.html http://127.0.0.1
>        echo -n  index
>        wget -t 1 -T 8 -q -a logfile -O text.txt http://127.0.0.1/text.txt
>        echo text
>done
>---
>
>When I run it, in a few hours the machine hangs.  It's not the hard hang
>I get when I run the script on another machine, but it is a hang.
>Without a network, I can't ping it, or ssh/telnet to it and run multiple
>windows.  All I know is that ctrl-c does not produce a prompt from the shell,
>and the script does not fail (timeout), but stops producing output.
>
>Unlike when the script runs on another machine, the keyboard does
>echo chars to the screen, but that's it.  (I think only return chars are
>echoed.... have to check next time it fails.)
>
>I've tried leaving different things running while the test is active,
>and writing output to a file.  tail -f, top, systat - all behave the same.
>When the hang comes, no response and no new shell prompt.
>
>Bottom line
>-=--=-=-=-=-=
>
>I've eliminated the network cards.  *IF* this is the same problem, it
>looks like it's not in the drivers.
>
>-dgl-
>
>>Hi,
>>
>>I reported problems with gem(4) on macppc as a bug (kern/46083).
>>As the system board is now broken, I can no longer test myself
>>(or confirm that it's something related to the driver or an
>>already broken board;  at least it was running with NetBSD 5.x
>>and Linux without problems while on -6 I had an unstable gem(4)).
>>
>>Maybe both problems are related, even though I didn't see much
>>output.  Running makemandb over nfs was enough to break gem(4)
>>connection.  If you think it may be related, it might help
>>to combine both bug-reports in a single PR (or if you haven't
>>added any PR so far, add your experiences to the kern/46083).
>>
>>--
>>Regards
>>Matthias Kretschmer
>>
>>
>>On Fri, May 31, 2013 at 09:03:42PM -0500, Donald Lee wrote:
>>> I have been chasing lockups of NetBSD 6.0.1, and recently tried 6.0.2, and
>>> have found that it locks up, too.  My problem is that this is intermittent,
>>> so the first task is to find a failing test case.
>>> 
>>> I have a second machine set up that has hung up 3 times, twice with 6.0.2,
>>> and once with 6.0.1.  The interesting difference is this in the log:
>>> 
>>> May 29 13:00:00 charm syslogd[151]: restart
>>> May 29 21:52:13 charm /netbsd: arp info overwritten for 71.39.101.62 by 20:76:00:10:7f:14
>>> May 30 14:44:08 charm /netbsd: gem0: receive error: RX overflow sc->rxptr 75, complete 82
>>> May 30 14:44:12 charm /netbsd: gem0: rx_watchdog: not in overflow state: 0x810400
>>> May 30 14:44:12 charm /netbsd: gem0: rx_watchdog: wr pointer != saved
>>> May 30 14:44:12 charm /netbsd: gem0: rx_watchdog: rd pointer != saved
>>> May 30 14:44:12 charm /netbsd: gem0: resetting anyway
>>> May 30 15:01:45 charm /netbsd: gem0: receive error: RX overflow sc->rxptr 20, complete 30
>>> May 30 15:01:49 charm /netbsd: gem0: rx_watchdog: not in overflow state: 0x810400
>>> May 30 15:01:49 charm /netbsd: gem0: rx_watchdog: wr pointer != saved
>>> May 30 15:01:49 charm /netbsd: gem0: rx_watchdog: rd pointer != saved
>>> May 30 15:01:49 charm /netbsd: gem0: resetting anyway
>>> May 30 18:15:30 charm /netbsd: gem0: receive error: RX overflow sc->rxptr 58, complete 70
>>> May 30 18:15:34 charm /netbsd: gem0: rx_watchdog: not in overflow state: 0x810400
>>> May 30 18:15:34 charm /netbsd: gem0: rx_watchdog: wr pointer != saved
>>> May 30 18:15:34 charm /netbsd: gem0: rx_watchdog: rd pointer != saved
>>> May 30 18:15:34 charm /netbsd: gem0: resetting anyway
>>> May 31 20:51:35 charm syslogd[151]: restart
>>> 
>>> 
>>> I take this as a clue, and I am going to put in a PCI ethernet card, (SMC)
>>> and see if that behaves differently.
>>> 
>>> Note that this message, the "watchdog" thing with the reset, is new in 6.0.2,
>>> so I'm guessing that someone changed the gem driver - just a guess....
>>> 
>>> I'll report back.
>>> 
>>> It takes a day or two or three for the failure to occur.  I originally
>>> thought it was a failure that happened under heavy disk load, but it
>>> turns out that at least with the last couple of failures, it happens
>>> on an almost idle machine.  The only "load" I have on it is a script that
>>> does two wget's in a loop.  One wget is of a small index file, and the
>>> other is of a 1 Meg file.  It does the wgets as fast as it can.  It seems
>>> to cause the problem in a couple of days.
>>> 
>>> I have now swapped in the SMC ethernet card.  Let's see if it still fails.
>>> If not, then I have a workaround, and we have a possible driver bug to
>>> fix.
>>> 
>>> -dgl-