tech-net: Strange problem with bridge(4) on -CURRENT/sparc64

Subject: Strange problem with bridge(4) on -CURRENT/sparc64
To: None <port-sparc64@netbsd.org, tech-net@netbsd.org>
From: Jed Davis <jldavis+netbsdlist@cs.oberlin.edu>
List: tech-net
Date: 02/01/2003 04:51:13
I thought I'd let (possibly) relevant lists take a look at this before
sending it PR...

The uname: NetBSD rubisco 1.6M NetBSD 1.6M (RUBISCO) #2: Sun Jan 26
23:48:41 EST 2003 jdev@rubisco:/src/syssrc/compile/RUBISCO sparc64
(Although these also manifested with a 1.6K kernel)

The devices:
sbus0 at mainbus0 addr 0xfffcc000: clock = 25 MHz
ledma0 at sbus0 slot 14 offset 0x8400010: dma rev 2
le0 at ledma0 slot 14 offset 0x8c00000 vector 21 ipl 6: address 08:00:20:82:5c:42
le0: 8 receive buffers, 2 transmit buffers
qec0 at sbus0 slot 0 offset 0x20000 vector 4 ipl 2: 128K memory
qe[0-3] at qec0 slot 0 offset 0x0 rev 1 address 0a:00:20:82:5c:4[0-3]
(I've a local modification to change the address of the qe's, but
bad things still happen with it taken out.)


Problem 1: Using bridge0 to connect le0 and qe0 (the eventual hope being
to get a wi0 and use that), certain kinds of traffic cause this message:

missing buffer, no_td = X, last_td = Y

(where X and Y are each taken from the set {0,1}, and I've seen all four
combinations at various points), and then le0 winds up with the OACTIVE
flag permanently on and won't transmit (although it'll receive fine)
until the interface is taken down and up.

Traffic that can evoke this: occasionally TCP would do it, but a much
more reliable way was to do "ping -s 3000" from one side of the bridge
to the other; in about 15-16 pings the problem would surface, regardless
of whether -f was also used. "ping -s 2900" did not cause problems,
and the bridge would pass a respectable fraction of them even with
flooding enabled.  I tracked the message to this piece of code in
sys/dev/ic/am7990.c, rev 1.64:

    517                 if (tmd.tmd1_bits & LE_T1_OWN) {
    518                         ifp->if_flags |= IFF_OACTIVE;
    519                         printf("missing buffer, no_td = %d, last_td = %d\n",
    520                             sc->sc_no_td, sc->sc_last_td);
    521                 }

Commenting out the setting of IFF_OACTIVE (I figured the worst I could
do is cause a panic) caused the interface to keep working through a few
instances, with a few "le0: device timeout" thrown in, but eventually it
would get stuck and have to be reset, as above.

Kernel options DEBUG, DIAGNOSTIC, LOCKDEBUG don't give any more info.


Problem 2:  So why not bridge the qe's together and see what happens
there?  Works fine for a while, but after around a day of listening
to assorted local broadcast/multicast traffic, Things Get Strange.  I
notice this first as terminals acting oddly; I have to hit return
twice to get a command noticed, there's an extra newline between shell
prompts... it's very strange, it didn't happen before I configured
bridge0, and it doesn't happen if no bridged interface is connected
to anything.  I can't reproduce it easily, so it's a little hard to
examine; in particular I haven't yet ruled out as a factor the presence
of qe's that are up but not connected to anything.  Disabling the
LEARNING flag on the interface connected to the world doesn't help.
Now, something useful: the machine, from this state, hangs on attempts
to reboot or halt or go to single-user.  Dropping to ddb with a serial
break and giving the command "sync" gets me this (where qe2 is the
interface connected to the world):

db> sync
syncing disks... qe2: rx packet dropped
qe2: rx packet dropped
panic: ltsleep: l_stat 3 != LSONPROC
kdb breakpoint at 12116c8
Stopped in pid 0.1 (swapper) at cpu_Debugger+0x4:       nop
db> trace
ltsleep(9a22694, 11, 1261458, 0, 0, 5) at ltsleep+0x1a0
drain_output(9a22630, 1, 33d, 1e28800, 0, 1) at drain_output+0x5c
softdep_fsync_mountdev(9a22630, 30002, 0, 12744c0, 0, 1) at softdep_fsync_mountd
ev+0xc0
ffs_full_fsync(e0016e40, 1183654, 100, 1e28600, 11e298c, 1) at ffs_full_fsync+0x
38
ffs_fsync(e0016e40, 11003ec, 98, 1e28600, 1811800, 8dfa800) at ffs_fsync+0x34
VOP_FSYNC(9a22630, 1e3bf80, 0, 0, 0, 188ea88) at VOP_FSYNC+0x38
ffs_sync(0, 2, 1811800, 0, 1ec3800, 1e3bf80) at ffs_sync+0x1f4
sys_sync(188ea88, 0, 0, 18152a0, e00171f0, e0017138) at sys_sync+0xa8
vfs_shutdown(188ed58, 180e800, 1815800, 180e400, 1828a30, 180e400) at vfs_shutdo
wn+0x58
cpu_reboot(100, 0, 1828a10, 1828800, e00173a8, 0) at cpu_reboot+0x58
db_sync_cmd(12116cc, 0, 0, e0017330, 111c224, 188d2b8) at db_sync_cmd+0x10
db_command(180e4d0, 0, 188d000, 12116cc, 188d000, 188d000) at db_command+0x274
db_command_loop(12116cc, 12116cc, 188d000, 0, 0, 0) at db_command_loop+0x9c
db_trap(0, 0, 0, 0, 1800000, 1000000) at db_trap+0x194
kdb_trap(101, e0017920, 123db60, 0, ffffffffffffffff, badcafe) at kdb_trap+0x18c

trap(e0017920, 101, 12116c8, 44820006, 9ab, fffffffffff7f60f) at trap+0x238
Lslowtrap_reenter(1816800, 0, 6e0, 2, 18157c8, 18157c8) at Lslowtrap_reenter+0x7
c
zs_abort(0, 100, 1829930, 55, ffff, 1809800) at zs_abort+0x38
zstty_stint(1e590c0, 0, 11e6710, 188f018, 11e298c, 84a2a00c00000015) at zstty_st
int+0x84
zsc_intr_hard(8, 0, 39a, 188ec00, 188f000, 472a33) at zsc_intr_hard+0x64
zshard(1e59000, 11e4420, fffc3c18, ab2, ffff, 1809800) at zshard+0x1c
intr_list_handler(1e63e40, 0, e0017ed0, badcafe, 12016ec, badcafe) at intr_list_
handler+0x1c
sparc_intr_retry(0, 0, 123db60, 0, ffffffffffffffff, badcafe) at sparc_intr_retr
y+0x48
idle(9813800, 0, 18985f8, 7, 9ab, fffffffffff7f60f) at idle
mi_switch(18998f8, 0, 0, 9b7aa90, badcafe, badcafe) at mi_switch+0x208
ltsleep(0, 128, 126d1d8, 0, 0, 0) at ltsleep+0x350
sigsuspend1(126d000, 9bafc40, ffffffffffffd820, 9bac000, 0, 0) at sigsuspend1+0x
d4
sys___sigsuspend14(9813800, 9bafdd0, 9bafdc0, 11463b0, 0, 13fe88) at sys___sigsu
spend14+0x34
syscall(9bafed0, 126, 1147c00, 4046a9a0, 1815800, 13ff28) at syscall+0x8b4
syscall_setup(ffffffffffffd820, 2000, 0, ffffffffffffd820, ffffffffffffd82c, 2bf
448) at syscall_setup+0x140
db> sync
Frame pointer is at 0xe0015691
Call traceback:
12025e0(f, 180e800, 1815800, 180e400, 1828a30, 180e400, e0015751) fp = e0015751
111c234(100, 0, 1828a10, 1828800, 1157fe0, 0, e0015811) fp = e0015811
111bda0(12116cc, 0, ffffffffffffffff, e00161a0, 111c224, 188d2b8, e00158d1) fp = e00158d1
111b8d8(180e4d0, 0, 188d000, 12116cc, 188d000, 188d000, e0015a31) fp = e0015a31
1120268(12116cc, 12116cc, 188d000, 188ecf1, 188ea88, 0, e0015b11) fp = e0015b11
121153c(0, 0, 0, 0, 1800000, 1000000, e0015be1) fp = e0015be1
120d398(101, e0016790, 0, 0, 11e298c, fffffffffffbf870, e0015ca1) fp = e0015ca1
10093fc(e0016790, 101, 12116c8, 820016, e0016948, 0, e0015ee1) fp = e0015ee1
11579d4(180e800, 1, e0006000, 1894800, 1ec3800, 1e3bf80, e00160c1) fp = e00160c1

dumping to dev 7,1 offset 78333
dump panic: kernel diagnostic assertion "_simple_lock_held((&sched_lock)) == 0" failed: file "/usr/syssrc/sys/ke
rn/kern_synch.c", line 657
kdb breakpoint at 12116c8
Stopped in pid 0.1 (swapper) at cpu_Debugger+0x4:       nop
db> trace
__assert(126d7b8, 126d7d8, 291, 126d8c0, 0, 1f9c303) at __assert+0x18
wakeup(1898860, 1898970, 1898970, 1ebec50, 1e90220, fffffffffffbfebe) at wakeup+
0x34
uvm_aio_biodone(1ebec30, 215, 11d2970, 1000, 1f9dfc0, 0) at uvm_aio_biodone+0x54

biodone(1ebec30, 11e9af8, 0, 11f2cd0, 8, 12031d4) at biodone+0xe4
scsipi_complete(0, 22009, 22009, 6, 1eae000, 19) at scsipi_complete+0x490
scsipi_done(1f920f0, f80d, 1e34300, 1e8d400, 1eae000, 4) at scsipi_done+0x13c
ncr53c9x_done(1e8d800, 1f9cba0, 1087528, 11db3f4, 100, fffffffffffbf081) at ncr5
3c9x_done+0x1c4
ncr53c9x_intr(3, 11db464, 1814348, 6, fffffffffffffffb, 188d000) at ncr53c9x_int
r+0x12b8
ncr53c9x_poll(1e8d800, 185d8e0, 26d2, ffffffff21f86763, 0, 1f9c303) at ncr53c9x_
poll+0x2c
ncr53c9x_scsipi_request(1f9c2e0, f, 185d8e0, 1084088, e0015ba0, 200) at ncr53c9x
_scsipi_request+0x144
sddump(185d8e0, 3b632d, e0015c70, 1, 60, 296000) at sddump+0x1d0
pmap_dumpmmu(0, 131fd, e0015e70, 189dc00, 131fd, 0) at pmap_dumpmmu+0x130
dumpsys(185d800, 180e800, 11f3910, 180e400, 1828a30, 180e400) at dumpsys+0x160
cpu_reboot(100, 0, 1828a10, 1828800, 1157fe0, 0) at cpu_reboot+0x9c
db_sync_cmd(12116cc, 0, ffffffffffffffff, e00161a0, 111c224, 188d2b8) at db_sync
_cmd+0x10
db_command(180e4d0, 0, 188d000, 12116cc, 188d000, 188d000) at db_command+0x274
db_command_loop(12116cc, 12116cc, 188d000, 188ecf1, 188ea88, 0) at db_command_lo
op+0x9c
db_trap(0, 0, 0, 0, 1800000, 1000000) at db_trap+0x194
kdb_trap(101, e0016790, 0, 0, 11e298c, fffffffffffbf870) at kdb_trap+0x18c
trap(e0016790, 101, 12116c8, 820016, e0016948, 0) at trap+0x238
Lslowtrap_reenter(180e800, 1, e0006000, 1894800, 1ec3800, 1e3bf80) at Lslowtrap_
reenter+0x7c
panic(126d888, 1894800, 1894800, 0, 9a22630, 0) at panic+0xf4
[and then another copy of the above trace]

Further attempts to sync are similarly unsuccessful.  Which means, I
believe, that I can't get a kernel core.  So, at this point, I'm lost.
Any insight?

--Jed

-- 
<?xml version="1.0"?>  <?xml-stylesheet href="http://panix.com/~jdev/xs/txt.xsl"
type="text/xsl"?>   <sig name="Jed Davis">  <id dom="oberlin.edu" lp="sjld8197">
Student, 4th-Year</id><id dom="cs.oberlin.edu" lp="jldavis">CS Major and Student
SysAdmin</id><id dom="panix.com" lp="jdev">Panixer</id> <q href="bin.q"/> </sig>