Subject: Re: wi0 going mute?
To: David Young <dyoung@pobox.com>
From: Paul Ripke <stix@stix.homeunix.net>
List: current-users
Date: 01/26/2004 13:20:49
On Monday, Jan 26, 2004, at 07:01 Australia/Sydney, David Young wrote:

> On Sun, Jan 25, 2004 at 11:21:36PM +1100, Paul Ripke wrote:
>> With -current kernel from around 20040119, userland somewhat
>> more ancient (about 200304??), after giving my system some work,
>> like building a release, its wireless interface has a nasty
>> habit of going mute. Packets are still being received, but
>> nothing sent.
>
> I have seen this, too. I think it is a bug in the rate adaptation 
> code. I
> don't see what it would have to do with building a release, unless the
> kernel misses interrupts under those conditions....

I'm guessing it is missing interrupts. Not sure which device is
holding things up, but I know wi0 will drop more pings when I'm
reading/writing from/to a SCSI tape drive. I also notice the
more interrupt load the system is under, the client machine,
a Mac OS X 10.2.x laptop with DWL-650 card running
"WirelessDriver" from sf.net, occasionally the status LED starts
blinking as if its lost carrier.

Is there anything useful I can look at in gdb when the system
gets into this state? Gives me an excuse to upgrade userland.

>> thing I've noticed is that the kernel prints a couple of
>> "wi0: bad idx 6e"
>
> Hmm. This could be an endianness issue. What architecture is this?  I 
> am
> betting that it is big-endian.  0x6e (110) is Prism's representation 
> for
> 11Mbps. The data rate and the index both go into the same 16-bit word,
> but they are u_int8_t's in the wi_frame struct.

Bzzt. Nice theory, but this is on x86. How much do I win? :)
Seriously, would adding some code to dump out more of the frame
in the "bad idx" case help?

Ahah! Just did some more playing... I can now repeat the muting
at will...
Kicking off a "ttcp -s -t -b 65536 -l 65536" over a 100 MBit LAN
(tlp0 interface) resulted in the following pings over wi0 - note
that ttcp only ran for 11 seconds, and the ping was running at
1 second intervals, from laptop to NetBSD box:

64 bytes from 192.168.254.129: icmp_seq=429 ttl=255 time=4.95 ms
64 bytes from 192.168.254.129: icmp_seq=430 ttl=255 time=4.798 ms
64 bytes from 192.168.254.129: icmp_seq=431 ttl=255 time=5.011 ms
64 bytes from 192.168.254.129: icmp_seq=440 ttl=255 time=4.288 ms
64 bytes from 192.168.254.129: icmp_seq=441 ttl=255 time=1005 ms
64 bytes from 192.168.254.129: icmp_seq=442 ttl=255 time=3008.79 ms
64 bytes from 192.168.254.129: icmp_seq=443 ttl=255 time=3008.1 ms
64 bytes from 192.168.254.129: icmp_seq=444 ttl=255 time=3007.73 ms
64 bytes from 192.168.254.129: icmp_seq=445 ttl=255 time=3007.37 ms
64 bytes from 192.168.254.129: icmp_seq=446 ttl=255 time=2877.65 ms
64 bytes from 192.168.254.129: icmp_seq=447 ttl=255 time=2006.51 ms
64 bytes from 192.168.254.129: icmp_seq=448 ttl=255 time=2006.16 ms
64 bytes from 192.168.254.129: icmp_seq=449 ttl=255 time=2874.81 ms
64 bytes from 192.168.254.129: icmp_seq=450 ttl=255 time=2026.97 ms
64 bytes from 192.168.254.129: icmp_seq=451 ttl=255 time=2007.26 ms
64 bytes from 192.168.254.129: icmp_seq=452 ttl=255 time=2006.23 ms
64 bytes from 192.168.254.129: icmp_seq=453 ttl=255 time=2008.94 ms
64 bytes from 192.168.254.129: icmp_seq=454 ttl=255 time=3874.05 ms
64 bytes from 192.168.254.129: icmp_seq=455 ttl=255 time=6010.24 ms
64 bytes from 192.168.254.129: icmp_seq=456 ttl=255 time=6009.22 ms
64 bytes from 192.168.254.129: icmp_seq=457 ttl=255 time=6009.53 ms
64 bytes from 192.168.254.129: icmp_seq=458 ttl=255 time=7009.32 ms
64 bytes from 192.168.254.129: icmp_seq=459 ttl=255 time=7009.93 ms
64 bytes from 192.168.254.129: icmp_seq=460 ttl=255 time=7069.77 ms
64 bytes from 192.168.254.129: icmp_seq=461 ttl=255 time=7009.68 ms
64 bytes from 192.168.254.129: icmp_seq=462 ttl=255 time=6126.11 ms
64 bytes from 192.168.254.129: icmp_seq=463 ttl=255 time=6010.1 ms
64 bytes from 192.168.254.129: icmp_seq=464 ttl=255 time=5123.26 ms
64 bytes from 192.168.254.129: icmp_seq=465 ttl=255 time=5008.62 ms
64 bytes from 192.168.254.129: icmp_seq=466 ttl=255 time=5010.2 ms
64 bytes from 192.168.254.129: icmp_seq=467 ttl=255 time=5009.07 ms
64 bytes from 192.168.254.129: icmp_seq=468 ttl=255 time=7011.13 ms
64 bytes from 192.168.254.129: icmp_seq=469 ttl=255 time=9012.41 ms
64 bytes from 192.168.254.129: icmp_seq=470 ttl=255 time=11100.9 ms
64 bytes from 192.168.254.129: icmp_seq=471 ttl=255 time=11013.4 ms
64 bytes from 192.168.254.129: icmp_seq=472 ttl=255 time=10097.6 ms
64 bytes from 192.168.254.129: icmp_seq=473 ttl=255 time=10024.5 ms
64 bytes from 192.168.254.129: icmp_seq=474 ttl=255 time=9190.56 ms
64 bytes from 192.168.254.129: icmp_seq=475 ttl=255 time=8301.99 ms
64 bytes from 192.168.254.129: icmp_seq=476 ttl=255 time=7479.14 ms
64 bytes from 192.168.254.129: icmp_seq=477 ttl=255 time=6716.77 ms
64 bytes from 192.168.254.129: icmp_seq=478 ttl=255 time=5968.18 ms
64 bytes from 192.168.254.129: icmp_seq=479 ttl=255 time=5090.35 ms
64 bytes from 192.168.254.129: icmp_seq=480 ttl=255 time=5179.41 ms
64 bytes from 192.168.254.129: icmp_seq=481 ttl=255 time=5095.31 ms
64 bytes from 192.168.254.129: icmp_seq=482 ttl=255 time=6033.54 ms
64 bytes from 192.168.254.129: icmp_seq=483 ttl=255 time=7163.45 ms
64 bytes from 192.168.254.129: icmp_seq=484 ttl=255 time=10091.2 ms
64 bytes from 192.168.254.129: icmp_seq=485 ttl=255 time=10080.8 ms
64 bytes from 192.168.254.129: icmp_seq=486 ttl=255 time=11080.9 ms
64 bytes from 192.168.254.129: icmp_seq=487 ttl=255 time=13082.2 ms
64 bytes from 192.168.254.129: icmp_seq=488 ttl=255 time=13489.1 ms
64 bytes from 192.168.254.129: icmp_seq=489 ttl=255 time=13014 ms
64 bytes from 192.168.254.129: icmp_seq=490 ttl=255 time=13039.2 ms
64 bytes from 192.168.254.129: icmp_seq=491 ttl=255 time=13534.8 ms
64 bytes from 192.168.254.129: icmp_seq=492 ttl=255 time=13015.9 ms
64 bytes from 192.168.254.129: icmp_seq=493 ttl=255 time=13013.6 ms
64 bytes from 192.168.254.129: icmp_seq=494 ttl=255 time=13549.4 ms
64 bytes from 192.168.254.129: icmp_seq=495 ttl=255 time=13014.3 ms
64 bytes from 192.168.254.129: icmp_seq=496 ttl=255 time=13218 ms
64 bytes from 192.168.254.129: icmp_seq=497 ttl=255 time=13577.8 ms
64 bytes from 192.168.254.129: icmp_seq=498 ttl=255 time=12827.5 ms

After a while it died completely. Looks like a transmit queue is
building up, that only gets touched on receipt of an interrupt.
Once it gets behind, it has a hard time catching up. Guessing it
is a missed interrupt that starts the landslide. This is all
plain routing too, no bridging. The "build release" may have been
a furphy - it was probably more to do with my using another box
with NFS mounts over the LAN.

BTW: Full dmesg can be had from:
<http://stix.homeunix.net:8080/about/dmesg-stix-pc.txt>
The system is somewhat stretched - not that I'm making excuses
for it, this still shouldn't happen.

Cheers,
--
Paul Ripke
Unix/OpenVMS/TSM/DBA
I love deadlines. I like the whooshing sound they make as they fly by.
-- Douglas Adams