NetBSD-Bugs archive
Re: port-sparc64/51647: deadlock in NFS client with UDP (IPv4) on sparc64
The following reply was made to PR port-sparc64/51647; it has been noted by GNATS.
From: Rin Okuyama <rokuyama%rk.phys.keio.ac.jp@localhost>
To: matthew green <mrg%eterna.com.au@localhost>, Martin Husemann <martin%duskware.de@localhost>,
gnats-bugs%NetBSD.org@localhost
Cc:
Subject: Re: port-sparc64/51647: deadlock in NFS client with UDP (IPv4) on
sparc64
Date: Sun, 27 Nov 2016 16:04:48 +0900
Thank you both for the useful comments.
On 2016/11/25 5:48, matthew green wrote:
> i've never been able to use the gem driver fully stable on sparc64.
> for non-nfs root it would eventually soft hang and need an
> 'ifconfig gem0 down up' to bring it back. for nfs root, this usually
> meant a soft hang i couldn't recover from (soft hang, meaning that i
> could enter ddb.) i've lost what trust in the re(4) hardware i
> had after the 4th or 5th card failed after 3 years of service.
>
> i have successfully used wm(4), hme(4) and, in -current, bge(4) on
> sparc64 and nfs. oh, and le(4) but that was long ago :)
Well, it is possible that both gem(4) and re(4) are broken. To check
this possibility, I ran two additional tests, whose results seem to
contradict each other at first glance:
(1) I installed the same re(4) card into an alpha box. It worked
well with NFS root.
(2) I tried another ethernet adapter, axe(4), on sparc64:
axe0 at uhub0 port 4
axe0: vendor 04bb product 0930, rev 2.00/0.01, addr 2
axe0: Ethernet address xx:xx:xx:xx:xx:xx
ukphy1 at axe0 phy 24: OUI 0x00c08f, model 0x000b, rev. 1
ukphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
This is a gigabit ethernet adapter. However, since it is attached to
ohci(4), its throughput should be severely limited.
Then, I ran "cd /usr/pkgsrc/lang/perl5 && make". Here, too, I observed
an error message:
nfs server cubietruck:/exports/sb100: not responding
But this time, it recovered immediately with the following message:
nfs server cubietruck:/exports/sb100: is alive again
I wonder how to interpret these results. Are the race conditions avoided
just because axe(4) is too slow?
On 2016/11/24 6:30, Martin Husemann wrote:
> We have to look closer at packet traces in the locked up state to find out
> whether the fault is at the server or the client side. It is unlikely that
> this is a sparc64 MD problem (but not impossible).
This is another possibility. I ran tcpdump(8) on the server side:
192.168.10.132.1023 > 192.168.10.128.shilp: NFS request xid 3851049355 1472 write fh 16,0/745343735 8192 (8192) bytes @ 16384 <filesync>
15:51:53.009636 IP (tos 0x0, ttl 64, id 45280, offset 1480, flags [+], proto UDP (17), length 1500)
192.168.10.132 > 192.168.10.128: udp
15:51:53.009671 IP (tos 0x0, ttl 64, id 45280, offset 2960, flags [+], proto UDP (17), length 1500)
192.168.10.132 > 192.168.10.128: udp
15:51:53.009706 IP (tos 0x0, ttl 64, id 45280, offset 4440, flags [+], proto UDP (17), length 1500)
192.168.10.132 > 192.168.10.128: udp
15:51:53.009751 IP (tos 0x0, ttl 64, id 45280, offset 5920, flags [+], proto UDP (17), length 1500)
192.168.10.132 > 192.168.10.128: udp
15:51:53.009765 IP (tos 0x0, ttl 64, id 45280, offset 7400, flags [none], proto UDP (17), length 960)
192.168.10.132 > 192.168.10.128: udp
15:51:53.011171 IP (tos 0x0, ttl 64, id 8800, offset 0, flags [none], proto UDP (17), length 188)
192.168.10.128.shilp > 192.168.10.132.1023: NFS reply xid 3851049355 reply ok 160 write PRE: sz 24576 mtime 1480229513.8255089 ctime 1480229513.8255089 POST: REG 600 ids 0/0 sz 24576 nlink 1 rdev 4095/1048575 fsid 1000 fileid 988a5d a/m/ctime 1480229512.987088165 1480229513.10505140 1480229513.10505140 8192 bytes <filesync>
15:51:53.011590 IP (tos 0x0, ttl 64, id 45281, offset 0, flags [+], proto UDP (17), length 1500)
192.168.10.132.1023 > 192.168.10.128.shilp: NFS request xid 3851049356 1472 write fh 16,0/745343735 4441 (4441) bytes @ 24576 <filesync>
15:51:53.011604 IP (tos 0x0, ttl 64, id 45281, offset 1480, flags [+], proto UDP (17), length 1500)
192.168.10.132 > 192.168.10.128: udp
15:51:53.011635 IP (tos 0x0, ttl 64, id 45281, offset 2960, flags [+], proto UDP (17), length 1500)
192.168.10.132 > 192.168.10.128: udp
15:51:53.011646 IP (tos 0x0, ttl 64, id 45281, offset 4440, flags [none], proto UDP (17), length 172)
192.168.10.132 > 192.168.10.128: udp
15:51:53.074605 IP (tos 0x0, ttl 64, id 45282, offset 0, flags [+], proto UDP (17), length 1500)
192.168.10.132.1023 > 192.168.10.128.shilp: NFS request xid 3851049356 1472 write fh 16,0/745343735 4441 (4441) bytes @ 24576 <filesync>
15:51:53.074620 IP (tos 0x0, ttl 64, id 45282, offset 1480, flags [+], proto UDP (17), length 1500)
192.168.10.132 > 192.168.10.128: udp
15:51:53.074634 IP (tos 0x0, ttl 64, id 45282, offset 2960, flags [+], proto UDP (17), length 1500)
192.168.10.132 > 192.168.10.128: udp
15:51:53.074644 IP (tos 0x0, ttl 64, id 45282, offset 4440, flags [none], proto UDP (17), length 172)
192.168.10.132 > 192.168.10.128: udp
15:51:53.204717 IP (tos 0x0, ttl 64, id 45283, offset 0, flags [+], proto UDP (17), length 1500)
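The fragment offsets and lengths in the trace above can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming a 1500-byte Ethernet MTU and a 20-byte IP header without options. The function name and the UDP payload sizes used in the usage note (8332 and 4584 bytes) are not stated in the trace; the sizes are inferred from the length of the last fragment of each datagram.

```python
IP_HEADER = 20   # bytes; assumes no IP options
UDP_HEADER = 8   # bytes
MTU = 1500       # assumed Ethernet MTU, matching "length 1500" in the trace


def fragments(udp_payload, mtu=MTU):
    """Return (offset, ip_total_length, more_fragments) for each IP fragment."""
    # Bytes carried after the IP header: UDP header plus UDP payload.
    remaining = UDP_HEADER + udp_payload
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    result = []
    offset = 0
    while remaining > 0:
        size = min(per_fragment, remaining)
        remaining -= size
        result.append((offset, size + IP_HEADER, remaining > 0))
        offset += size
    return result


if __name__ == "__main__":
    for payload in (8332, 4584):  # inferred sizes, see the note above
        for offset, length, more in fragments(payload):
            flags = "[+]" if more else "[none]"
            print(f"offset {offset}, flags {flags}, length {length}")
        print()
```

With these assumptions, an 8332-byte payload yields six fragments ending at offset 7400 with length 960, and a 4584-byte payload yields four fragments ending at offset 4440 with length 172. This matches the trace, which suggests that every fragment of both the answered and the unanswered write request reached the server's interface.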
The server responded to requests up to xid 3851049355, but it never
responded to xid 3851049356. How can I investigate further? By inserting
debugging code into the server/client? Which variables should I check then?
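As a first step before adding debugging code, a longer capture could be scanned mechanically for requests that never receive a reply. The following is a minimal sketch, assuming the tcpdump(8) text output has been saved to a file; the function name and the sample lines (abridged from the trace above) are for illustration only.

```python
import re
from collections import Counter

REQUEST = re.compile(r"NFS request xid (\d+)")
REPLY = re.compile(r"NFS reply xid (\d+)")


def unanswered(lines):
    """Map each xid that was requested but never answered to its send count."""
    sent = Counter()
    replied = set()
    for line in lines:
        match = REQUEST.search(line)
        if match:
            sent[match.group(1)] += 1
        match = REPLY.search(line)
        if match:
            replied.add(match.group(1))
    return {xid: count for xid, count in sent.items() if xid not in replied}


if __name__ == "__main__":
    # Sample lines abridged from the trace; a real run would read the
    # saved tcpdump text instead.
    sample = [
        "192.168.10.132.1023 > 192.168.10.128.shilp: NFS request xid 3851049355 1472 write",
        "192.168.10.128.shilp > 192.168.10.132.1023: NFS reply xid 3851049355 reply ok 160 write",
        "192.168.10.132.1023 > 192.168.10.128.shilp: NFS request xid 3851049356 1472 write",
        "192.168.10.132.1023 > 192.168.10.128.shilp: NFS request xid 3851049356 1472 write",
    ]
    for xid, count in unanswered(sample).items():
        print(f"xid {xid}: seen {count} time(s), no reply")
```

A send count greater than one means the capture contains retransmissions of the same request. Running the same scan on captures taken on both the client and the server would show whether a reply is never generated or is lost on the way back.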
Thanks,
Rin