NetBSD-Bugs archive


Re: port-sparc64/51647: deadlock in NFS client with UDP (IPv4) on sparc64



The following reply was made to PR port-sparc64/51647; it has been noted by GNATS.

From: Rin Okuyama <rokuyama%rk.phys.keio.ac.jp@localhost>
To: matthew green <mrg%eterna.com.au@localhost>, Martin Husemann <martin%duskware.de@localhost>,
 gnats-bugs%NetBSD.org@localhost
Cc: 
Subject: Re: port-sparc64/51647: deadlock in NFS client with UDP (IPv4) on
 sparc64
Date: Sun, 27 Nov 2016 16:04:48 +0900

 Thank you both for the useful comments.
 
 On 2016/11/25 5:48, matthew green wrote:
 > i've never been able to use the gem driver fully stable on sparc64.
 > for non-nfs root it would eventually soft hang and need an
 > 'ifconfig gem0 down up' to bring it back.  for nfs root, this usually
 > meant a soft hang i couldn't recover from (soft hang, meaning that i
 > could enter ddb.)  i've lost what trust in the re(4) hardware i
 > had after the 4th or 5th card failed after 3 years of service.
 >
 > i have successfully used wm(4), hme(4) and, in -current, bge(4) on
 > sparc64 and nfs.  oh, and le(4) but that was long ago :)
 
 Well, it is possible that both gem(4) and re(4) are broken. To check
 this possibility, I ran two additional tests, whose results seem to
 contradict each other at first glance:
 
 (1) I installed the same re(4) card in an alpha box, where it worked
 well with NFS root.
 
 (2) I tried another ethernet adapter, axe(4), on sparc64:
 
    axe0 at uhub0 port 4
    axe0: vendor 04bb product 0930, rev 2.00/0.01, addr 2
    axe0: Ethernet address xx:xx:xx:xx:xx:xx
    ukphy1 at axe0 phy 24: OUI 0x00c08f, model 0x000b, rev. 1
    ukphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
 
 This is a gigabit ethernet adapter. However, since it is attached to
 ohci(4), its throughput should be severely limited.
 
 Then, I ran "cd /usr/pkgsrc/lang/perl5 && make". I again observed an
 error message:
 
    nfs server cubietruck:/exports/sb100: not responding
 
 But this time, it recovered immediately with the following message:
 
    nfs server cubietruck:/exports/sb100: is alive again
 
 I wonder how to interpret these results. Are race conditions avoided
 just because axe(4) is too slow?
 
 On 2016/11/24 6:30, Martin Husemann wrote:
 > We have to look closer at packet traces in the locked up state to find out
 > whether the fault is at the server or the client side. It is unlikely that
 > this is a sparc64 MD problem (but not impossible).
 
 This is another possibility. To check it, I ran tcpdump(8) on the
 server side:
 
      192.168.10.132.1023 > 192.168.10.128.shilp: NFS request xid 3851049355 1472 write fh 16,0/745343735 8192 (8192) bytes @ 16384 <filesync>
 15:51:53.009636 IP (tos 0x0, ttl 64, id 45280, offset 1480, flags [+], proto UDP (17), length 1500)
      192.168.10.132 > 192.168.10.128: udp
 15:51:53.009671 IP (tos 0x0, ttl 64, id 45280, offset 2960, flags [+], proto UDP (17), length 1500)
      192.168.10.132 > 192.168.10.128: udp
 15:51:53.009706 IP (tos 0x0, ttl 64, id 45280, offset 4440, flags [+], proto UDP (17), length 1500)
      192.168.10.132 > 192.168.10.128: udp
 15:51:53.009751 IP (tos 0x0, ttl 64, id 45280, offset 5920, flags [+], proto UDP (17), length 1500)
      192.168.10.132 > 192.168.10.128: udp
 15:51:53.009765 IP (tos 0x0, ttl 64, id 45280, offset 7400, flags [none], proto UDP (17), length 960)
      192.168.10.132 > 192.168.10.128: udp
 15:51:53.011171 IP (tos 0x0, ttl 64, id 8800, offset 0, flags [none], proto UDP (17), length 188)
      192.168.10.128.shilp > 192.168.10.132.1023: NFS reply xid 3851049355 reply ok 160 write PRE: sz 24576 mtime 1480229513.8255089 ctime 1480229513.8255089 POST: REG 600 ids 0/0 sz 24576 nlink 1 rdev 4095/1048575 fsid 1000 fileid 988a5d a/m/ctime 1480229512.987088165 1480229513.10505140 1480229513.10505140 8192 bytes <filesync>
 15:51:53.011590 IP (tos 0x0, ttl 64, id 45281, offset 0, flags [+], proto UDP (17), length 1500)
      192.168.10.132.1023 > 192.168.10.128.shilp: NFS request xid 3851049356 1472 write fh 16,0/745343735 4441 (4441) bytes @ 24576 <filesync>
 15:51:53.011604 IP (tos 0x0, ttl 64, id 45281, offset 1480, flags [+], proto UDP (17), length 1500)
      192.168.10.132 > 192.168.10.128: udp
 15:51:53.011635 IP (tos 0x0, ttl 64, id 45281, offset 2960, flags [+], proto UDP (17), length 1500)
      192.168.10.132 > 192.168.10.128: udp
 15:51:53.011646 IP (tos 0x0, ttl 64, id 45281, offset 4440, flags [none], proto UDP (17), length 172)
      192.168.10.132 > 192.168.10.128: udp
 15:51:53.074605 IP (tos 0x0, ttl 64, id 45282, offset 0, flags [+], proto UDP (17), length 1500)
      192.168.10.132.1023 > 192.168.10.128.shilp: NFS request xid 3851049356 1472 write fh 16,0/745343735 4441 (4441) bytes @ 24576 <filesync>
 15:51:53.074620 IP (tos 0x0, ttl 64, id 45282, offset 1480, flags [+], proto UDP (17), length 1500)
      192.168.10.132 > 192.168.10.128: udp
 15:51:53.074634 IP (tos 0x0, ttl 64, id 45282, offset 2960, flags [+], proto UDP (17), length 1500)
      192.168.10.132 > 192.168.10.128: udp
 15:51:53.074644 IP (tos 0x0, ttl 64, id 45282, offset 4440, flags [none], proto UDP (17), length 172)
      192.168.10.132 > 192.168.10.128: udp
 15:51:53.204717 IP (tos 0x0, ttl 64, id 45283, offset 0, flags [+], proto UDP (17), length 1500)
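 
 The fragment headers in the trace can be checked mechanically. What
 follows is a minimal Python sketch, not part of the original report.
 It assumes 20-byte IPv4 headers without options and the tcpdump -v
 header layout shown above; the names HDR, TRACE and fragment_report
 are made up for illustration. For each IP id it reports whether the
 fragments seen are contiguous from offset 0 up to a final fragment
 without the more-fragments flag, and how many payload bytes they cover.

```python
import re
from collections import defaultdict

# Matches the IP header line printed by "tcpdump -v", as seen in the trace.
HDR = re.compile(r"id (\d+), offset (\d+), flags \[([^\]]*)\].*length (\d+)\)")

IP_HEADER_LEN = 20  # assumption: IPv4 header without options

# Header lines copied from the trace (ids 45281 and 45283 only).
TRACE = [
    "15:51:53.011590 IP (tos 0x0, ttl 64, id 45281, offset 0, flags [+], proto UDP (17), length 1500)",
    "15:51:53.011604 IP (tos 0x0, ttl 64, id 45281, offset 1480, flags [+], proto UDP (17), length 1500)",
    "15:51:53.011635 IP (tos 0x0, ttl 64, id 45281, offset 2960, flags [+], proto UDP (17), length 1500)",
    "15:51:53.011646 IP (tos 0x0, ttl 64, id 45281, offset 4440, flags [none], proto UDP (17), length 172)",
    "15:51:53.204717 IP (tos 0x0, ttl 64, id 45283, offset 0, flags [+], proto UDP (17), length 1500)",
]


def fragment_report(lines):
    """Return {ip_id: (is_complete, contiguous_payload_bytes)}."""
    frags = defaultdict(list)
    for line in lines:
        m = HDR.search(line)
        if m:
            ip_id, offset, flags, length = m.groups()
            payload = int(length) - IP_HEADER_LEN
            frags[int(ip_id)].append((int(offset), payload, "+" in flags))

    report = {}
    for ip_id, parts in frags.items():
        parts.sort()
        expected = 0          # next offset we expect to see
        contiguous = True
        for offset, size, _more in parts:
            if offset != expected:
                contiguous = False
                break
            expected = offset + size
        last_seen = not parts[-1][2]  # final fragment has no "+" flag
        report[ip_id] = (contiguous and last_seen, expected)
    return report


if __name__ == "__main__":
    for ip_id, (complete, size) in sorted(fragment_report(TRACE).items()):
        print(ip_id, "complete" if complete else "incomplete", size)
```

 On these lines, id 45281 (the first copy of the unanswered request)
 comes out complete with 4592 payload bytes, which would suggest that
 the whole datagram reached the server's interface. Id 45283 is
 reported incomplete only because the excerpt ends after its first
 fragment.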
 
 The server responded to requests up to xid 3851049355, but it never
 responded to xid 3851049356. How can I investigate further? By
 inserting debugging code into the server or the client? Which
 variables should I check then?
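 
 For longer captures, pairing requests with replies by xid can be
 automated. This is a minimal Python sketch under the assumption that
 tcpdump prints "NFS request xid N" and "NFS reply xid N" as in the
 trace above; XID, SAMPLE and unanswered_xids are hypothetical names,
 and the sample lines are shortened copies of lines from the trace.

```python
import re
from collections import Counter

# Matches the NFS summary lines printed by tcpdump in the trace.
XID = re.compile(r"NFS (request|reply) xid (\d+)")

# Shortened copies of the NFS lines from the trace.
SAMPLE = [
    "192.168.10.132.1023 > 192.168.10.128.shilp: NFS request xid 3851049355 1472 write",
    "192.168.10.128.shilp > 192.168.10.132.1023: NFS reply xid 3851049355 reply ok 160 write",
    "192.168.10.132.1023 > 192.168.10.128.shilp: NFS request xid 3851049356 1472 write",
    "192.168.10.132.1023 > 192.168.10.128.shilp: NFS request xid 3851049356 1472 write",
]


def unanswered_xids(lines):
    """Return {xid: times_sent} for requests that never got a reply."""
    sent = Counter()
    replied = set()
    for line in lines:
        m = XID.search(line)
        if not m:
            continue
        kind, xid = m.group(1), int(m.group(2))
        if kind == "request":
            sent[xid] += 1      # retransmissions reuse the same xid
        else:
            replied.add(xid)
    return {xid: n for xid, n in sent.items() if xid not in replied}


if __name__ == "__main__":
    for xid, n in unanswered_xids(SAMPLE).items():
        print(f"xid {xid}: sent {n} time(s), no reply seen")
```

 A count greater than one for an xid means the client retransmitted
 the same request, as it did for xid 3851049356 here.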
 
 Thanks,
 Rin
 

