Current-Users archive


Re: NFS related panics and hangs



[ Adding tech-kern. The relevant earlier mails start at
  http://mail-index.netbsd.org/current-users/2015/10/19/msg028233.html
  This is about a default-installed amd64 GENERIC 7.0 kernel.
  Replies are better in tech-kern, I think, so I set Reply-To
  accordingly.  ]


On Fri 23 Oct 2015 at 00:46:57 +0200, Rhialto wrote:
> This problem is very repeatable, usually within a few hours, just now it
> happened within half an hour.
> 
> It seems to me that somehow the nfs_reqq list gets corrupted. Then
> either there is a crash when traversing it in nfs_timer() (occurring in
> nfs_sigintr() due to being called with a bogus pointer), or there is a
> hang when one of the NFS requests gets lost and never retried.

Looking into this:

The occurrences of nfs_reqq are as follows:

fs/nfs/client/nfs_clvnops.c: * nfs_reqq_mtx : Global lock, protects the nfs_reqq list.

Since there is no other mention of nfs_reqq_mtx in the whole syssrc
tarball, this looks wrong.  It also immediately causes the suspicion
that the list isn't in fact protected at all.

nfs/nfs.h:extern TAILQ_HEAD(nfsreqhead, nfsreq) nfs_reqq;

nfs/nfs_clntsocket.c:         TAILQ_FOREACH(rep, &nfs_reqq, r_chain) {
nfs/nfs_clntsocket.c: TAILQ_INSERT_TAIL(&nfs_reqq, rep, r_chain);
nfs/nfs_clntsocket.c: TAILQ_REMOVE(&nfs_reqq, rep, r_chain);

Protected with

    s = splsoftnet();

for matches #2 and #3, but #1 does not seem to be protected by anything
I can see nearby. Maybe it is protected by

    error = nfs_rcvlock(nmp, myrep);

if that makes any sense. However, that function definitely uses neither
splsoftnet() nor mutex_enter(softnet_lock).

nfs/nfs_socket.c:struct nfsreqhead nfs_reqq;
nfs/nfs_socket.c:     TAILQ_FOREACH(rp, &nfs_reqq, r_chain) {
nfs/nfs_socket.c:     TAILQ_FOREACH(rep, &nfs_reqq, r_chain) {

Match #3 is protected with

    mutex_enter(softnet_lock);	/* XXX PR 40491 */

but none of the others are (at least not visibly nearby).

#2 is called from nfs_receive(), which uses nfs_sndlock(), which also
uses neither splsoftnet() nor mutex_enter(softnet_lock).

nfs/nfs_subs.c:       TAILQ_INIT(&nfs_reqq);

presumably doesn't need any extra protection.

softnet_lock is allocated as

./kern/uipc_socket.c:kmutex_t   *softnet_lock;
./kern/uipc_socket.c:   softnet_lock = mutex_obj_alloc(MUTEX_DEFAULT, IPL_NONE);

IPL_NONE seems inconsistent with splsoftnet().

I never studied the inner details of kernel locking, but the diversity
of protections of this list doesn't inspire trust at first sight...
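
For comparison, here is a minimal userland sketch of what "one lock,
taken around every access to the list" looks like. It is not the NFS
code and not a proposed patch: a pthread mutex stands in for whatever
kernel lock would be appropriate, and struct req, reqq, reqq_mtx and the
reqq_*() helpers are hypothetical names made up for illustration. Only
the TAILQ macros and the r_chain field name are taken from the excerpts
above.

```c
/*
 * Userland illustration only. The names struct req, reqq, reqq_mtx and
 * reqq_*() are hypothetical; a pthread mutex stands in for a kernel lock.
 */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/queue.h>

struct req {
	int xid;			/* request identifier */
	int retries;			/* bumped by the timer pass */
	TAILQ_ENTRY(req) r_chain;	/* list linkage */
};

static TAILQ_HEAD(reqhead, req) reqq = TAILQ_HEAD_INITIALIZER(reqq);
static pthread_mutex_t reqq_mtx = PTHREAD_MUTEX_INITIALIZER;

/* Insert at the tail, holding the list lock. */
static void
reqq_insert(struct req *rep)
{
	assert(rep != NULL);
	pthread_mutex_lock(&reqq_mtx);
	TAILQ_INSERT_TAIL(&reqq, rep, r_chain);
	pthread_mutex_unlock(&reqq_mtx);
}

/* Remove an entry, holding the same lock. */
static void
reqq_remove(struct req *rep)
{
	assert(rep != NULL);
	pthread_mutex_lock(&reqq_mtx);
	TAILQ_REMOVE(&reqq, rep, r_chain);
	pthread_mutex_unlock(&reqq_mtx);
}

/* Timer-style traversal under the lock; returns the entries visited. */
static int
reqq_timer(void)
{
	struct req *rep;
	int n = 0;

	pthread_mutex_lock(&reqq_mtx);
	TAILQ_FOREACH(rep, &reqq, r_chain) {
		rep->retries++;
		n++;
	}
	pthread_mutex_unlock(&reqq_mtx);
	return n;
}

/* Lookup-style traversal under the lock; NULL if xid is not queued. */
static struct req *
reqq_find(int xid)
{
	struct req *rep, *found = NULL;

	pthread_mutex_lock(&reqq_mtx);
	TAILQ_FOREACH(rep, &reqq, r_chain) {
		if (rep->xid == xid) {
			found = rep;
			break;
		}
	}
	pthread_mutex_unlock(&reqq_mtx);
	return found;
}
```

The point of the sketch is only that every insert, remove and traversal
goes through the same lock, so no traversal can observe a half-updated
link. Whether the kernel should use softnet_lock, a dedicated mutex, or
something else for nfs_reqq is exactly the open question.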

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl    -- 'this bath is too hot.'
