tech-net archive


Re: DIAGNOSTIC panic in unp_gc function



On 2/27/15 7:59 PM, Christos Zoulas wrote:
In article <54F06A82.2030109%teznetworks.com@localhost>,
Ritesh Agrawal  <ritesh.agrawal%teznetworks.com@localhost> wrote:
Hi All,

I have been seeing random panics on my NetBSD-6.0-based system, in the
AF_LOCAL protocol code (uipc_usrreq.c). One of the panics was:

uvm_fault(0xffffffff81a195e0, 0x0, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff803042ff cs 8 rflags 10283 cr2  8 cpl 0 rsp
fffffe810f27bab0
panic: trap
cpu1: Begin traceback...
printf_nolog() at netbsd:printf_nolog
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0x96
unp_detach() at netbsd:unp_detach+0x2e
uipc_usrreq() at netbsd:uipc_usrreq+0x79
soclose() at netbsd:soclose+0x79
soo_close() at netbsd:soo_close+0x1a
closef() at netbsd:closef+0x4a
unp_thread() at netbsd:unp_thread+0x3cb
cpu1: End traceback...

I then installed a NetBSD kernel with the DIAGNOSTIC option, and the
DIAGNOSTIC kernel panicked at the following line:

sys/kern/uipc_usrreq.c:1713, with current TOT of the MAIN branch.

It seems that we can get into this code with a file pointer reference
count of 0. I got into this situation through the following steps:

1) I am passing a file descriptor between two processes over an
AF_LOCAL socket.
2) Process 1 opens an AF_LOCAL socket with fd "x".
3) Process 1 passes that same fd to process 2 as part of a control
message sent over this fd.
4) Process 1 closes its fd.
5) The fd now exists only inside the kernel, in a control message on
the receive mbuf queue of the AF_LOCAL socket held by process 2.
6) Therefore "fp->f_count" of the passed fd is 1.
7) unp_thread kicks in to process the deferred closes.
8) It walks the filehead list and, for each UNIX domain socket, scans
its receive mbufs.
9) It marks every file descriptor for a socket found in a control
message as "FDEFER".
10) The file descriptor sent by process 1 is therefore marked FDEFER.
11) Before the "filehead" list is rescanned (the FDEFER fd is ahead in
the list), process 2 wakes up, receives the mbuf, and closes the FDEFER
file descriptor.
12) Since there was only one reference on this file descriptor, the
file is returned to the "file_cache" pool cache when process 2 closes it.
13) The file is freed but still remains on the "filehead" list, because
files are only removed from "filehead" in the "file_dtor" function.
14) "file_dtor" is the pool cache destructor ("pc_dtor") and is only
called conditionally.
15) When "unp_gc" rescans "filehead", it finds this file with "FDEFER"
set and a file pointer "f_count" of 0.
16) It hits the KASSERT and the system panics.
17) On a non-DIAGNOSTIC kernel it accesses invalid data instead and
crashes elsewhere.

I looked at the current code; this code still exists and looks the
same. I think we should either take an extra reference on the file
pointer (f_count) when marking it FDEFER and drop it again in this
loop, or check for a file pointer with "FDEFER" set and "f_count == 0"
at line 1713 and simply "continue", since this can be a valid state.
Yes, I've seen that too but it is rare. Can you make an example program
that triggers it?
I agree this is not easily reproducible, but I can reproduce it in my setup using an in-house automation tool. It occurs only under specific conditions, and I am not sure I can write a sample program that reproduces it at will; it might need the complete infrastructure and a similar setup. We would need to make process 2 close the socket exactly when "unp_gc" has marked it as "FDEFER".

We hit this issue by accident: we were supposed to pass a regular file descriptor, but due to a bug in our code the original descriptor got closed, and the communicating AF_LOCAL socket was assigned the same fd number and was passed along in the message instead.

How do you think we should fix this issue, given that we can definitely reproduce it in our closed environment?
I replaced the "KASSERT(fp->f_count != 0);" with a check and a continue:

1713c1713,1716
<                 KASSERT(fp->f_count != 0);
---
>                 if (fp->f_count == 0) {
>                     mutex_exit(&fp->f_lock);
>                     continue;
>                 }

But I know this is not the right way to fix it, and we will need a better fix for this problem.

Ritesh

