NetBSD-Users archive


Re: FreeRADIUS instability



On 9/29/21 09:09, Pawel S. Veselov wrote:

Yes, the question is what happened to fd#3 (presumably the kqueue).
If you can get into the debugger (gdb <radiusd> <pid>) and look at
queue call and see what fd is passed to it?
It's still fd#3

What we have determined from tracing the process is that fd#3 is
being closed and then re-opened as another kqueue (due to fd reuse).
radiusd then keeps using it as its own, but since none of its
filters are registered there, the process is effectively dead.

So we caught where the kqueue is closed, and traced it back to
getaddrinfo(). That call closes fd#3, then creates a new kqueue
and leaves it open. This is the backtrace from close():

#0  0x0000732d69c07892 in close () from /usr/lib/libpthread.so.1
#1  0x0000732d68f25da9 in __res_ndestroy () from /usr/lib/libc.so.12
#2  0x0000732d68f2676b in __res_vinit () from /usr/lib/libc.so.12
#3  0x0000732d68f26bef in __res_check () from /usr/lib/libc.so.12
#4  0x0000732d68f22220 in __res_nsend () from /usr/lib/libc.so.12
#5  0x0000732d68f2719c in ?? () from /usr/lib/libc.so.12
#6  0x0000732d68f27420 in ?? () from /usr/lib/libc.so.12
#7  0x0000732d68f2a5a9 in ?? () from /usr/lib/libc.so.12
#8  0x0000732d68f2a8bd in ?? () from /usr/lib/libc.so.12
#9  0x0000732d68f3ed49 in nsdispatch () from /usr/lib/libc.so.12
#10 0x0000732d68f286c8 in getaddrinfo () from /usr/lib/libc.so.12

The full stack traces and ktraces can be found here:

https://github.com/FreeRADIUS/freeradius-server/issues/4244

I have an idea of what's going on. AFAIU, libc maintains a kqueue
for issuing DNS requests. kqueues are not inherited on fork.
If the parent process calls getaddrinfo(), that creates an internal
DNS kqueue in its address space, assigned to a FD (let's say 3).

After fork(), FD 3 is unused in the child process. Let's say the
child immediately opens something permanent, which is assigned
FD 3.

Then the child calls getaddrinfo(). The internal state of the
resolver still has the statp object that references FD 3 (I don't
believe it's cleaned up after fork), but FD 3 now belongs to the
application, and the obvious collision occurs.
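
To make the collision concrete, here is a minimal sketch in Python
rather than the real libc code. FakeResolver and fd_reuse_collision
are hypothetical names, /dev/null stands in for the resolver's
kqueue, a pipe stands in for the application's descriptor, and the
fork is simulated by closing the descriptor behind the resolver's
back (which is roughly what a non-inherited kqueue looks like from
the child's side). It relies only on the POSIX rule that open()
returns the lowest unused descriptor number.

```python
import os
import stat


class FakeResolver:
    """Hypothetical stand-in for resolver state that caches an fd number."""

    def __init__(self):
        self.fd = None

    def lookup(self):
        # Like the re-init path in the backtrace: close whatever number is
        # cached, then open a fresh descriptor and remember its number.
        if self.fd is not None:
            os.close(self.fd)
        self.fd = os.open(os.devnull, os.O_RDONLY)
        return self.fd


def fd_reuse_collision():
    resolver = FakeResolver()
    lib_fd = resolver.lookup()          # "parent" lookup creates the fd

    # Simulated fork: the descriptor is gone in the "child", but the
    # resolver object still remembers its number.
    os.close(lib_fd)

    # The application opens something long-lived; the lowest free
    # number is the one the resolver still believes it owns.
    app_r, app_w = os.pipe()
    reused = (app_r == lib_fd)
    pipe_before = stat.S_ISFIFO(os.fstat(app_r).st_mode)

    # Next lookup: the resolver closes the application's descriptor and
    # opens its own, which lands on the same number again.
    resolver.lookup()
    pipe_after = stat.S_ISFIFO(os.fstat(app_r).st_mode)

    os.close(resolver.fd)
    os.close(app_w)
    return reused, pipe_before, pipe_after
```

The descriptor number stays valid the whole time, which is why the
application does not get EBADF: it simply ends up holding an object
that is no longer the one it created.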

From ktrace:

Parent:
(getaddrinfo or such)
 28913      1 radiusd  1632165412.994373444 CALL  kqueue1(0x400000)
 28913      1 radiusd  1632165412.994374612 RET   kqueue1 3
... parent never closes 3
(fork)
 28913      1 radiusd  1632165413.001116635 CALL  fork
 28913      1 radiusd  1632165413.001356463 RET   fork 16226/0x3f62
(child creates its own kqueue)
 16226      1 radiusd  1632165413.002185215 CALL  kqueue
 16226      1 radiusd  1632165413.002186171 RET   kqueue 3
(child calls getaddrinfo, telltale is reading /etc/hosts)
 16226      1 radiusd  1632397379.465012449 GIO   fd 15 read 731 bytes
       "#	$NetBSD: hosts,v 1.9 2013/11/24 07:20:01 dholland Exp
 16226      1 radiusd  1632397379.465033818 CALL  __gettimeofday50(0x7f7fff62e700,0)
(resolver uses FD 3 as its own, reading from it and closing it)
 16226      1 radiusd  1632397379.465034253 RET   __gettimeofday50 0
 16226      1 radiusd  1632397379.465036295 CALL  __kevent50(3,0,0,0x7f7fff62e110,1,0x74fd7b7787b0)
 16226      1 radiusd  1632397379.465037310 RET   __kevent50 1
 16226      1 radiusd  1632397379.465043316 CALL  close(3)

I think the only way to fix this is to have the resolver state
cleaned up thoroughly after fork(). I can't see how this can be
worked around by applications.
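
For illustration only, this is roughly what "cleaned up after
fork()" could look like in a toy model. It is not NetBSD libc code
and not a proposed patch; ForkAwareResolver and
child_sees_clean_state are hypothetical names, and Python's
os.register_at_fork() merely plays the role that a fork handler
inside the library would play.

```python
import os


class ForkAwareResolver:
    """Hypothetical resolver state that forgets its fd in a forked child."""

    def __init__(self):
        self.fd = None
        # Run forget_descriptor() in the child right after every fork.
        os.register_at_fork(after_in_child=self.forget_descriptor)

    def forget_descriptor(self):
        # Forget the number instead of closing it: the child must not
        # assume the number still refers to the resolver's own object.
        # (In this toy the /dev/null fd is inherited and so stays open
        # in the child; a kqueue would not be inherited at all.)
        self.fd = None

    def lookup(self):
        if self.fd is None:
            self.fd = os.open(os.devnull, os.O_RDONLY)
        return self.fd


def child_sees_clean_state():
    resolver = ForkAwareResolver()
    resolver.lookup()                   # parent caches a descriptor
    pid = os.fork()
    if pid == 0:
        # Child: report through the exit status whether the cached
        # descriptor number was dropped by the fork handler.
        os._exit(0 if resolver.fd is None else 1)
    _, status = os.waitpid(pid, 0)
    os.close(resolver.fd)               # parent state is untouched
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

The point of the sketch is that only the code owning the cached
state can reset it, which is why this looks like a libc-side fix
rather than something an application can do.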

Thank you.

