NetBSD-Users archive


Re: FreeRADIUS instability



On Tue, Sep 14, 2021 at 06:08:31PM -0000, Christos Zoulas wrote:

I do not know if this is NetBSD-related, but I have suffered from FreeRADIUS
instability on NetBSD for a long time and do not know how to debug it.

Symptoms are: the RADIUS server randomly (once a day or once a week) stops
answering, and this is not connected to the actual load. While in that state
it can be killed only with -9; other signals do nothing, and the rc.d restart
script just hangs.

I have compiled a debug version of it and attached gdb:

(gdb) bt
#0  0x000077280da42b8a in _sys___kevent50 () from /usr/lib/libc.so.12
#1  0x000077280e807879 in __kevent50 () from /usr/lib/libpthread.so.1
#2  0x00007728106270e1 in fr_event_loop (el=0x7728105bcb20)
    at src/lib/event.c:625
#3  0x00000000004364dd in radius_event_process () at src/main/process.c:6056
#4  0x00000000004466c3 in main (argc=<optimized out>, argv=<optimized out>)
    at src/main/radiusd.c:641

gdb always shows it stuck in the kevent call. radiusd was started with -txx,
meaning no threads were used.

src/lib/event.c:625 says:

rcode = kevent(el->kq, NULL, 0, el->events, FR_EV_MAX_FDS, ts_wake);

It seems the kevent call is misused somehow, so that the syscall either
never returns or is blocked. What can I debug further?

Well, it seems that the signals are blocked and this does not have to
do with kevent (probably FreeRADIUS does it explicitly). You can use

ps -p $pid-of-freeradius -o sigmask,sigcatch,sigignore

to see what signals are handled.

BLOCKED CAUGHT  IGNORED
      0   44ab 98489000

Do I understand correctly that there are no blocked signals for this process?
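
The hex masks printed by ps can be decoded by hand, but a small script makes
the check less error-prone. The following is only a sketch: it assumes that
bit n-1 of the mask corresponds to signal number n, that ps printed the whole
mask in the values shown, and that the name table matches NetBSD's signal
numbering (compare it with `kill -l` on the host). The function names are
made up for this example.

```python
# Signal names for numbers 1..32. This table is an assumption based on
# NetBSD's signal numbering; verify it with `kill -l` on the target host.
NETBSD_SIGNALS = (
    "HUP INT QUIT ILL TRAP ABRT EMT FPE KILL BUS SEGV SYS PIPE ALRM TERM "
    "URG STOP TSTP CONT CHLD TTIN TTOU IO XCPU XFSZ VTALRM PROF WINCH "
    "INFO USR1 USR2 PWR"
).split()


def decode_sigmask(hex_mask):
    """Return the signal numbers set in a hex mask (bit n-1 <-> signal n)."""
    mask = int(hex_mask, 16)
    return [n for n in range(1, mask.bit_length() + 1) if (mask >> (n - 1)) & 1]


def signal_names(hex_mask):
    """Translate a hex mask into names, falling back to the bare number."""
    return [
        NETBSD_SIGNALS[n - 1] if n <= len(NETBSD_SIGNALS) else str(n)
        for n in decode_sigmask(hex_mask)
    ]


if __name__ == "__main__":
    # Values copied from the ps output above.
    for label, value in (("BLOCKED", "0"), ("CAUGHT", "44ab"), ("IGNORED", "98489000")):
        print(label, signal_names(value))
```

Under those assumptions BLOCKED decodes to an empty list, so no signals are
blocked, and CAUGHT includes TERM, which means the process has its own
handler installed for it instead of the default action.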

Now, why kevent is stuck, is a different story. You can use
fstat -p $pid-of-freeradius to see what files it has open; perhaps this
will provide a clue.

I have compared the fstat output of a working process with that of a hung
one. The only difference is one LDAP connection that changed (possibly
reconnected); this may or may not be related.
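
A comparison like this can be scripted so that it is easy to repeat each time
the daemon hangs. The sketch below assumes the fstat column order shown in
this message (USER, CMD, PID, FD, then the description) and keys each row by
its FD column. The helper names are hypothetical, and the "working" sample
uses an invented local port (50000) purely for illustration; only the "hung"
rows are taken from the real output.

```python
def parse_fstat(text):
    """Map each descriptor (4th column, e.g. '10*') to the rest of its row."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] == "USER":
            continue  # skip blank lines and the header row
        table[fields[3]] = " ".join(fields[4:])
    return table


def diff_fstat(before, after):
    """Return {fd: (old, new)} for descriptors that differ; None means absent."""
    old, new = parse_fstat(before), parse_fstat(after)
    return {
        fd: (old.get(fd), new.get(fd))
        for fd in sorted(set(old) | set(new))
        if old.get(fd) != new.get(fd)
    }


# Invented "working" snapshot: port 50000 on FD 31 is NOT from the real system.
WORKING = """\
radiusd  radiusd    25118   10* internet stream tcp steel.:ldap <-> almaz.:59264
radiusd  radiusd    25118   31* internet stream tcp steel.:ldap <-> almaz.:50000
"""

# Rows copied from the hung process.
HUNG = """\
radiusd  radiusd    25118   10* internet stream tcp steel.:ldap <-> almaz.:59264
radiusd  radiusd    25118   31* internet stream tcp steel.:ldap <-> almaz.:60645
"""

if __name__ == "__main__":
    for fd, (was, now) in diff_fstat(WORKING, HUNG).items():
        print(fd, "was:", was, "| now:", now)
```

In practice the two inputs would be saved copies of `fstat -p` output taken
while the server is healthy and again after it stops answering.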

What other clues can the output give?

radiusd  radiusd    25118   wd  /export    3924481 drwxr-xr-x       512 r
radiusd  radiusd    25118    0  /           106352 crw-rw-rw-      null rw
radiusd  radiusd    25118    1  /var       1695507 -rw-r----- 401724344 w
radiusd  radiusd    25118    2  /var       1695507 -rw-r----- 401724344 w
radiusd  radiusd    25118    3* kqueue pending 0
radiusd  radiusd    25118    4* crypto 0xffffe715032aa4d0
radiusd  radiusd    25118    5  /           106013 -rw-r--r--        92 r
radiusd  radiusd    25118    6  /var       1695507 -rw-r----- 401724344 w
radiusd  radiusd    25118    7  /var       1695517 -rw-r-----  10828283 rw
radiusd  radiusd    25118    8  /           106013 -rw-r--r--        92 r
radiusd  radiusd    25118    9* internet stream tcp central.:postgresql <-> almaz.:59270
radiusd  radiusd    25118   10* internet stream tcp steel.:ldap <-> almaz.:59264
radiusd  radiusd    25118   11* internet stream tcp steel.:ldap <-> almaz.:59263
radiusd  radiusd    25118   12* internet stream tcp central.:postgresql <-> almaz.:63159
radiusd  radiusd    25118   13* internet stream tcp central.:postgresql <-> almaz.:59262
radiusd  radiusd    25118   14* internet stream tcp central.:postgresql <-> almaz.:59234
radiusd  radiusd    25118   19* pipe 0xffffe71416359510 <- 0xffffe714e88e1530 rn
radiusd  radiusd    25118   20* pipe 0xffffe714e88e1530 -> 0xffffe71416359510 wn
radiusd  radiusd    25118   21* internet dgram udp *:radius
radiusd  radiusd    25118   22* internet dgram udp *:radius-acct
radiusd  radiusd    25118   23* internet6 dgram udp *:radius
radiusd  radiusd    25118   24* internet6 dgram udp *:radius-acct
radiusd  radiusd    25118   25* internet dgram udp localhost:18120
radiusd  radiusd    25118   26* internet dgram udp localhost:18121
radiusd  radiusd    25118   27* internet dgram udp *:51200
radiusd  radiusd    25118   28* internet6 dgram udp *:65238
radiusd  radiusd    25118   31* internet stream tcp steel.:ldap <-> almaz.:60645

FD #3 "kqueue pending" was there when it was working as well.

One more thing I do not understand: radiusd has never been caught hanging
when run in the foreground. Is this some kind of clue?

--
Sincerely yours,
Dima Veselov
Physics R&D Establishment of Saint-Petersburg University

