NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

kern/58996: wg(4): stuck in INIT_ACTIVE state



>Number:         58996
>Category:       kern
>Synopsis:       wg(4): stuck in INIT_ACTIVE state
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jan 15 13:35:00 +0000 2025
>Originator:     Taylor R Campbell
>Release:        10
>Organization:
The NetWG Hangdation
>Environment:
>Description:
I found wg-userspace got stuck.  I attached gdb to the rump_server process.  The stable session is in UNKNOWN state (i.e., it's all torn down).  The unstable session is in INIT_ACTIVE state and wgp_pending is nonnull, i.e., it has a packet ready to be sent once the handshake completes.

But:

- wgp_handshake_timeout_timer has been halted (flags = 256 = 0x100 = CALLOUT_MPSAFE -- no CALLOUT_PENDING, CALLOUT_FIRED, or CALLOUT_INVOKING)
- wgp_tasks is 0 (no tasks pending)

This is weird, because:

1. The only path that enters WGS_STATE_INIT_ACTIVE is wg_send_handshake_msg_init, and that either
   (a) fails and returns to WGS_STATE_UNKNOWN via wg_put_session_index; or
   (b) schedules wgp_handshake_timeout_timer.

2. The only paths that halt wgp_handshake_timeout_timer are wg_destroy_peer (obviously not used here), and wg_handle_msg_resp which transitions to WGS_STATE_ESTABLISHED.

3. If wgp_handshake_timeout_timer fires, it will schedule task WGP_TASK_RETRY_HANDSHAKE; if that task runs, wg_task_retry_handshake will transition to WGS_STATE_UNKNOWN and maybe redo wg_send_handshake_msg_init.

So how can it enter this state?

Recent dmesg output with `sysctl -w net.wg.debug=-1' while sending a few pings (unfortunately, I don't have the dmesg output from when it _entered_ this state):

[ 136289.940005] wg_schedule_session_dtor_timer: scheduling session dtor in 180 secs
[ 136289.940005] wg_schedule_peer_task: tasks=0, task=32
[ 136289.940005] wg_task_destroy_prev_session:3505: WGP_TASK_DESTROY_PREV_SESSION
[ 136469.960005] wg_session_dtor_timer:2810: enter
[ 136469.960005] wg_schedule_session_dtor_timer: scheduling session dtor in 180 secs
[ 136469.960005] wg_schedule_peer_task: tasks=0, task=32
[ 136469.960005] wg_task_destroy_prev_session:3505: WGP_TASK_DESTROY_PREV_SESSION
[ 136649.990005] wg_session_dtor_timer:2810: enter
[ 136649.990005] wg_schedule_session_dtor_timer: scheduling session dtor in 180 secs
[ 136649.990005] wg_schedule_peer_task: tasks=0, task=32
[ 136650.010005] wg_task_destroy_prev_session:3505: WGP_TASK_DESTROY_PREV_SESSION
[ 136809.980005] rumpkern_wg_recv_user:5557:
[ 136809.980005] rumpkern_wg_recv_user: iov_len=56
[ 136809.980005] rumpkern_wg_recv_user:  60 00 00 00 00 10 3a 40 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 02 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 01 80 00 69 bb 96 c8 00 02 67 87 b6 3f 00 04 67 5d
[ 136809.980005] wg_pick_peer_by_sa: sa=inet6: fd01::1
[ 136809.980005] wg_pick_peer_by_sa:4295: success
[ 136809.980005] wgintr:3809: no stable session
[ 136809.980005] wgintr:3819: first packet already queued, dropping
[ 136810.980005] rumpkern_wg_recv_user:5557:
[ 136810.980005] rumpkern_wg_recv_user: iov_len=56
[ 136810.980005] rumpkern_wg_recv_user:  60 00 00 00 00 10 3a 40 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 02 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 01 80 00 62 5e 96 c8 00 03 67 87 b6 40 00 04 6e b8
[ 136810.980005] wg_pick_peer_by_sa: sa=inet6: fd01::1
[ 136810.980005] wg_pick_peer_by_sa:4295: success
[ 136810.980005] wgintr:3809: no stable session
[ 136810.980005] wgintr:3819: first packet already queued, dropping
[ 136829.990005] wg_session_dtor_timer:2810: enter
[ 136829.990005] wg_schedule_session_dtor_timer: scheduling session dtor in 180 secs
[ 136829.990005] wg_schedule_peer_task: tasks=0, task=32
[ 136829.990005] wg_task_destroy_prev_session:3505: WGP_TASK_DESTROY_PREV_SESSION
[ 136841.080005] rumpkern_wg_recv_user:5557:
[ 136841.080005] rumpkern_wg_recv_user: iov_len=56
[ 136841.080005] rumpkern_wg_recv_user:  60 00 00 00 00 10 3a 40 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 02 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 01 80 00 77 50 f7 53 00 00 67 87 b6 5e 00 05 f9 1e
[ 136841.080005] wg_pick_peer_by_sa: sa=inet6: fd01::1
[ 136841.080005] wg_pick_peer_by_sa:4295: success
[ 136841.080005] wgintr:3809: no stable session
[ 136841.080005] wgintr:3819: first packet already queued, dropping
[ 136842.080005] rumpkern_wg_recv_user:5557:
[ 136842.080005] rumpkern_wg_recv_user: iov_len=56
[ 136842.080005] rumpkern_wg_recv_user:  60 00 00 00 00 10 3a 40 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 02 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 01 80 00 72 90 f7 53 00 01 67 87 b6 5f 00 05 fd dc
[ 136842.080005] wg_pick_peer_by_sa: sa=inet6: fd01::1
[ 136842.080005] wg_pick_peer_by_sa:4295: success
[ 136842.080005] wgintr:3809: no stable session
[ 136842.080005] wgintr:3819: first packet already queued, dropping
[ 136846.670005] rumpkern_wg_recv_user:5557:
[ 136846.670005] rumpkern_wg_recv_user: iov_len=56
[ 136846.670005] rumpkern_wg_recv_user:  60 00 00 00 00 10 3a 40 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 02 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 01 80 00 4b fa 21 f9 00 00 67 87 b6 63 00 0e f9 c1
[ 136846.670005] wg_pick_peer_by_sa: sa=inet6: fd01::1
[ 136846.670005] wg_pick_peer_by_sa:4295: success
[ 136846.670005] wgintr:3809: no stable session
[ 136846.670005] wgintr:3819: first packet already queued, dropping

The wg(4) debug output in dmesg appears to be substantively different from PR kern/58938: wg tunnel dies after a few days
 <https://gnats.NetBSD.org/58938> so I'm filing this as a separate PR.  In particular, this is about one endpoint getting stuck and not transmitting anything new, while the other one is about both endpoints talking past each other or something -- or possibly sending handshakes only to stale addresses or something, not sure, but looks different in any case.
>How-To-Repeat:
run a wg(4) tunnel for a while (days or weeks)
>Fix:
Yes, please!



Home | Main Index | Thread Index | Old Index