kern/58996: wg(4): stuck in INIT_ACTIVE state
>Number: 58996
>Category: kern
>Synopsis: wg(4): stuck in INIT_ACTIVE state
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Jan 15 13:35:00 +0000 2025
>Originator: Taylor R Campbell
>Release: 10
>Organization:
The NetWG Hangdation
>Environment:
>Description:
I found wg-userspace got stuck, so I attached gdb to the rump_server process. The stable session is in UNKNOWN state (i.e., it has been completely torn down). The unstable session is in INIT_ACTIVE state and wgp_pending is non-null, i.e., there is a packet queued to be sent once the handshake completes.
But:
- wgp_handshake_timeout_timer has been halted (flags = 256 = 0x100 = CALLOUT_MPSAFE -- no CALLOUT_PENDING, CALLOUT_FIRED, or CALLOUT_INVOKING)
- wgp_tasks is 0 (no tasks pending)
This is weird, because:
1. The only path that enters WGS_STATE_INIT_ACTIVE is wg_send_handshake_msg_init, and that either
(a) fails and returns to WGS_STATE_UNKNOWN via wg_put_session_index; or
(b) schedules wgp_handshake_timeout_timer.
2. The only paths that halt wgp_handshake_timeout_timer are wg_destroy_peer (obviously not involved here) and wg_handle_msg_resp, which transitions to WGS_STATE_ESTABLISHED.
3. If wgp_handshake_timeout_timer fires, it will schedule task WGP_TASK_RETRY_HANDSHAKE; if that task runs, wg_task_retry_handshake will transition to WGS_STATE_UNKNOWN and maybe redo wg_send_handshake_msg_init.
So how can it enter this state?
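To make the liveness property implied by 1-3 concrete, here is a minimal self-contained C model -- an illustration only, not the actual if_wg.c code; the type and member names (peer_model, handshake_timer_pending, TASK_RETRY_HANDSHAKE) are made-up stand-ins for the wgp_/wgs_ fields discussed above. The property is: while the unstable session is in INIT_ACTIVE, either the handshake timeout callout must be pending or a retry-handshake task must be queued, yet the state observed in gdb has neither.

/*
 * Illustrative model only -- not if_wg.c.  Names are stand-ins for the
 * wgp_/wgs_ fields discussed in the analysis above.
 */
#include <stdbool.h>
#include <stdio.h>

enum wgs_state {
	WGS_STATE_UNKNOWN,
	WGS_STATE_INIT_ACTIVE,
	WGS_STATE_ESTABLISHED,
};

struct peer_model {
	enum wgs_state	unstable_state;		 /* unstable session's state */
	bool		handshake_timer_pending; /* handshake timeout callout armed? */
	unsigned	tasks;			 /* pending-task bit set */
};

#define	TASK_RETRY_HANDSHAKE	0x01	/* stand-in for WGP_TASK_RETRY_HANDSHAKE */

/*
 * Liveness property from points 1-3: if the unstable session is in
 * INIT_ACTIVE, something must still be scheduled to move it forward --
 * either the handshake timeout callout or a retry-handshake task.
 */
static bool
init_active_can_make_progress(const struct peer_model *p)
{
	if (p->unstable_state != WGS_STATE_INIT_ACTIVE)
		return true;
	return p->handshake_timer_pending ||
	    (p->tasks & TASK_RETRY_HANDSHAKE) != 0;
}

int
main(void)
{
	/* The snapshot seen in gdb: INIT_ACTIVE, callout halted, tasks == 0. */
	struct peer_model stuck = {
		.unstable_state = WGS_STATE_INIT_ACTIVE,
		.handshake_timer_pending = false,
		.tasks = 0,
	};

	printf("can make progress: %s\n",
	    init_active_can_make_progress(&stuck) ? "yes" : "no");
	return 0;
}

Compiled and run, this prints "can make progress: no", which is exactly the stuck condition: nothing is left in the system to drive the unstable session out of INIT_ACTIVE.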
Recent dmesg output with `sysctl -w net.wg.debug=-1' while sending a few pings (unfortunately, I don't have the dmesg output from when it _entered_ this state):
[ 136289.940005] wg_schedule_session_dtor_timer: scheduling session dtor in 180 secs
[ 136289.940005] wg_schedule_peer_task: tasks=0, task=32
[ 136289.940005] wg_task_destroy_prev_session:3505: WGP_TASK_DESTROY_PREV_SESSION
[ 136469.960005] wg_session_dtor_timer:2810: enter
[ 136469.960005] wg_schedule_session_dtor_timer: scheduling session dtor in 180 secs
[ 136469.960005] wg_schedule_peer_task: tasks=0, task=32
[ 136469.960005] wg_task_destroy_prev_session:3505: WGP_TASK_DESTROY_PREV_SESSION
[ 136649.990005] wg_session_dtor_timer:2810: enter
[ 136649.990005] wg_schedule_session_dtor_timer: scheduling session dtor in 180 secs
[ 136649.990005] wg_schedule_peer_task: tasks=0, task=32
[ 136650.010005] wg_task_destroy_prev_session:3505: WGP_TASK_DESTROY_PREV_SESSION
[ 136809.980005] rumpkern_wg_recv_user:5557:
[ 136809.980005] rumpkern_wg_recv_user: iov_len=56
[ 136809.980005] rumpkern_wg_recv_user: 60 00 00 00 00 10 3a 40 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 02 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 01 80 00 69 bb 96 c8 00 02 67 87 b6 3f 00 04 67 5d
[ 136809.980005] wg_pick_peer_by_sa: sa=inet6: fd01::1
[ 136809.980005] wg_pick_peer_by_sa:4295: success
[ 136809.980005] wgintr:3809: no stable session
[ 136809.980005] wgintr:3819: first packet already queued, dropping
[ 136810.980005] rumpkern_wg_recv_user:5557:
[ 136810.980005] rumpkern_wg_recv_user: iov_len=56
[ 136810.980005] rumpkern_wg_recv_user: 60 00 00 00 00 10 3a 40 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 02 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 01 80 00 62 5e 96 c8 00 03 67 87 b6 40 00 04 6e b8
[ 136810.980005] wg_pick_peer_by_sa: sa=inet6: fd01::1
[ 136810.980005] wg_pick_peer_by_sa:4295: success
[ 136810.980005] wgintr:3809: no stable session
[ 136810.980005] wgintr:3819: first packet already queued, dropping
[ 136829.990005] wg_session_dtor_timer:2810: enter
[ 136829.990005] wg_schedule_session_dtor_timer: scheduling session dtor in 180 secs
[ 136829.990005] wg_schedule_peer_task: tasks=0, task=32
[ 136829.990005] wg_task_destroy_prev_session:3505: WGP_TASK_DESTROY_PREV_SESSION
[ 136841.080005] rumpkern_wg_recv_user:5557:
[ 136841.080005] rumpkern_wg_recv_user: iov_len=56
[ 136841.080005] rumpkern_wg_recv_user: 60 00 00 00 00 10 3a 40 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 02 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 01 80 00 77 50 f7 53 00 00 67 87 b6 5e 00 05 f9 1e
[ 136841.080005] wg_pick_peer_by_sa: sa=inet6: fd01::1
[ 136841.080005] wg_pick_peer_by_sa:4295: success
[ 136841.080005] wgintr:3809: no stable session
[ 136841.080005] wgintr:3819: first packet already queued, dropping
[ 136842.080005] rumpkern_wg_recv_user:5557:
[ 136842.080005] rumpkern_wg_recv_user: iov_len=56
[ 136842.080005] rumpkern_wg_recv_user: 60 00 00 00 00 10 3a 40 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 02 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 01 80 00 72 90 f7 53 00 01 67 87 b6 5f 00 05 fd dc
[ 136842.080005] wg_pick_peer_by_sa: sa=inet6: fd01::1
[ 136842.080005] wg_pick_peer_by_sa:4295: success
[ 136842.080005] wgintr:3809: no stable session
[ 136842.080005] wgintr:3819: first packet already queued, dropping
[ 136846.670005] rumpkern_wg_recv_user:5557:
[ 136846.670005] rumpkern_wg_recv_user: iov_len=56
[ 136846.670005] rumpkern_wg_recv_user: 60 00 00 00 00 10 3a 40 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 02 fd 01 00 00 00 00 00 00 00 00 00 00 00 00 00 01 80 00 4b fa 21 f9 00 00 67 87 b6 63 00 0e f9 c1
[ 136846.670005] wg_pick_peer_by_sa: sa=inet6: fd01::1
[ 136846.670005] wg_pick_peer_by_sa:4295: success
[ 136846.670005] wgintr:3809: no stable session
[ 136846.670005] wgintr:3819: first packet already queued, dropping
The wg(4) debug output in dmesg appears to be substantively different from PR kern/58938 (wg tunnel dies after a few days, <https://gnats.NetBSD.org/58938>), so I'm filing this as a separate PR. In particular, this one is about a single endpoint getting stuck and never transmitting anything new, while that one is about both endpoints talking past each other -- or possibly sending handshakes only to stale addresses; I'm not sure, but it looks different in any case.
>How-To-Repeat:
Run a wg(4) tunnel for a while (days or weeks).
>Fix:
Yes, please!