Re: kern/56925: Amd64 server randomly panics

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,joel.bertrand%systella.fr@localhost
Subject: Re: kern/56925: Amd64 server randomly panics
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Fri, 15 Jul 2022 09:00:02 +0000 (UTC)

The following reply was made to PR kern/56925; it has been noted by GNATS.

From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
To: joel.bertrand%systella.fr@localhost
Cc: gnats-bugs%NetBSD.org@localhost, mlelstv%NetBSD.org@localhost
Subject: Re: kern/56925: Amd64 server randomly panics
Date: Fri, 15 Jul 2022 08:56:29 +0000

 I see two problems in the ccb_timeout and wake_ccb logic:
 
 1. ccb_timeout is called from softint context, and calls wake_ccb,
    which calls ccb_timeout_stop, which in turn sometimes calls kpause.
 
    Calling kpause is forbidden in softint context, so this has to
    change -- either the logic is never reached, in which case it's
    dead code and should be removed, or the logic is reachable, in
    which case it's broken and needs to be fixed.
 
 2. ccb_timeout_stop may fail to stop the callout from firing again, if
    it had already started.
 
    In particular, unless there are other non-obvious conditions making
    this scenario impossible, the following sequence of events may
    happen:
 
         (a) cpu0: callout fires and enters ccb_timeout
         (b) cpu1: ccb_timeout_stop calls callout_stop
         (c) cpu0: ccb_timeout calls ccb_timeout_start, which
                   reschedules the callout
         (d) cpu1: caller of ccb_timeout_stop assumes callout
                   will not fire again
         (e) a minute passes (COMMAND_TIMEOUT ticks)
         (f) cpuN: callout fires and enters ccb_timeout again
 
    If ccb_timeout_stop is not allowed to block, fixing this requires
    ccb_timeout_stop to issue callout_stop and set a flag under a lock,
    under which lock ccb_timeout tests the flag to decide whether to
    skip everything.  Then either:
 
         - if ccb_timeout_start is allowed to sleep, it must run
           callout_halt to wait for any concurrent firing of the
           callout to finish, clear the flag, and finally call
           callout_schedule; or
 
         - if ccb_timeout_start is forbidden from sleeping, it needs a
           second flag to ask ccb_timeout to reschedule itself without
           doing anything.
 
 Fixing this might draw inspiration from how these problems were
 addressed in the USB stack -- see sys/dev/usb/usbdi.c and the logic
 around struct usbd_xfer::ux_callout, ux_timeout_set, ux_timeout_reset,
 callout_ack/invoking, &c.
 
 (The USB stack has one additional complication: the abort action on
 timeout is has to be performed in thread context, so it's run in a USB
 task scheduled from a callout, not directly in the callout.  I don't
 know whether the ccb abort action (handle_connection_error) has the
 same constraint.

Prev by Date: Re: kern/56925: Amd64 server randomly panics
Next by Date: Re: kern/56925: Amd64 server randomly panics
Previous by Thread: Re: kern/56925: Amd64 server randomly panics
Next by Thread: Re: kern/56925: Amd64 server randomly panics
Indexes:

Home | Main Index | Thread Index | Old Index