NetBSD-Bugs archive


Re: kern/59578: open() hangs indefinitely on FIFO with frequent signal interruption



The following reply was made to PR kern/59578; it has been noted by GNATS.

From: furkanonder <furkanonder%protonmail.com@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: 
Subject: Re: kern/59578: open() hangs indefinitely on FIFO with frequent signal interruption
Date: Thu, 07 Aug 2025 20:41:49 +0000

 On Thursday, August 7th, 2025 at 10:35 AM, Michael van Elst via gnats <gnats-admin%NetBSD.org@localhost> wrote:
 
 > The following reply was made to PR kern/59578; it has been noted by GNATS.
 >
 > From: mlelstv%serpens.de@localhost (Michael van Elst)
 > To: gnats-bugs%netbsd.org@localhost
 > Cc:
 > Subject: Re: kern/59578: open() hangs indefinitely on FIFO with frequent signal interruption
 > Date: Thu, 7 Aug 2025 07:32:11 -0000 (UTC)
 >
 > furkanonder%protonmail.com@localhost writes:
 >
 > > The underlying problem is that when attempting to open a FIFO for writing (O_WRONLY) while the process receives frequent signals,
 > > the open() system call gets repeatedly interrupted with EINTR and never completes successfully, even when a reader process is
 > > available on the other end.
 >
 > Actually the reproducer is bogus as the nanosleep and more important
 > the wait system calls are also interrupted by the signal (and not
 > retried).
 >
 > But if you handle all this, there still seems to be a bug.
 >
 > What happens is that:
 >
 > - the parent opens the fifo and gets blocked
 > - the child opens the fifo and succeeds, waking the parent
 > - the parent gets interrupted before resuming and finishing the
 > open system call, so open fails with EINTR.
 > - the child closes the fifo and exits.
 > - your code retries the open operation of the parent and waits
 > forever for the child that is already gone.
 >
 > I'm not sure if that is an allowed scenario, but it would probably be
 > better if the interrupted open call would still succeed when the
 > reason for blocking is gone at the same time.
 >
 > An alternative would be to only allow the peer to proceed when
 > the open system call returns success. But that synchronisation
 > is more complex.
 
 Thank you, Michael, for the detailed analysis. You've correctly identified the core kernel issue.
 While I acknowledge the race conditions in my reproducer (nanosleep/wait EINTR handling), I respectfully disagree that it's fundamentally "bogus." The reproducer demonstrates a real kernel bug affecting production software:
 
 It exposes the exact issue found in CPython's test suite - derived from Python's test_eintr module, which runs successfully on other Unix systems.
 
 Similar reproducers exist for other systems - FreeBSD had an analogous issue (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=203162) with "a C program based on the Python unit test."
 
 The race conditions don't invalidate the core issue - even with proper EINTR handling, the fundamental bug you identified still exists: when a parent's open() is interrupted after the child has opened and closed the FIFO, the retry hangs indefinitely.
 
 Your analysis pinpoints the exact problem: "the parent gets interrupted before resuming and finishing the open system call, so open fails with EINTR. The child closes the fifo and exits. Your code retries the open operation of the parent and waits forever for the child that is already gone."
 
 This breaks Python's test suite on NetBSD while working correctly on Linux, FreeBSD, and other Unix systems.
 
 I agree with your first proposed solution: allow interrupted open() to succeed when the blocking condition is resolved. This aligns better with user expectations and POSIX semantics.
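
 For reference, the retry pattern under discussion can be sketched as below. This is a minimal illustration, not the actual reproducer and not CPython's code; the helper name open_retry_eintr is hypothetical. It assumes the signal handler is installed without SA_RESTART, so that a blocked open() returns -1 with errno set to EINTR.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (name is an assumption, not from the PR):
 * call open() again for as long as it fails with EINTR.
 * Any other failure, or success, is returned to the caller as-is.
 */
int
open_retry_eintr(const char *path, int flags)
{
	int fd;

	do {
		/* For a FIFO with O_WRONLY this blocks until a reader exists. */
		fd = open(path, flags);
	} while (fd == -1 && errno == EINTR);

	return fd;
}
```

 When the path is a FIFO and flags is O_WRONLY, every retry blocks again until a reader has the FIFO open. If the only reader opened and closed it while the previous attempt was being interrupted, nothing remains to wake the retry, which matches the hang described above.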
 

