NetBSD-Bugs archive
Re: kern/59578: open() hangs indefinitely on FIFO with frequent signal interruption
The following reply was made to PR kern/59578; it has been noted by GNATS.
From: furkanonder <furkanonder%protonmail.com@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc:
Subject: Re: kern/59578: open() hangs indefinitely on FIFO with frequent signal interruption
Date: Thu, 07 Aug 2025 20:41:49 +0000
On Thursday, August 7th, 2025 at 10:35 AM, Michael van Elst via gnats <gnats-admin%NetBSD.org@localhost> wrote:
> The following reply was made to PR kern/59578; it has been noted by GNATS.
>
> From: mlelstv%serpens.de@localhost (Michael van Elst)
> To: gnats-bugs%netbsd.org@localhost
> Cc:
> Subject: Re: kern/59578: open() hangs indefinitely on FIFO with frequent signal interruption
> Date: Thu, 7 Aug 2025 07:32:11 -0000 (UTC)
>
> furkanonder%protonmail.com@localhost writes:
>
> > The underlying problem is that when attempting to open a FIFO for writing (O_WRONLY) while the process receives frequent signals,
> > the open() system call gets repeatedly interrupted with EINTR and never completes successfully, even when a reader process is
> > available on the other end.
>
> Actually the reproducer is bogus as the nanosleep and more important
> the wait system calls are also interrupted by the signal (and not
> retried).
>
> But if you handle all this, there still seems to be a bug.
>
> What happens is that:
>
> - the parent opens the fifo and gets blocked
> - the child opens the fifo and succeeds, waking the parent
> - the parent gets interrupted before resuming and finishing the
> open system call, so open fails with EINTR.
> - the child closes the fifo and exits.
> - your code retries the open operation of the parent and waits
> forever for the child that is already gone.
>
> I'm not sure if that is an allowed scenario, but it would probably be
> better if the interrupted open call would still succeed when the
> reason for blocking is gone at the same time.
>
> An alternative would be to only allow the peer to proceed when
> the open system call returns success. But that synchronisation
> is more complex.
Thank you Michael for the detailed analysis. You've correctly identified the core kernel issue.
While I acknowledge the race conditions in my reproducer (nanosleep/wait EINTR handling), I respectfully disagree that it's fundamentally "bogus." The reproducer demonstrates a real kernel bug affecting production software:
- It exposes the exact issue found in CPython's test suite - derived from Python's test_eintr module, which runs successfully on other Unix systems.
- Similar reproducers exist for other systems - FreeBSD had an analogous issue (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=203162) with "a C program based on the Python unit test."
- The race conditions don't invalidate the core issue - Even with proper EINTR handling, the fundamental bug you identified still exists: when a parent's open() gets interrupted after the child opens/closes the FIFO, the retry hangs indefinitely.
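For reference, here is a minimal sketch of what "proper EINTR handling" could look like for the three calls involved. The helper names (sleep_fully, wait_for_child, open_retry) are hypothetical and are not taken from the reproducer attached to this PR.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

/* Hypothetical helpers for illustration; not the code from the PR. */

/* Sleep for the full duration, resuming with the remaining time on EINTR. */
int sleep_fully(struct timespec req)
{
    struct timespec rem;

    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        req = rem;              /* continue with what is left */
    }
    return 0;
}

/* Wait for a child, retrying when a signal interrupts the wait. */
pid_t wait_for_child(pid_t pid, int *status)
{
    pid_t r;

    do {
        r = waitpid(pid, status, 0);
    } while (r == -1 && errno == EINTR);
    return r;
}

/* Retry open() for as long as it fails with EINTR. */
int open_retry(const char *path, int flags)
{
    int fd;

    assert(path != NULL);
    do {
        fd = open(path, flags);
    } while (fd == -1 && errno == EINTR);
    return fd;
}
```

With the sleep and wait calls retried like this, the remaining problem is confined to open_retry() on the FIFO: each retry starts a fresh blocking open, which is the step that never returns in the scenario described in the analysis.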
Your analysis pinpoints the exact problem: "the parent gets interrupted before resuming and finishing the open system call, so open fails with EINTR. The child closes the fifo and exits. Your code retries the open operation of the parent and waits forever for the child that is already gone."
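The last step of that sequence can be observed without any signals at all. The sketch below is an illustration under my own assumptions (the helper name and the /tmp path are hypothetical): a reader opens and closes the FIFO, then a writer probes it with O_NONBLOCK. POSIX specifies that a non-blocking write-only open fails with ENXIO when no process has the FIFO open for reading, which is the same condition under which the blocking open would go back to sleep.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the PR): create a FIFO, let a reader open
 * and close it, then probe whether a writer could open it without blocking.
 * Returns 0 if the writer open succeeds, otherwise the errno it failed with.
 */
int probe_writer_after_reader_left(void)
{
    char dir[] = "/tmp/fifo-probe-XXXXXX";
    char path[64];
    int result = 0;

    if (mkdtemp(dir) == NULL)
        return errno;
    snprintf(path, sizeof(path), "%s/fifo", dir);
    if (mkfifo(path, 0600) == -1) {
        result = errno;
        rmdir(dir);
        return result;
    }

    /* A non-blocking read-only open succeeds even with no writer. */
    int rfd = open(path, O_RDONLY | O_NONBLOCK);
    assert(rfd != -1);
    close(rfd);                 /* the "child" is now gone */

    /* No reader is left, so the non-blocking writer open fails at once;
     * the blocking variant would sleep here instead. */
    int wfd = open(path, O_WRONLY | O_NONBLOCK);
    if (wfd == -1)
        result = errno;
    else
        close(wfd);

    unlink(path);
    rmdir(dir);
    return result;
}
```

The fact that a reader was present earlier is not remembered once it has closed, so a retried blocking open has nothing left to wake it.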
This breaks Python's test suite on NetBSD, while the same tests pass on Linux, FreeBSD, and other Unix systems.
I agree with your first proposed solution: allow an interrupted open() to succeed when the blocking condition has already been resolved. This aligns better with user expectations and POSIX semantics.