tech-kern archive


Re: pg_jobc going negative?



On 09/06/2020 at 02:49, Kamil Rytarowski wrote:
pg_jobc is a process group struct member counting the number of
processes with a parent capable of doing job control. Once reaching 0,
the process group is orphaned.
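
The counting rule described above can be sketched in user-space C. This is only an illustration: `struct toy_pgrp` and the `toy_pgrp_*` helpers are hypothetical names, not the NetBSD kernel structures or functions, and a plain `assert` stands in for a kernel assertion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a process group; not the NetBSD struct. */
struct toy_pgrp {
	int jobc;	/* members whose parent can do job control */
};

/* A qualifying member joins the group. */
static void
toy_pgrp_enter(struct toy_pgrp *pg)
{
	pg->jobc++;
}

/*
 * A qualifying member leaves the group.
 * Returns true when the count reaches 0, i.e. the group becomes orphaned.
 */
static bool
toy_pgrp_leave(struct toy_pgrp *pg)
{
	assert(pg->jobc > 0);	/* every leave must be matched by an earlier enter */
	pg->jobc--;
	return pg->jobc == 0;
}
```

With balanced calls the counter never drops below zero; an extra `toy_pgrp_leave` is exactly the situation the assertion is meant to catch.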

With the addition of asserts checking for pg_jobc > 0 (by maxv@), we
caught cases where pg_jobc goes negative. I have landed new ATF tests
that trigger this reliably with ptrace(2)
(src/tests/lib/libc/sys/t_ptrace_fork_wait.h r.1.7). The problem was
originally triggered with GDB.

There are also other sources of these asserts due to races.
The ptrace(2) ATF tests reliably trigger a negative pg_jobc;
however, there are also racy tests that do the same, as triggered by
syzkaller (mentioned at least in [1]).

Should we allow pg_jobc going negative?

I don't think so, the code is not designed to expect negative values.

(Other BSDs allow this.)

They don't "allow it", they just don't have a KASSERT against it.
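
To make the distinction concrete, here is a minimal sketch, assuming a bare integer counter; both helper names are hypothetical and neither is taken from any BSD kernel. The unchecked variant does not "allow" negative values in any designed sense, it simply fails to notice them.

```c
#include <stdbool.h>

/* Unchecked: an unbalanced call silently drives the counter below zero. */
static int
jobc_dec_unchecked(int jobc)
{
	return jobc - 1;
}

/* Checked: refuses to decrement and reports the imbalance to the caller. */
static bool
jobc_dec_checked(int *jobc)
{
	if (*jobc <= 0)
		return false;	/* this is where a KASSERT-style check would fire */
	(*jobc)--;
	return true;
}
```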

Is going negative in the first place a real bug?

There were 11 bugs reported by syzbot which showed severe memory corruption
in this area. While investigating the issues, I looked at the refcounting
stuff and added KASSERTs for sanity checking, and then realized they were
actually firing when using the different reproducers generated by syzbot.

Since I added these KASSERTs, 10 of the 11 random bugs no longer trigger;
instead, the KASSERTs fire earlier. They are listed as duplicates here:

	https://syzkaller.appspot.com/bug?id=50a4ddd341b90cf15a4814048ff51db04347279a

You can see they are all different, but all have to do with reading the
group pointer, which was either freed, overwritten, not initialized,
unmapped, or contained pure garbage. This is typical of refcounting bugs
where a resource disappears under your feet.
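
As a rough illustration of how an unbalanced release makes a resource disappear, here is a user-space sketch. Everything in it is hypothetical (`struct toy_obj`, `toy_hold`, `toy_rele`), and a `live` flag stands in for "the memory is still valid" so the example stays well-defined instead of actually touching freed memory.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical refcounted object; 'live' stands in for "memory still valid". */
struct toy_obj {
	int refs;
	bool live;
};

static void
toy_hold(struct toy_obj *o)
{
	o->refs++;
}

/* Drop one reference; the object is torn down when the count hits 0. */
static void
toy_rele(struct toy_obj *o)
{
	if (--o->refs == 0)
		o->live = false;	/* a real kernel would free the memory here */
}

/*
 * Two users each take a reference, but one release is duplicated.
 * Returns whether the object is still valid for the second user.
 */
static bool
toy_second_user_still_safe(void)
{
	struct toy_obj o = { .refs = 0, .live = true };

	toy_hold(&o);	/* user A */
	toy_hold(&o);	/* user B */
	toy_rele(&o);	/* A drops its reference */
	toy_rele(&o);	/* bug: A drops it a second time */
	return o.live;	/* B still holds a pointer, but the object is gone */
}
```

In this sketch user B never released anything, yet its object is already torn down; whatever B reads next depends on what happened to that memory in the meantime, which would fit the variety of symptoms listed above.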

Only one keeps firing once in a while (KASAN); that's understandable,
because even though there is a big window where the KASSERTs can fire, the
underlying race is still there and there is still the chance we miss the
window and get memory corruptions.

In short, (1) my understanding is that the code is not designed to
expect negative values, and (2) since I added the KASSERTs, 10 of the 11
random bugs no longer trigger. These are strong signs that the bug is
indeed related to refcounting.

It would be nice if someone with better understanding than me of the lwp
group stuff could have a look, though.

Maxime

