tech-net archive


Re: lockdebug kernel instacrash when npf enabled



Replying to my own posting :-) I just got bitten by this on HEAD.
I don't think it is acceptable for LOCKDEBUG to be broken for such
a long time on HEAD, and we should not be cutting a branch without
fixing it. So what should we do?

Best,

christos

| In article <20190224212250.2B9ED84DB1%mail.netbsd.org@localhost>,
| Mindaugas Rasiukevicius  <rmind%netbsd.org@localhost> wrote:
| >Tobias Nygren <tnn%NetBSD.org@localhost> wrote:
| >> > Enabling NPF.
| >> > [  22.6038371] panic: kernel debugging assertion "pserialize_not_in_read_section()" failed: file "/work/src/sys/kern/kern_mutex.c", line 527
| >> > [  22.7529500] cpu0: Begin traceback...
| >> > [  22.7976654] 0x99deba54: netbsd:db_panic+0x14
| >> > [  22.8465447] 0x99deba6c: netbsd:vpanic+0x194
| >> > [  22.8985454] 0x99deba84: netbsd:__aeabi_uldivmod
| >> > [  22.9505468] 0x99debb04: netbsd:mutex_enter+0x5f4
| >> > [  22.9994280] 0x99debb4c: netbsd:npf_table_lookup+0x134
| >> > [  23.0597517] 0x99debb74:
| >> <...>
| >> 
| >> r1.29 of npf_tableset.c changed t_lock from IPL_NET to IPL_NONE.
| >> Based on the above it looks like it needs to be at IPL_SOFTNET.
| >> @rmind could you please have a look?
| >
| >It is a bug, but only one aspect of it.  Yes, the mutex can be IPL_SOFTNET,
| >but it actually behaves more or less as IPL_NONE.  The real bug is that the
| >code path in question might block.  There are a few ways to fix this:
| >
| >- Convert the mutex to a spin lock at IPL_NET (but it is excessive) and
| >convert the memory allocations in that code path to KM_NOSLEEP.
| >
| >- Extend pserialize(9) by implementing Sleepable RCU (SRCU) or equivalent.
| >
| >- Sprinkle psref(9), but that is ugly and undesirable in the long-term.
| >
| >I have not had free time to work on a solution yet, but I hope to fix
| >this soonish and commit with a next batch of the NPF fixes/improvements.
| >
| >Meanwhile, if you want to run with LOCKDEBUG until this gets fixed, then
| >as a workaround I can suggest commenting out that assert, as you are very
| >unlikely to hit the crash condition of this bug; it can only happen when
| >you perform an NPF reload, plus you need to be unlucky enough to have the
| >relevant mutex (used only for LPM-type tables) contended and blocking.
| 
| But commenting out the assert will cripple the test for everything. We've
| discussed this before and we even had a psref patch IIRC, why did it
| get lost?
| 
| christos
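
For readers who want the failure mode spelled out, here is a minimal
userspace model of it in C. It assumes that the lookup runs inside a
pserialize(9) read section, as the name of the failed assertion suggests,
and that taking a lock that may block is not allowed there. Every name
below (model_cpu, model_mutex_enter, model_table_lookup, ...) is
hypothetical and chosen for illustration only; none of it is NetBSD kernel
API or the actual NPF code, and the model returns an error where a
LOCKDEBUG kernel would panic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are NOT NetBSD kernel interfaces. */
struct model_cpu {
	int read_depth;		/* nesting depth of read sections */
};

struct model_mutex {
	bool held;
};

static void
model_read_enter(struct model_cpu *c)
{
	c->read_depth++;
}

static void
model_read_exit(struct model_cpu *c)
{
	assert(c->read_depth > 0);
	c->read_depth--;
}

/*
 * Acquire a lock that may block.  Returns 0 on success, or -1 if the
 * caller is inside a read section, which is the situation the kernel
 * assertion in the quoted panic is assumed to catch.
 */
static int
model_mutex_enter(struct model_cpu *c, struct model_mutex *m)
{
	if (c->read_depth != 0)
		return -1;
	m->held = true;
	return 0;
}

static void
model_mutex_exit(struct model_mutex *m)
{
	m->held = false;
}

/*
 * Same shape as the path in the traceback: a lookup that takes a
 * possibly-blocking lock while still inside the read section.
 */
static int
model_table_lookup(struct model_cpu *c, struct model_mutex *m)
{
	int rv;

	model_read_enter(c);
	rv = model_mutex_enter(c, m);
	if (rv == 0)
		model_mutex_exit(m);
	model_read_exit(c);
	return rv;
}
```

In this model, model_mutex_enter() succeeds outside a read section, while
model_table_lookup() always reports -1 because it takes the lock after
model_read_enter(). The three options listed in the quoted mail correspond
to making the lock non-blocking, making the read section sleepable, or
replacing the read section with a reference on that path.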

