On Wed, 3 Aug 2016, Christos Zoulas wrote:
In article <Pine.NEB.4.64.1608021259170.524%ugly.internal.precedence.co.uk@localhost>,
Stephen Borrill <netbsd%precedence.co.uk@localhost> wrote:
On Mon, 1 Aug 2016, metalliqaz%fastmail.fm@localhost wrote:
I've been very disappointed with the quality of NetBSD 7.0.1 since I
upgraded from 6.1.5 a few weeks ago. I've been running pretty much the
same system config as my home router/NAT/firewall/server since NetBSD
1.5. I believe that's almost 15 years of ipfilter/ipnat. It has always
worked well for me... until I moved to NetBSD 7. I've had several
issues with various parts of the OS, but ipf is the one that causes
random kernel panics.
I've got to agree with you. I've been using NetBSD for commercial products
since 1996 and NetBSD 7 is the first upgrade that's got me nervous.
Kudos to developers who've helped out with USB failing to work, squid
interception, etc. The random lockups and panics with IPfilter are the
most worrying for me though:
http://gnats.netbsd.org/50168
I believe that the bugs are triggered by external packets which is why
they are random (disconnecting from the Internet stops the problems).
Machines which have been solid for months have just started locking. I
count this as a remote DoS vulnerability, but haven't yet tracked down
the triggers.
We need to support an installed base of a mix of netbsd-5 and
netbsd-7 machines. Until we complete the upgrade to netbsd-7, switching to
npf would increase that workload through duplicated effort. Even so, as the
firewall rules are autogenerated and have been developed over a number of
years, moving them is not a small change to make on production systems.
-------------------------------------------------------------------------------
bash-4.3# crash -M netbsd.0.core -N netbsd.0
Crash version 7.0.1, image version 7.0.1.
System panicked: trap
Backtrace from time of crash is available.
crash> bt
_KERNEL_OPT_NARCNET() at 0
_KERNEL_OPT_ACPI_SCANPCI() at _KERNEL_OPT_ACPI_SCANPCI+0x1
vpanic() at vpanic+0x145
snprintf() at snprintf
startlwp() at startlwp
calltrap() at calltrap+0x11
ipf_frag_expire() at ipf_frag_expire+0x76
ipf_slowtimer() at ipf_slowtimer+0x15
ipf_timer_func() at ipf_timer_func+0x2d
callout_softclock() at callout_softclock+0x248
softint_dispatch() at softint_dispatch+0x7d
DDB lost frame for Xsoftintr+0x4f, trying 0xfffffe80cefcaff0
Xsoftintr() at Xsoftintr+0x4f
--- interrupt ---
0:
crash> q
bash-4.3# crash -M netbsd.1.core -N netbsd.1
Crash version 7.0.1, image version 7.0.1.
System panicked: trap
Backtrace from time of crash is available.
crash> bt
_KERNEL_OPT_NARCNET() at 0
_KERNEL_OPT_ACPI_SCANPCI() at _KERNEL_OPT_ACPI_SCANPCI+0x7
vpanic() at vpanic+0x145
snprintf() at snprintf
startlwp() at startlwp
calltrap() at calltrap+0x11
ipf_frag_delete() at ipf_frag_delete+0x74
So the machine that's just started exhibiting this problem is panicking
in the same way as above:
curlwp 0xfffffe847ef2c860 pid 0.5 lowest kstack 0xfffffe811d2042c0
panic: trap
cpu0: Begin traceback...
vpanic() at netbsd:vpanic+0x13c
snprintf() at netbsd:snprintf
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0x96
ipf_frag_expire() at netbsd:ipf_frag_expire+0x152
ipf_slowtimer() at netbsd:ipf_slowtimer+0x15
ipf_timer_func() at netbsd:ipf_timer_func+0x2d
callout_softclock() at netbsd:callout_softclock+0x248
softint_dispatch() at netbsd:softint_dispatch+0x79
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe811d206ff0
Xsoftintr() at netbsd:Xsoftintr+0x4f
Unlike my PR 50168 (which has a different traceback), this is
happening even when the only firewall rules are "pass in all" and
"pass out all". In my PR, those rules were a sufficient workaround, so this
looks like a different problem.
This seems to be dying at:
static void
ipf_frag_free(ipf_frag_softc_t *softf, ipfr_t *fra)
{
	KFREE(fra);
->	FBUMP(ifs_expire);
	softf->ipfr_stats.ifs_inuse--;
}
I would comment out the last 2 lines and see if I get something better.
There seems to be some memory corruption (surprise)....
So:
--- sys/external/bsd/ipf/netinet/ip_frag.c	22 Jul 2012 14:27:51 -0000	1.3
+++ sys/external/bsd/ipf/netinet/ip_frag.c	4 Aug 2016 10:37:03 -0000
@@ -990,8 +990,8 @@
 ipf_frag_free(ipf_frag_softc_t *softf, ipfr_t *fra)
 {
 	KFREE(fra);
-	FBUMP(ifs_expire);
-	softf->ipfr_stats.ifs_inuse--;
+/*	FBUMP(ifs_expire);
+	softf->ipfr_stats.ifs_inuse--;*/
 }
What's the "something better" you are expecting? On this machine (amd64)
I seem to be unable to get coredumps. I suspect this may be down to the
16GB swap partition and 16GB of RAM.
--
Stephen