Subject: DOS on private network, and pf bug
To: None <tech-kern@netbsd.org>
From: George Georgalis <george@galis.org>
List: tech-kern
Date: 03/02/2007 16:20:30
I lost all shell access to my NetBSD 3.1 box yesterday. It's on a
secure network where only trusted users can access it. They might
connect through a wired switch, through wifi + vpn to that switch
(locally or remotely), or through a wired remote vpn connection.

The host does very little, network protocol wise: internet and
mail gw, lan dns and sshd. It's running pf and pflogd (though I'm
not listening to pflogd atm) with fairly simple pf rules. Other
than nat, pass and block lines, we have the following pf rules:

scrub in all fragment reassemble
no rdr on lo0 from any to any
antispoof log for wm0 inet
antispoof log for wm1 inet
antispoof log for fxp0 inet

This setup has been working without issue for about 45 days. That changed
yesterday when nobody could connect to sshd, but everything else worked.
A RAM test was perfect. The only anomaly was in the kernel log.

Our logging system rotates by log size, not time, and we keep 10 MB
of old logs. That might change, but that's what we have.

The oldest log starts like this:

2007-02-28 13:34:29.767786500 pf_normalize_ip: reass frag 64442 @ 20720-22200
2007-02-28 13:34:29.767895500 pf_normalize_ip: reass frag 64442 @ 22200-23680
2007-02-28 13:34:29.767903500 pf_normalize_ip: reass frag 64442 @ 23680-25160
2007-02-28 13:34:29.767908500 pf_normalize_ip: reass frag 64442 @ 25160-26640
2007-02-28 13:34:29.767913500 pf_normalize_ip: reass frag 64442 @ 26640-28120
2007-02-28 13:34:29.767918500 pf_normalize_ip: reass frag 64442 @ 28120-29600
2007-02-28 13:34:29.767923500 pf_normalize_ip: reass frag 64442 @ 29600-31080
2007-02-28 13:34:29.767928500 pf_normalize_ip: reass frag 64442 @ 31080-32560
2007-02-28 13:34:29.767933500 pf_normalize_ip: reass frag 64442 @ 32560-32920
2007-02-28 13:34:29.767938500 pf_reassemble: 32920 < 32920?
2007-02-28 13:34:29.767943500 pf_reassemble: complete: 0xc3eb3200(32940)
2007-02-28 13:34:29.767947500 pf_normalize_ip: reass frag 64954 @ 0-1480
2007-02-28 13:34:29.767952500 pf_normalize_ip: reass frag 64954 @ 1480-2960
2007-02-28 13:34:29.767957500 pf_normalize_ip: reass frag 64954 @ 2960-4440
2007-02-28 13:34:29.767962500 pf_normalize_ip: reass frag 64954 @ 4440-5920
2007-02-28 13:34:29.767967500 pf_normalize_ip: reass frag 64954 @ 5920-7400
2007-02-28 13:34:29.767972500 pf_normalize_ip: reass frag 64954 @ 7400-8880
2007-02-28 13:34:29.767981500 pf_normalize_ip: reass frag 64954 @ 8880-10360
2007-02-28 13:34:29.767986500 pf_normalize_ip: reass frag 64954 @ 10360-11840
2007-02-28 13:34:29.768010500 pf_normalize_ip: reass frag 64954 @ 11840-13320
2007-02-28 13:34:29.768015500 pf_normalize_ip: reass frag 64954 @ 13320-14800
2007-02-28 13:34:29.768020500 pf_normalize_ip: reass frag 64954 @ 14800-16280
2007-02-28 13:34:29.768025500 pf_normalize_ip: reass frag 64954 @ 16280-17760
2007-02-28 13:34:29.768030500 pf_normalize_ip: reass frag 64954 @ 17760-19240
2007-02-28 13:34:29.768035500 pf_normalize_ip: reass frag 64954 @ 19240-20720
2007-02-28 13:34:29.768040500 pf_normalize_ip: reass frag 64954 @ 20720-22200
2007-02-28 13:34:29.768045500 pf_normalize_ip: reass frag 64954 @ 22200-23680
2007-02-28 13:34:29.768050500 pf_normalize_ip: reass frag 64954 @ 23680-25160
2007-02-28 13:34:29.768055500 pf_normalize_ip: reass frag 64954 @ 25160-26640
2007-02-28 13:34:29.768078500 pf_normalize_ip: reass frag 64954 @ 26640-28120
2007-02-28 13:34:29.768084500 pf_normalize_ip: reass frag 64954 @ 28120-29600
2007-02-28 13:34:29.768089500 pf_normalize_ip: reass frag 64954 @ 29600-31080
2007-02-28 13:34:29.768094500 pf_normalize_ip: reass frag 64954 @ 31080-32560
2007-02-28 13:34:29.768099500 pf_normalize_ip: reass frag 64954 @ 32560-32920
2007-02-28 13:34:29.768104500 pf_reassemble: 32920 < 32920?
2007-02-28 13:34:29.768109500 pf_reassemble: complete: 0xc3f84900(32940)
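
As a sanity check on the excerpt above, the logged fragment offsets can be
tested for gaps. The following is only a minimal Python sketch, not part of
the original report: the helper name payload_length and the regular
expression are hypothetical, and the 20-byte header is an assumption (an
IPv4 header without options). Under those assumptions the numbers look
consistent: 22 fragments of 1480 bytes plus a final 360 bytes give 32920,
and adding 20 matches the 32940 printed by pf_reassemble, if that figure is
the total datagram length.

```python
import re

# Matches the pf debug lines quoted above: "reass frag <id> @ <start>-<end>".
FRAG_RE = re.compile(r"reass frag (\d+) @ (\d+)-(\d+)")

# Assumption: plain IPv4 header without options.
IP_HEADER_BYTES = 20


def payload_length(lines, frag_id):
    """Return the reassembled payload size for frag_id, or None if the
    logged ranges are missing or not exactly contiguous from offset 0."""
    spans = sorted(
        (int(m.group(2)), int(m.group(3)))
        for m in (FRAG_RE.search(line) for line in lines)
        if m and int(m.group(1)) == frag_id
    )
    if not spans:
        return None
    expected_start = 0
    for start, end in spans:
        if start != expected_start:
            return None
        expected_start = end
    return expected_start


sample = [
    "2007-02-28 13:34:29.767947500 pf_normalize_ip: reass frag 64954 @ 0-1480",
    "2007-02-28 13:34:29.767952500 pf_normalize_ip: reass frag 64954 @ 1480-2960",
    "2007-02-28 13:34:29.767957500 pf_normalize_ip: reass frag 64954 @ 2960-4440",
]

print(payload_length(sample, 64954))      # contiguous from offset 0
print(payload_length(sample[1:], 64954))  # first fragment missing
print(22 * 1480 + 360 + IP_HEADER_BYTES)  # arithmetic for the full datagram
```

Fed a whole log file, the same helper would flag any fragment id whose
ranges do not line up, which might help tell ordinary large datagrams from
malformed ones.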


Note that this whole excerpt spans less than 4 ten-thousandths of a
second, and the log continued at a similar rate for the next 10 MB.
Normally we have nothing but boot messages in this log.
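
The rate can be estimated from the timestamps themselves. This is a rough
Python sketch under stated assumptions: the function names are
hypothetical, and it assumes every line starts with a date and a time
carrying nine fractional digits, as in the excerpt. Python's strptime only
handles microseconds, so the fraction is parsed by hand. The two lines used
here are the first and last of the excerpt, which shows 36 lines in total.

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)


def stamp_to_ns(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.fffffffff' of a log line
    into integer nanoseconds, keeping all nine fractional digits."""
    date, clock = line.split()[:2]
    whole, _, frac = clock.partition(".")
    delta = datetime.strptime(f"{date} {whole}", "%Y-%m-%d %H:%M:%S") - EPOCH
    seconds = delta.days * 86400 + delta.seconds
    return seconds * 1_000_000_000 + int(frac.ljust(9, "0")[:9])


def lines_per_second(n_lines, span_ns):
    """Average line rate over a span: n_lines - 1 intervals in span_ns."""
    if span_ns <= 0:
        return float("inf")
    return (n_lines - 1) * 1_000_000_000 / span_ns


first = "2007-02-28 13:34:29.767786500 pf_normalize_ip: reass frag 64442 @ 20720-22200"
last = "2007-02-28 13:34:29.768109500 pf_reassemble: complete: 0xc3f84900(32940)"

span_ns = stamp_to_ns(last) - stamp_to_ns(first)
print(f"span: {span_ns / 1000:.0f} microseconds")
print(f"rate: {lines_per_second(36, span_ns):.0f} lines per second")
```

This only measures how fast the kernel emitted debug lines, not the packet
rate on the wire.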

When I filter the logs with

grep -hEv '(pf_normalize_ip|pf_reassemble)'

the 10 MB of results reduces to 21K, consisting of
nothing more than lines like the example in this discussion:
http://lists.freebsd.org/pipermail/freebsd-pf/2006-November/002768.html
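
For a summary rather than a filter, the lines can be counted by message
source. This is a small Python sketch, not something run on the actual
logs: count_by_tag is a hypothetical helper, and it assumes each line has
the form "date time tag: message" as in the excerpt earlier. Lines in other
formats are lumped under "(unparsed)".

```python
from collections import Counter


def count_by_tag(lines):
    """Count log lines by their third whitespace-separated field,
    e.g. 'pf_normalize_ip:' is counted under 'pf_normalize_ip'."""
    counts = Counter()
    for line in lines:
        fields = line.split(None, 3)
        tag = fields[2].rstrip(":") if len(fields) >= 3 else "(unparsed)"
        counts[tag] += 1
    return counts


sample = [
    "2007-02-28 13:34:29.767933500 pf_normalize_ip: reass frag 64442 @ 32560-32920",
    "2007-02-28 13:34:29.767938500 pf_reassemble: 32920 < 32920?",
    "2007-02-28 13:34:29.767943500 pf_reassemble: complete: 0xc3eb3200(32940)",
    "2007-02-28 13:34:29.767947500 pf_normalize_ip: reass frag 64954 @ 0-1480",
]

for tag, n in count_by_tag(sample).most_common():
    print(tag, n)
```

Any tag other than the two filtered out by the grep above would stand out
immediately in such a count.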

In that FreeBSD thread, Daniel Hartmeier explains that an off-by-one
error in pf state tracking leads to log lines like this:

Nov 10 15:40:24 ehost kernel: pf: State failure on:         |

The patch:
http://www.openbsd.org/cgi-bin/cvsweb/src/sys/net/pf.c.diff?r1=1.514&r2=1.515&f=h

On my host I have about 10 of those right at the end, when sshd
stopped responding and the other messages stopped being written.

My best guess is that Windows hosts, configured to wifi vpn to the
same network they were wired to (and which also provides a GW),
produced some sort of STP-like condition... yada yada yada.
I don't really know what happened, but I wanted to report that we
lost sshd. Actually, we lost the console too: I saw kernel messages
on the screen and was able to get a password: prompt after pressing
return, but after quite a while (1hr+) I was never able to get
another login prompt or switch virtual terminals.

I guess it was writing log files?  But with only gw, mail relay,
and dns available (and all working), we had to power cycle the
machine. I ran fsck from an install disk, and the only anomaly
was the pf klog entries. The system came up and has been running
without error, with only dmesg boot stuff in klog, ever since
(except for last night, when memtest86+ was running and completed
with no error). I cannot reproduce the problem with workstations
doing every type of connection.

So, apparently Daniel Hartmeier's patch has not been applied to NetBSD,
and maybe there is a DoS condition with pf/sshd?

// George


-- 
George Georgalis, systems architect, administrator <IXOYE><