Subject: pf trouble (packet corruption?)
To: None <tech-net@netbsd.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-net
Date: 01/20/2006 14:47:23
Any pf gurus in the house?  I'm having some trouble with pf;
preliminary indications are that it's corrupting packet contents.

I'm trying to do something somewhat unconventional.  While struggling
with getting pf to do it, I finally resorted to a very simple ping.
The response came back according to tcpdump, but not according to ping.
I finally tried tcpdump -x, and found the reason ping wasn't reporting
the return packets.

Here is tcpdump -x output for an offending example.  This tcpdump was
taken at the interface of the responding machine; I'll explain below
where pf comes into it.

16:58:52.844256 IP 192.168.0.1 > 192.168.0.129: icmp 64: echo request seq 0
        0x0000:  4500 0054 cdda 0000 ff01 6bfb c0a8 0001  E..T......k.....
        0x0010:  c0a8 0081 0800 a132 0256 0000 43d1 5d1c  .......2.V..C.].
        0x0020:  000c c87a 0809 0a0b 0c0d 0e0f 1011 1213  ...z............
        0x0030:  1415 1617 1819 1a1b 1c1d 1e1f 2021 2223  .............!"#
        0x0040:  2425 2627 2829 2a2b 2c2d 2e2f 3031 3233  $%&'()*+,-./0123
        0x0050:  3435 3637                                4567
16:58:52.844296 IP 192.168.0.129 > 192.168.0.1: icmp 64: echo reply seq 0
        0x0000:  4500 0054 0342 0000 ff01 3694 c0a8 0081  E..T.B....6.....
        0x0010:  c0a8 0001 0000 a932 0256 0000 43d1 5d1c  .......2.V..C.].
        0x0020:  000c c87a 0000 0a0b 0c0d 0e0f 1011 1213  ...z............
        0x0030:  1415 1617 1819 1a1b 1c1d 1e1f 2021 2223  .............!"#
        0x0040:  2425 2627 2829 2a2b 2c2d 2e2f 3031 3233  $%&'()*+,-./0123

Note that the third data word on the third line has been changed from
0809 to 0000.  This does not happen when pinging between the same two
hosts over another link, so it's not just that all ping responses are
broken.
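
As a cross-check on the dumps above, here is a minimal Python sketch that
recomputes the ICMP checksums.  It is not part of the original setup, and
the names in it are made up for illustration.  It assumes the reply's last
four bytes, which are cut off in the dump, match the request's (34 35 36 37).
Under that assumption the reply's checksum field (a932) is what one would
expect for an intact payload, so the zeroed word makes the captured reply
fail verification, which would explain why ping ignores it.

```python
# Sketch: verify the ICMP checksums of the two packets shown by tcpdump -x.
# ASSUMPTION: the reply's final four bytes (missing from the dump) are the
# same as the request's, i.e. 34 35 36 37.

def internet_checksum_ok(data: bytes) -> bool:
    """True if the 16-bit one's-complement sum of data folds to 0xffff."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(int.from_bytes(data[i:i + 2], "big")
                for i in range(0, len(data), 2))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF

# ping's payload: an 8-byte timestamp followed by the bytes 0x08..0x37.
timestamp = bytes.fromhex("43d15d1c000cc87a")
pattern = bytes(range(0x08, 0x38))

# ICMP header: type/code, checksum, id, seq (copied from the dumps).
request = bytes.fromhex("0800a13202560000") + timestamp + pattern
reply_header = bytes.fromhex("0000a93202560000")

# What the reply should have looked like (payload echoed unchanged) ...
reply_expected = reply_header + timestamp + pattern
# ... and what was captured: 0809 replaced by 0000.
reply_captured = reply_header + timestamp + b"\x00\x00" + pattern[2:]

print(internet_checksum_ok(request))         # True
print(internet_checksum_ok(reply_expected))  # True
print(internet_checksum_ok(reply_captured))  # False
```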

Specifically, here's an ascii graphic of the setup.  (At the end of
this mail I have full details, from ifconfig and, for the machine
running pf, kernel config diffs.)

+------+
|      | vr1 10.10.10.84/23
| ahab +---------------------------------------------------
|      |                                                   |
+--+---+                                                   |
   | vr0                                                   |
   | 172.16.0.32/24                                        |
   | 172.16.0.16/24 alias                                  |
   |   (this link is a crossover cable)                    |
   | 172.16.0.1/24                                         |
   | fxp0                                                  |
+--+------+                                                |
|         | fxp2 10.10.10.83/23                            |
| legree  +--------------------------------------------    |
|         |                                            |   |
+-+--+--+-+                                            |   |
  |  |  |<-- ex0                                       |   |
  |  |  |    192.168.0.129/30                          |   |
  |  |  |    192.168.0.{200,201,202,203,204}/32 alias  |   |
  |  |<------ vr0                                      |   |
  |  |  |    192.168.0.133/30                          |   |
  |  |  |    192.168.0.{210,211,212,213,214}/32 alias  |   |
  |<-------- fxp1                                      |   |
  |  |  |    192.168.0.137/30                          |   |
  |  |  |    192.168.0.{220,221,222,223,224}/32 alias  |   |
+-+--+--+-+                                            |   |
| switch  |                                            |   |
+---+-----+                                            |   |
    |                                                  |   |
    | 192.168.0.1/24                                   |   |
    | qe0                                              |   |
+---+---+                                              |   |
|       | le0 10.10.10.20/23                           |   |
| water +------------------------------------------    |   |
|       |                                          |   |   |
+-------+                                        +-+---+---+-+
                                                 |  switch   |
                                                 +-----------+

This is, of course, a mockup of a rather more complicated setup; the
three lines from legree towards water are standing in for what will in
the final setup be some kind of microwave link from a commercial
provider; I've done tests on the real thing and believe this mockup is
a reasonable facsimile for these purposes.  In any case, it's this
mockup that I have questions about.  Water and ahab are nothing but
test endpoints, but in case it matters, I'm including their ifconfig
details below too.  The addresses given above are not the final
addresses, of course, but they're the addresses I'm actually using for
development and testing.

This is aimed at a firewall machine (legree being the firewall, ahab
being the "inside", and water the "outside"; the 10.10.10/23
connections are for development convenience and will disappear in the
final setup).  Of the 18 addresses on legree's outside-facing
interfaces, half - three per connection - are to be used for incoming
connections, with multiple A records in the DNS used to get a
rudimentary form of load sharing among connections.  The other three
per link are to be used as source addresses for NATted outgoing
connections, with pf's "round-robin" clause used to load-share.

Here is legree's pf.conf.  I have modified it only by filtering it
through egrep -v '^#' | egrep -v '^$' (in the interests of keeping
this mail from getting any more out of hand than it already is).  It's
quite possible I've done something wrong here; I've been playing with
it in attempts to make this work, so far with little success. :(

intif=fxp0
ant1=ex0
ant2=vr0
ant3=fxp1
oflan=fxp2
table <ain1> const { 192.168.0.202, 192.168.0.203, 192.168.0.204 }
table <ain2> const { 192.168.0.212, 192.168.0.213, 192.168.0.214 }
table <ain3> const { 192.168.0.222, 192.168.0.223, 192.168.0.224 }
table <aout1> const { 192.168.0.129, 192.168.0.200, 192.168.0.201 }
table <aout2> const { 192.168.0.133, 192.168.0.210, 192.168.0.211 }
table <aout3> const { 192.168.0.137, 192.168.0.220, 192.168.0.221 }
table <aout> const {							\
		192.168.0.129, 192.168.0.200, 192.168.0.201,		\
		192.168.0.133, 192.168.0.210, 192.168.0.211,		\
		192.168.0.137, 192.168.0.220, 192.168.0.221 }
aup1 = 192.168.0.130
aup2 = 192.168.0.134
aup3 = 192.168.0.138
table <inside> const { 172.16.0.0/24 }
table <incoming> const { 172.16.0.16/28 }
table <outgoing> const { 172.16.0.32/28 }
mapports = "{" 80 22 "}"
set timeout { interval 5, frag 60, src.track 900 }
set timeout tcp.first 300
set timeout tcp.established 259200
set timeout { adaptive.start 0, adaptive.end 65536 }
set limit { states 65536, frags 65536 }
set optimization conservative
set block-policy return
set state-policy if-bound
scrub in on $intif all random-id fragment reassemble
rdr on $ant1 inet proto tcp from any to <ain1> port $mapports -> 172.16.0.16
rdr on $ant2 inet proto tcp from any to <ain2> port $mapports -> 172.16.0.16
rdr on $ant3 inet proto tcp from any to <ain3> port $mapports -> 172.16.0.16
no nat on $intif from ($intif) to <inside>
nat on $ant1 from <inside> to any -> <aout> round-robin
pass out quick on $ant1 route-to ( $ant1 $aup1 ) from { <ain1> <aout1> } to any
pass out quick on $ant1 route-to ( $ant2 $aup2 ) from { <ain2> <aout2> } to any
pass out quick on $ant1 route-to ( $ant3 $aup3 ) from { <ain3> <aout3> } to any
pass quick all

Legree's default route points to 192.168.0.130 (this is why the three
"pass out quick" lines name $ant1).  I added an rc.d script on legree
that hardwires three arp table entries, for 192.168.0.130, .134, and
.138, to all have water's qe0 MAC address.  This was necessary in order
to get packets to flow outbound at all, since .1 is off-net for the
/30s, and I can't put the same /24 on three different interfaces.  (Or
can I?  I wouldn't expect it to DTRT, even if it takes it.)

# REQUIRE: network
# BEFORE: NETWORKING pf
...boilerplate...
macaddr=02:00:02:33:ea:94

hardwiredarp_start()
{
	. /etc/ifnames
	for intf in $ANT1 $ANT2 $ANT3; do
		arp -s `ifconfig $intf |
		  sed -n -e '/inet/ {' -e 's/.*inet //' -e 's/ .*//' -e 's/\./ /g' -e p -e q -e '}' |
		  awk '{ print $1 "." $2 "." $3 "." $4+3-(2*($4%4)) }'` $macaddr
	done
}
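
For readers who don't want to decode the sed/awk pipeline: it takes the
interface's first inet address and computes the other usable address in
its /30.  The Python sketch below is only an illustration of that
arithmetic, with hypothetical function names.  It mirrors the awk
expression $4+3-(2*($4%4)), cross-checks it against the standard
ipaddress module, and shows that water's 192.168.0.1 is off-net for each
of the /30s.

```python
import ipaddress

def awk_peer(addr: str) -> str:
    """Mirror of the awk expression: last octet becomes $4+3-(2*($4%4))."""
    a, b, c, d = (int(x) for x in addr.split("."))
    return f"{a}.{b}.{c}.{d + 3 - 2 * (d % 4)}"

def slash30_peer(addr: str) -> str:
    """The other usable host address in addr's /30, via ipaddress."""
    net = ipaddress.ip_network(f"{addr}/30", strict=False)
    first, second = list(net.hosts())  # a /30 has exactly two usable hosts
    ip = ipaddress.ip_address(addr)
    if ip not in (first, second):
        raise ValueError(f"{addr} is not a usable host address in {net}")
    return str(second if ip == first else first)

water = ipaddress.ip_address("192.168.0.1")
for primary in ("192.168.0.129", "192.168.0.133", "192.168.0.137"):
    net = ipaddress.ip_network(f"{primary}/30", strict=False)
    # Each line shows: local address, computed peer, whether water is on-net.
    print(primary, awk_peer(primary), slash30_peer(primary), water in net)
```

The awk arithmetic only gives the right answer for the two usable
addresses of a /30; that is fine here because each interface's primary
address is one of them.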

(In case it matters, ahab's default route is 172.16.0.1 and water's
10.10.10.1, the latter being an outbound gateway I didn't draw above.)

The broken ping behaviour I quoted arises when pinging, as the tcpdump
output says, one of legree's addresses from water.  If I ping
10.10.10.83 instead, everything works, and tcpdump -x shows no
corruption.  The only involvement of pf in the "broken" ping as far as
I can see is that the output packet is rerouted by one of the "pass out
quick" rules, but maybe I'm misinterpreting.

It's possible that everything works except for the packet corruption;
thinking about it, most of the more puzzling symptoms I saw could be
explained by corrupted packets causing checksum failures and thus
dropped packets.

So, what am I doing wrong here?  "Trying to use pf" would be a valid
answer, if pf in 3.0 simply isn't ready for use; I've been thinking
about alternatives against the possibility that this simply can't be
made to work.  (In particular, is ipnat/ipf capable of doing anything
like pf's round-robin (or random) for outbound load-sharing?  That is
the major reason for using pf in this setup.)

If there's some better way of doing this kind of load-sharing, that too
I'd love to hear about.  The provider that the switch between water and
legree is sitting in for is rather unhelpful about it; for example,
they won't give three /29s - they just gave us a pool of 30 addresses
out of the /24, and it was luck that they were contiguous enough to
contain three /30s.  (The 18 addresses in use, plus 9 more lost to the
all-0, all-1, and other-end addresses in the /30s, makes 27; the plan I
have leaves three addresses unused.)
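
The address accounting in the previous paragraph can be checked with a
short Python sketch.  This is just the arithmetic spelled out; the
variable names are invented, the pool size of 30 is taken from the text,
and the addresses come from legree's ifconfig output.

```python
import ipaddress

POOL_SIZE = 30  # size of the provider's pool, as stated in the mail

# Primary (/30) address and first alias octet for ex0, vr0 and fxp1.
links = [("192.168.0.129", 200), ("192.168.0.133", 210), ("192.168.0.137", 220)]

in_use = set()    # addresses actually configured on legree
overhead = set()  # all-0, all-1 and other-end address of each /30

for primary, alias_base in links:
    net = ipaddress.ip_network(f"{primary}/30", strict=False)
    me = ipaddress.ip_address(primary)
    in_use.add(me)
    in_use.update(ipaddress.ip_address(f"192.168.0.{alias_base + i}")
                  for i in range(5))
    # Iterating a network yields every address, including network/broadcast.
    overhead.update(a for a in net if a != me)

unused = POOL_SIZE - len(in_use) - len(overhead)
print(len(in_use), len(overhead), unused)  # 18 9 3
```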

Full detailed config information:

water: SPARCstation LX, a rather old OS (my private 1.4T-derived tree),
with a qec (a quad Ethernet card) in it.  I'm using two interfaces:

le0: flags=8863<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	address: 08:00:20:1f:7c:95
	media: Ethernet autoselect (10baseT)
	status: active
	inet 10.10.10.20 netmask 0xfffffe00 broadcast 10.10.11.255
	inet6 fe80::a00:20ff:fe1f:7c95%le0 prefixlen 64 scopeid 0x1
qe0: flags=8863<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	address: 02:00:02:33:ea:94
	media: Ethernet autoselect (10baseT)
	status: active
	inet 192.168.0.1 netmask 0xffffff00 broadcast 192.168.0.255
	inet6 fe80::2ff:fe33:ea94%qe0 prefixlen 64 scopeid 0x2

ahab: i386, 2.0.2, with two ethernets (three, actually, but one of them
is completely untouched since boot).  I'm using:

vr0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	address: 00:50:ba:0d:5f:f6
	media: Ethernet autoselect (100baseTX full-duplex)
	status: active
	inet 172.16.0.32 netmask 0xffffff00 broadcast 172.16.0.255
	inet alias 172.16.0.16 netmask 0xffffff00 broadcast 172.16.0.255
	inet6 fe80::250:baff:fe0d:5ff6%vr0 prefixlen 64 scopeid 0x2
vr1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	address: 00:50:ba:11:57:65
	media: Ethernet autoselect (100baseTX)
	status: active
	inet 10.10.10.84 netmask 0xfffffe00 broadcast 10.10.11.255
	inet6 fe80::250:baff:fe11:5765%vr1 prefixlen 64 scopeid 0x3

legree: i386, 3.0, with five ethernets.

ex0: flags=8a63<UP,BROADCAST,NOTRAILERS,RUNNING,ALLMULTI,SIMPLEX,MULTICAST> mtu 1500
	address: 00:10:4b:63:01:94
	media: Ethernet autoselect (100baseTX full-duplex)
	status: active
	inet 192.168.0.129 netmask 0xfffffffc broadcast 192.168.0.131
	inet alias 192.168.0.200 netmask 0xffffffff broadcast 192.168.0.200
	inet alias 192.168.0.201 netmask 0xffffffff broadcast 192.168.0.201
	inet alias 192.168.0.202 netmask 0xffffffff broadcast 192.168.0.202
	inet alias 192.168.0.203 netmask 0xffffffff broadcast 192.168.0.203
	inet alias 192.168.0.204 netmask 0xffffffff broadcast 192.168.0.204
	inet6 fe80::210:4bff:fe63:194%ex0 prefixlen 64 scopeid 0x1
vr0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	address: 00:0d:88:b5:32:43
	media: Ethernet autoselect (100baseTX full-duplex)
	status: active
	inet 192.168.0.133 netmask 0xfffffffc broadcast 192.168.0.135
	inet alias 192.168.0.210 netmask 0xffffffff broadcast 192.168.0.210
	inet alias 192.168.0.211 netmask 0xffffffff broadcast 192.168.0.211
	inet alias 192.168.0.212 netmask 0xffffffff broadcast 192.168.0.212
	inet alias 192.168.0.213 netmask 0xffffffff broadcast 192.168.0.213
	inet alias 192.168.0.214 netmask 0xffffffff broadcast 192.168.0.214
	inet6 fe80::20d:88ff:feb5:3243%vr0 prefixlen 64 scopeid 0x2
fxp0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	address: 00:02:b3:1c:74:36
	media: Ethernet autoselect (100baseTX full-duplex,flowcontrol,rxpause,txpause)
	status: active
	inet 172.16.0.1 netmask 0xffffff00 broadcast 172.16.0.255
	inet6 fe80::202:b3ff:fe1c:7436%fxp0 prefixlen 64 scopeid 0x3
fxp1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	capabilities=6<TCP4CSUM,UDP4CSUM>
	enabled=0
	address: 00:02:b3:ee:b1:5d
	media: Ethernet autoselect (100baseTX full-duplex,flowcontrol,rxpause,txpause)
	status: active
	inet 192.168.0.137 netmask 0xfffffffc broadcast 192.168.0.139
	inet alias 192.168.0.220 netmask 0xffffffff broadcast 192.168.0.220
	inet alias 192.168.0.221 netmask 0xffffffff broadcast 192.168.0.221
	inet alias 192.168.0.222 netmask 0xffffffff broadcast 192.168.0.222
	inet alias 192.168.0.223 netmask 0xffffffff broadcast 192.168.0.223
	inet alias 192.168.0.224 netmask 0xffffffff broadcast 192.168.0.224
	inet6 fe80::202:b3ff:feee:b15d%fxp1 prefixlen 64 scopeid 0x4
fxp2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	address: 00:0d:61:12:fe:f5
	media: Ethernet autoselect (100baseTX)
	status: active
	inet 10.10.10.83 netmask 0xfffffe00 broadcast 10.10.11.255
	inet6 fe80::20d:61ff:fe12:fef5%fxp2 prefixlen 64 scopeid 0x5

I rebuilt legree's kernel with pf in it, of course.  Specifically, here
are diffs from GENERIC to legree's kernel config.  (The GENERIC is
GENERIC,v 1.661.2.8, MD5 sum 6d9cfbffceb1abdc597d21f59961c7db.)  I
think the most relevant changes are that I turned on the pf and pflog
pseudo-devices, but here's the whole diff in case.

70c70
< options 	INSECURE	# disable kernel security levels - X needs this
---
> #options 	INSECURE	# disable kernel security levels - X needs this
90c90
< options 	LKM		# loadable kernel modules
---
> #options 	LKM		# loadable kernel modules
146,147c146,147
< file-system 	EXT2FS		# second extended file system (linux)
< file-system 	LFS		# log-structured file system
---
> #file-system 	EXT2FS		# second extended file system (linux)
> #file-system 	LFS		# log-structured file system
150c150
< file-system 	NTFS		# Windows/NT file system (experimental)
---
> #file-system 	NTFS		# Windows/NT file system (experimental)
155,157c155,157
< file-system 	NULLFS		# loopback file system
< file-system 	OVERLAY		# overlay file system
< file-system 	PORTAL		# portal filesystem (still experimental)
---
> #file-system 	NULLFS		# loopback file system
> #file-system 	OVERLAY		# overlay file system
> #file-system 	PORTAL		# portal filesystem (still experimental)
159,162c159,162
< file-system 	UMAPFS		# NULLFS + uid and gid remapping
< file-system 	UNION		# union file system
< file-system	CODA		# Coda File System; also needs vcoda (below)
< file-system	SMBFS		# experimental - CIFS; also needs nsmb (below)
---
> #file-system 	UMAPFS		# NULLFS + uid and gid remapping
> #file-system 	UNION		# union file system
> #file-system	CODA		# Coda File System; also needs vcoda (below)
> #file-system	SMBFS		# experimental - CIFS; also needs nsmb (below)
166,167c166,167
< #options 	FFS_EI		# FFS Endian Independent support
< options 	SOFTDEP		# FFS soft updates support.
---
> options 	FFS_EI		# FFS Endian Independent support
> #options 	SOFTDEP		# FFS soft updates support.
169c169
< options 	NFSSERVER	# Network File System server
---
> #options 	NFSSERVER	# Network File System server
175c175
< #options 	GATEWAY		# packet forwarding
---
> options 	GATEWAY		# packet forwarding
184c184
< options 	NS		# XNS
---
> #options 	NS		# XNS
186c186
< options 	ISO,TPIP	# OSI
---
> #options 	ISO,TPIP	# OSI
188,192c188,192
< options 	CCITT,LLC,HDLC	# X.25
< options 	NETATALK	# AppleTalk networking protocols
< options 	PPP_BSDCOMP	# BSD-Compress compression support for PPP
< options 	PPP_DEFLATE	# Deflate compression support for PPP
< options 	PPP_FILTER	# Active filter support for PPP (requires bpf)
---
> #options 	CCITT,LLC,HDLC	# X.25
> #options 	NETATALK	# AppleTalk networking protocols
> #options 	PPP_BSDCOMP	# BSD-Compress compression support for PPP
> #options 	PPP_DEFLATE	# Deflate compression support for PPP
> #options 	PPP_FILTER	# Active filter support for PPP (requires bpf)
231c231
< options 	EISAVERBOSE	# verbose EISA device autoconfig messages
---
> #options 	EISAVERBOSE	# verbose EISA device autoconfig messages
241c241
< options 	MCAVERBOSE	# verbose MCA device autoconfig messages
---
> #options 	MCAVERBOSE	# verbose MCA device autoconfig messages
243c243
< options 	NFS_BOOT_DHCP,NFS_BOOT_BOOTPARAM
---
> #options 	NFS_BOOT_DHCP,NFS_BOOT_BOOTPARAM
427,428c427,428
< eisa0	at mainbus?
< eisa0	at pceb?
---
> #eisa0	at mainbus?
> #eisa0	at pceb?
441c441
< mca0	at mainbus?
---
> #mca0	at mainbus?
546c546
< com*	at mca? slot ?			# 16x50s on comm boards
---
> #com*	at mca? slot ?			# 16x50s on comm boards
599,603c599,603
< ahb*	at eisa? slot ?			# Adaptec 174[02] SCSI
< ahc*	at eisa? slot ?			# Adaptec 274x, aic7770 SCSI
< bha*	at eisa? slot ?			# BusLogic 7xx SCSI
< dpt*	at eisa? slot ?			# DPT EATA SCSI
< uha*	at eisa? slot ?			# UltraStor 24f SCSI
---
> #ahb*	at eisa? slot ?			# Adaptec 174[02] SCSI
> #ahc*	at eisa? slot ?			# Adaptec 274x, aic7770 SCSI
> #bha*	at eisa? slot ?			# BusLogic 7xx SCSI
> #dpt*	at eisa? slot ?			# DPT EATA SCSI
> #uha*	at eisa? slot ?			# UltraStor 24f SCSI
637c637
< aha*	at mca? slot ?			# Adaptec AHA-1640
---
> #aha*	at mca? slot ?			# Adaptec AHA-1640
655c655
< cac*	at eisa? slot ?			# Compaq EISA array controllers
---
> #cac*	at eisa? slot ?			# Compaq EISA array controllers
659c659
< mlx*	at eisa? slot ?			# Mylex DAC960 & DEC SWXCR family
---
> #mlx*	at eisa? slot ?			# Mylex DAC960 & DEC SWXCR family
760,761c760,761
< edc*	at mca? slot ?			# IBM ESDI Disk Controllers
< ed*	at edc?
---
> #edc*	at mca? slot ?			# IBM ESDI Disk Controllers
> #ed*	at edc?
809,811c809,811
< ep*	at eisa? slot ?			# 3Com 3c579 Ethernet
< fea*	at eisa? slot ?			# DEC DEFEA FDDI
< tlp*	at eisa? slot ?			# DEC DE-425 Ethernet
---
> #ep*	at eisa? slot ?			# 3Com 3c579 Ethernet
> #fea*	at eisa? slot ?			# DEC DEFEA FDDI
> #tlp*	at eisa? slot ?			# DEC DE-425 Ethernet
879,885c879,885
< elmc*	at mca? slot ?			# 3Com EtherLink/MC (3c523)
< ep*	at mca? slot ?			# 3Com EtherLink III (3c529)
< we*	at mca? slot ?			# WD/SMC Ethernet
< ate*	at mca? slot ?			# Allied Telesis AT1720
< ne*	at mca? slot ?			# Novell NE/2 and clones
< tr*	at mca? slot ?			# IBM Token Ring adapter
< le*	at mca? slot ?			# SKNET Personal/MC2+
---
> #elmc*	at mca? slot ?			# 3Com EtherLink/MC (3c523)
> #ep*	at mca? slot ?			# 3Com EtherLink III (3c529)
> #we*	at mca? slot ?			# WD/SMC Ethernet
> #ate*	at mca? slot ?			# Allied Telesis AT1720
> #ne*	at mca? slot ?			# Novell NE/2 and clones
> #tr*	at mca? slot ?			# IBM Token Ring adapter
> #le*	at mca? slot ?			# SKNET Personal/MC2+
1250,1251c1250,1251
< pseudo-device	strip		2	# Starmode Radio IP (Metricom)
< pseudo-device	irframetty		# IrDA frame line discipline
---
> #pseudo-device	strip		2	# Starmode Radio IP (Metricom)
> #pseudo-device	irframetty		# IrDA frame line discipline
1261,1262c1261,1262
< #pseudo-device	pf			# PF packet filter
< #pseudo-device	pflog			# PF log if
---
> pseudo-device	pf			# PF packet filter
> pseudo-device	pflog			# PF log if
1266c1266
< pseudo-device	tb		1	# tablet line discipline
---
> #pseudo-device	tb		1	# tablet line discipline
1274c1274
< pseudo-device	vcoda		4	# coda minicache <-> venus comm.
---
> #pseudo-device	vcoda		4	# coda minicache <-> venus comm.
1277c1277
< pseudo-device	nsmb			# experimental - SMB requester
---
> #pseudo-device	nsmb			# experimental - SMB requester

I can provide more information if anyone is interested.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B