tech-net archive

Re: Complete lock-up from using pkgsrc/net/darkstat



>> On NetBSD 8, 9, current, [...] Stop darkstat.  Machine locks.
>
>> [...] in case anyone can imagine how and why a complete system lockup
>> could happen as the result of an interface being used in promiscuous
>> mode for long periods of time (and not when used that way for short
>> periods of time).
>
> Don't forget, it may _not_ be "as the result of an interface being used
> in promiscuous mode for long periods of time".  That's merely a
> correlate (and possibly not a perfect correlate - your sample size is
> small); the causality may be more complicated.  (For example, maybe
> it's actually as a result of receiving certain traffic which is on that
> segment but which it wouldn't normally receive.  Maybe it's got nothing
> to do with network interfaces and instead is related to something else
> darkstat does - I know nothing about what darkstat does or doesn't do,
> except for your implication that it runs interfaces promiscuous.)

You're absolutely right - I don't know this for sure, but I can add some more information.

I've seen occasional lockups (once or twice per year) on a number of systems - at least five different systems - which are all running as NAT routers and firewalls for various heavily used networks. Two systems were running NetBSD 8 with ipfilter, one with wm* as the public interface, the other with re*. Two systems were running NetBSD 9 with npf, one with wm*, one with re*. The fifth was running 9.99.93 with re0 as the public interface and npf.

It was on this last one that I ran "/etc/rc.d/darkstat stop" and saw the machine lock up completely, so I had to have someone physically go and power cycle it. I know that when interfaces switch from promiscuous to non-promiscuous, they can lose link for a moment, but the machine wasn't reachable from the internal network, either, and it didn't respond when a USB keyboard was connected to it (no green lines from the kernel). Also, pressing the power button didn't trigger a poweroff event, so I know it was completely locked.

Random lockups are one thing, and a specific lockup when stopping darkstat is another, but to add to this, one location has two identical machines, one of which occasionally locked up under exceptionally high network load, while the other never did. To ascertain whether it was a hardware fault, the drives were swapped between them. The problem continued and moved with the drive, so the OSes were then reinstalled, and one machine still kept occasionally locking up. Only after seeing the lockup when stopping darkstat did I realize that the machine that kept having occasional lockups was running darkstat on boot.

These lockups have bugged the heck out of me for many years - at least five - and I'm kicking myself that I only now realize that all the machines that were 100% stable for multiple years weren't running darkstat, and the ones that were problematic were running darkstat. I should've realized this ages ago.

> It also might be relevant to note which port you're running.  It must
> be capable of having re and wm interfaces, since you name them, but
> that still includes a fair bit.

With five different machines, I don't think it's likely an issue with all five interfaces, but it's always better to have too much information than too little:

NetBSD 8.2 (1-May-2020):
wm0 at pci2 dev 0 function 0: Intel i82574L (rev. 0x00)
wm0: for TX and RX interrupting at msix2 vec 0 affinity to 1
wm0: for TX and RX interrupting at msix2 vec 1 affinity to 2
wm0: for LINK interrupting at msix2 vec 2
wm0: PCI-Express bus
wm0: 2048 words FLASH, version 1.8.0, Image Unique ID 0000ffff
wm0: ASPM L0s and L1 are disabled to workaround the errata.
wm0: Ethernet address 00:1b:21:b5:51:e7
wm0: 0x224480<FLASH,IOH_VALID,PCIE,ASF_FIRM,WOL>
makphy0 at wm0 phy 1: Marvell 88E1149 Gigabit PHY, rev. 1

NetBSD 8.2 (1-May-2020):
re0 at pci2 dev 0 function 0: RealTek 8168/8111 PCIe Gigabit Ethernet (rev. 0x0c)
re0: interrupting at msi2 vec 0
re0: Ethernet address 4c:cc:6a:0b:ee:1a
re0: using 256 tx descriptors
rgephy0 at re0 phy 7: RTL8251 1000BASE-T media interface, rev. 0

NetBSD 9.0 (12-June-2020):
[     1.004517] wm0 at pci3 dev 0 function 0: Intel PRO/1000 PT Quad Port Server Adapter (rev. 0x06)
[     1.004517] wm0: interrupting at ioapic1 pin 3
[     1.004517] wm0: PCI-Express bus
[     1.004517] wm0: 4096 words (16 address bits) SPI EEPROM, version 5.10.2, Image Unique ID 0000ffff
[     1.004517] wm0: ASPM L1 is disabled to workaround the errata.
[     1.004517] wm0: Ethernet address 00:15:17:73:0d:15
[     1.004517] wm0: 0x24440<SPI,IOH_VALID,PCIE,ASF_FIRM>
[     1.004517] igphy0 at wm0 phy 1: Intel IGP01E1000 Gigabit PHY, rev. 0
...

NetBSD 9.1 (24-April-2021):
[     1.008819] re0 at pci6 dev 0 function 0: RealTek 8168/8111 PCIe Gigabit Ethernet (rev. 0x0c)
[     1.008819] re0: interrupting at msix2 vec 0
[     1.008819] re0: Ethernet address e0:d5:5e:48:2c:58
[     1.008819] re0: using 256 tx descriptors
[     1.008819] rgephy0 at re0 phy 7: RTL8251 1000BASE-T media interface, rev. 0

NetBSD 9.99.93 (26-February-2022):
[     1.048567] re0 at pci1 dev 0 function 0: RealTek 8168/8111 PCIe Gigabit Ethernet (rev. 0x06)
[     1.048567] re0: interrupting at msix1 vec 0
[     1.048567] re0: RTL8168E/8111E-VL (0x2c80)
[     1.048567] re0: Ethernet address 00:e0:4c:68:11:2f
[     1.048567] re0: using 256 tx descriptors
[     1.048567] rgephy0 at re0 phy 7: RTL8211E 1000BASE-T media interface


> If you want to test just interfaces being promiscuous (and inevitable
> side effects thereof, such as receiving traffic it normally might
> not), then set up a bridge(4) instance and add the relevant interface
> to it.  That will (at least in my experience) run the interface
> promiscuous, but shouldn't do much else.  Or start tcpdump with an
> unlikely expression, such as "ether host 00:11:22:33:44:55", and no -p.

Good idea. I can't do this on machines that aren't physically local, but I will try my best to replicate this, both with and without darkstat.
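
For reference, the two tests suggested above might look roughly like the sketch below. It assumes the interface is re0 and that bridge0 is not already in use (both are placeholders, not taken from any of the machines listed). Since the whole point is that these commands may hang the box, the script only prints what it would run unless DOIT=1 is set, and it would need to be run as root on a machine with console access.

```shell
#!/bin/sh
# Sketch only: IF and BRIDGE are assumed names; adjust for the machine under test.
IF="${IF:-re0}"
BRIDGE="${BRIDGE:-bridge0}"
# Dry run by default; set DOIT=1 (as root) to actually execute the commands.
DOIT="${DOIT:-0}"

# Print the command, and run it only when DOIT=1.
run() {
    echo "+ $*"
    if [ "$DOIT" = 1 ]; then "$@"; fi
}

# Test 1: a bridge(4) with a single member, which should leave $IF promiscuous.
bridge_on() {
    run ifconfig "$BRIDGE" create
    run brconfig "$BRIDGE" add "$IF" up
}

# Removing the member should take $IF back out of promiscuous mode.
bridge_off() {
    run brconfig "$BRIDGE" delete "$IF" down
    run ifconfig "$BRIDGE" destroy
}

# Test 2: tcpdump without -p and with a filter that should match nothing.
# Interrupting it (Ctrl-C) is the moment the interface leaves promiscuous mode.
sniff() {
    run tcpdump -n -i "$IF" ether host 00:11:22:33:44:55
}

case "${1:-}" in
    bridge-on)  bridge_on ;;
    bridge-off) bridge_off ;;
    sniff)      sniff ;;
esac
```

Saved under a hypothetical name such as promisc-test.sh, "DOIT=1 sh promisc-test.sh bridge-on", followed some hours later by "bridge-off", would exercise only the promiscuous transition; "ifconfig re0" should show PROMISC in the flags while the bridge is up.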

Thanks,
John
