Current-Users archive


Re: Automated report: NetBSD-current/i386 test failure



    Date:        Sun, 20 Sep 2020 04:02:45 +0100
    From:        Roy Marples <roy%marples.name@localhost>
    Message-ID:  <51d2f8dc-d059-5eae-9899-5c91539d1ac0%marples.name@localhost>

  | The test case just needed fixing.

That is not uncommon after changes elsewhere.

  | The ping to an invalid address caused the ARP entry to enter the
  | INCOMPLETE -> WAITDELETE state, and this hung over into the next test,
  | causing this entry to take too long to validly resolve.

Why?   If a failed ARP (or ND) resolution causes any problems at all,
including delays, for a later request which should work (including one
for the same address), then I'd consider the implementation broken
(not the test).
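
To make the scenario concrete, here is a rough sketch (Python, just for
brevity) of the kind of check I mean; the addresses and the timeout are
only placeholders, and it assumes nothing beyond ping(8) with -c:

    import subprocess, time

    BAD_ADDR = "10.0.1.250"   # hypothetical address with nothing behind it
    GOOD_ADDR = "10.0.1.1"    # hypothetical address that is up and reachable

    # Deliberately fail a resolution: ping an address nobody answers for.
    # The subprocess timeout just bounds how long we let it try.
    try:
        subprocess.run(["ping", "-c", "1", BAD_ADDR], timeout=5,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        pass

    # Resolving a different, working, address immediately afterwards
    # should not be slowed down at all by the failure above.
    start = time.monotonic()
    ok = subprocess.run(["ping", "-c", "1", GOOD_ADDR],
                        stdout=subprocess.DEVNULL).returncode == 0
    print("replied:", ok, "elapsed: %.2fs" % (time.monotonic() - start))

If the second ping takes noticeably longer than it would without the
failed one preceding it, that is the kind of problem I'd call an
implementation bug.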

  | The solution is after a deliberate fail

And if it wasn't a deliberate fail?  Perhaps being just a fraction of a
second too quick, and attempting a ping (or ssh, or something) just before
the destination becomes reachable (either because it was down, unconfigured,
or the network link between them wasn't functional), and

  | to remove the ARP entry for the address

if the user doing this isn't root, and cannot just remove ARP entries?
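
Just to illustrate that last point: removing an ARP entry is, as far as
I know, a privileged operation (the delete goes via the routing socket),
so something like the following sketch (Python just for brevity; the
address is a placeholder, and it assumes the stock arp(8) utility) would
simply fail for an ordinary user:

    import subprocess

    STALE_ADDR = "10.0.1.250"   # placeholder for the deliberately-failed address

    # arp -d removes a single table entry; the delete goes via the routing
    # socket, so it is expected to fail for a non-root user.
    result = subprocess.run(["arp", "-d", STALE_ADDR],
                            capture_output=True, text=True)
    if result.returncode != 0:
        print("could not delete entry (not root?):", result.stderr.strip())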

Maybe I'm misunderstanding the actual scenario, but it seems to me
that things aren't working as well now as they were before (the timing
in the qemu tests hasn't changed recently - not since the nvmm version
started being used - but before the arp implementation change, it used
to work reliably).

  | This fixes all the test case fallout from the ARP -> ND merge and has now 
  | survived several test runs.

Yes, I have been watching, and I saw that.

  | The ND cache expiration test which intermittently fails is based on exact 
  | timings.  A future patch will add jitter to NS, which will cause this test
  | to fail more.  Ideas on how to solve it welcome.

Some of the tests make unsupportable assumptions that just happen to
work when initially created.  That one might be one of those, in which
case we need to look and see what assertions can be made about the state
at various times, and make sure that the test only attempts to verify
things that ought to be true - cache expiration is one of the harder ones
to deal with, as generally that just happens whenever the kernel (or
whatever is holding the cache - the kernel here) decides that now would
be a useful time.

Sometimes the right way is not to test whether the entry has gone
from the cache, but whether it is either gone, or in a state where it could
vanish at any time (eg: lifetime has decremented to 0 or whatever).  Not
always possible - some things that should happen just aren't possible to
reliably test in an automated framework like this - some cache entries
just "go away eventually", but the test cannot just wait for "eventually"
to occur, especially since, as is often the case, how long that takes
naturally depends upon other activity, and in the test there is none.
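
As a rough illustration of what I mean (Python just for brevity; the
address, the deadline, and the parsing of ndp(8) output are assumptions,
not what the test currently does), the check could poll and accept either
outcome:

    import subprocess, time

    TARGET = "fd00::2"     # hypothetical neighbour the test resolved earlier
    DEADLINE = 30          # seconds we are willing to poll before giving up

    def cache_line(addr):
        # Return the "ndp -a -n" line for addr, or None if there is no entry.
        # (Matching on the first column is an assumption about the output.)
        out = subprocess.run(["ndp", "-a", "-n"],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            fields = line.split()
            if fields and fields[0] == addr:
                return line
        return None

    end = time.monotonic() + DEADLINE
    while time.monotonic() < end:
        line = cache_line(TARGET)
        # Accept either outcome: the entry is gone, or it is already marked
        # expired and so could vanish whenever the kernel next prunes it.
        if line is None or "expired" in line:
            print("ok:", line or "entry removed")
            break
        time.sleep(1)
    else:
        print("fail: still present and not expired:", cache_line(TARGET))

The test then only asserts something that ought to be true at every
instant once the lifetime has run out, rather than guessing exactly when
the kernel will get around to pruning the entry.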

kre


