kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly fails
>Synopsis: lib/librumphijack/t_tcpip:ssh test case randomly fails
>Arrival-Date: Fri May 18 15:40:00 +0000 2012
>Originator: Andreas Gustafsson
>Release: NetBSD-current as of 2012.05.16.19.12.59
The "ssh" test case of the lib/librumphijack/t_tcpip test has been
randomly failing for a long time, perhaps for its entire existence (it
was committed on 2011.02.14.15.14.00, and the first recorded failure
on my test system was ten days later at 2011.02.24.18.33.06).
The output from a recent failure can be seen at:
The above log shows the test failing with the error message:
Timeout, server 127.0.0.1 not responding.
Interestingly, this message is only printed in case of a keep-alive
timeout, and keep-alive timeouts can only happen if ssh is configured
with a non-zero ServerAliveInterval, but in this test, ssh is using a
configuration with the default ServerAliveInterval of zero,
so there should be no way this can happen.
To debug this, I added some printfs to print the arguments and return value of
the select() system call in src/crypto/external/bsd/openssh/dist/clientloop.c,
and found that when the error occurs, select() is returning zero (indicating
a timeout) even though its "timeout" argument is a NULL pointer. That's
not supposed to happen, is it? So to me this looks like a bug in select(),
or maybe just in its rump implementation.
My patch adding the printfs is at
and here is an excerpt from a /tmp/select.log file written by the patched
ssh, showing select() returning 0 even though the timeout argument is
NULL:
select nfds=129 tvp=0x0
select ret = 0
while atf-run t_tcpip; do true; done
This may fail in a test case other than the ssh one; if so, retry it.