NetBSD-Bugs archive


kern/46464: lib/librumphijack/t_tcpip:ssh test case randomly fails



>Number:         46464
>Category:       kern
>Synopsis:       lib/librumphijack/t_tcpip:ssh test case randomly fails
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri May 18 15:40:00 +0000 2012
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current as of 2012.05.16.19.12.59
>Organization:
>Environment:
System: NetBSD
Architecture: i386
Machine: i386
>Description:

The "ssh" test case of the lib/librumphijack/t_tcpip test has been
randomly failing for a long time, perhaps for its entire existence (it
was committed on 2011.02.14.15.14.00, and the first recorded failure
on my test system was ten days later, at 2011.02.24.18.33.06).

The output from a recent failure can be seen at:

  http://releng.netbsd.org/b5reports/i386/build/2012.05.16.11.45.08/test.html#lib_librumphijack_t_tcpip_ssh

The above log shows the test failing with the error message:

  Timeout, server 127.0.0.1 not responding.

Interestingly, this message is only printed in case of a keep-alive
timeout, and keep-alive timeouts can only happen if ssh is configured
with a non-zero ServerAliveInterval.  In this test, ssh is using a
configuration with the default ServerAliveInterval of zero, so there
should be no way this can happen.

To debug this, I added some printfs to print the arguments and return value of
the select() system call in src/crypto/external/bsd/openssh/dist/clientloop.c,
and found that when the error occurs, select() is returning zero (indicating
a timeout) even though its "timeout" argument is a NULL pointer.  That's
not supposed to happen, is it?  So to me this looks like a bug in select(),
or maybe just in its rump implementation.
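
For reference, here is a minimal sketch of the behaviour one would expect
from select() when the timeout argument is NULL: the call blocks until a
descriptor is ready and returns a positive count (or -1 on error), never
zero.  The function name select_with_null_timeout is hypothetical, and the
sketch runs against the host select(), not the rump implementation used by
the test.

```c
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Hypothetical helper: queue one byte on a pipe, then call select()
 * on the read end with a NULL timeout.  Returns whatever select()
 * returned, or -1 if the pipe could not be set up.
 */
int
select_with_null_timeout(void)
{
	int fds[2];
	fd_set readset;
	int ret;

	if (pipe(fds) == -1)
		return -1;

	/* Make the read end readable before calling select(). */
	if (write(fds[1], "x", 1) != 1) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	FD_ZERO(&readset);
	FD_SET(fds[0], &readset);

	/* NULL timeout: block until ready; 0 is not a valid result here. */
	ret = select(fds[0] + 1, &readset, NULL, NULL, NULL);

	close(fds[0]);
	close(fds[1]);
	return ret;
}
```

With one readable descriptor in the set, the expected return value is 1.
A return value of 0 from such a call would correspond to the behaviour
reported above.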

My patch adding the printfs is at

  http://www.gson.org/netbsd/bugs/atf-ssh-test/ssh-select-debug.patch

and here is an excerpt from a /tmp/select.log file written by the patched
ssh, showing select() returning 0 even though the timeout argument is
NULL:

  select nfds=129 tvp=0x0
    read 5
    read 128
  select ret = 0
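
The kind of instrumentation that produces a log like the one above can be
sketched as follows.  This is not the code from the patch linked earlier;
the helper name logged_select and its signature are assumptions made for
illustration, and only the read set is logged.  The textual form of a NULL
pointer printed by %p is implementation-defined, so it may not read "0x0"
on every system.

```c
#include <stdio.h>
#include <sys/select.h>

/*
 * Hypothetical wrapper around select(): log the arguments, the
 * descriptors in the read set, and the return value.
 */
int
logged_select(FILE *log, int nfds, fd_set *readset, struct timeval *tvp)
{
	int fd, ret;

	fprintf(log, "select nfds=%d tvp=%p\n", nfds, (void *)tvp);

	/* List every descriptor the caller is waiting to read from. */
	for (fd = 0; fd < nfds; fd++) {
		if (FD_ISSET(fd, readset))
			fprintf(log, "  read %d\n", fd);
	}

	ret = select(nfds, readset, NULL, NULL, tvp);
	fprintf(log, "select ret = %d\n", ret);

	/* Flag the case described in this report. */
	if (ret == 0 && tvp == NULL)
		fprintf(log, "unexpected: timeout reported with NULL tvp\n");

	return ret;
}
```

Calling the wrapper in place of select() leaves the caller's behaviour
unchanged while recording each call and its result.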

>How-To-Repeat:

  cd /usr/tests/lib/librumphijack/
  while atf-run t_tcpip; do true; done

The loop stops at the first failure, which may be in a test case other
than the ssh one; if so, restart the loop.

>Fix:


