kern/42724: select(2) and poll(2) can return non-error status on bad file descriptors

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/42724: select(2) and poll(2) can return non-error status on bad file descriptors
From: eravin%panix.com@localhost
Date: Wed, 3 Feb 2010 00:10:01 +0000 (UTC)

>Number:         42724
>Category:       kern
>Synopsis:       select(2) and poll(2) can return non-error status on bad file descriptors
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Feb 03 00:10:00 +0000 2010
>Originator:     Ed Ravin
>Release:        5.0.1
>Organization:
PANIX Public Access Networks Corp
>Environment:
NetBSD panix3.panix.com 5.0.1 NetBSD 5.0.1 (PANIX-USER) #0: Thu Nov  5 22:13:39 EST 2009  root%juggler.panix.com@localhost:/devel/netbsd/5.0.1/src/sys/arch/i386/compile/PANIX-USER i386
>Description:
We repeatedly see programs such as emacs, mutt, elm, pine, trn, and nn go into
infinite loops polling for input after the end user has lost their telnet or
ssh session.

Here's a sample ktrace:
 19399      1 emacs-21.3 select(0x1, 0x8211000, 0, 0, 0xbf7fe7e8) = 1
 19399      1 emacs-21.3 ioctl(0, FIONREAD, 0xbf7d1744) Err#9 EBADF
 19399      1 emacs-21.3 getpid()                  = 19399, 7766
 19399      1 emacs-21.3 kill(0x4bc7, 0x1)         = 0
 19399      1 emacs-21.3 read(0, 0xbf7d1748, 0xfff) = 0
       ""
 19399      1 emacs-21.3 ioctl(0, FIONREAD, 0xbf7d174c) Err#9 EBADF
 19399      1 emacs-21.3 getpid()                  = 19399, 7766
 19399      1 emacs-21.3 kill(0x4bc7, 0x1)         = 0
 19399      1 emacs-21.3 read(0, 0xbf7d1750, 0xfff) = 0
       ""
 19399      1 emacs-21.3 select(0x1, 0x8211000, 0, 0, 0xbf7fe7e8) = 1
 19399      1 emacs-21.3 ioctl(0, FIONREAD, 0xbf7d1744) Err#9 EBADF
 19399      1 emacs-21.3 getpid()                  = 19399, 7766
 19399      1 emacs-21.3 kill(0x4bc7, 0x1)         = 0
 19399      1 emacs-21.3 read(0, 0xbf7d1748, 0xfff) = 0
       ""
 19399      1 emacs-21.3 ioctl(0, FIONREAD, 0xbf7d174c) Err#9 EBADF
 19399      1 emacs-21.3 getpid()                  = 19399, 7766
 19399      1 emacs-21.3 kill(0x4bc7, 0x1)         = 0
 19399      1 emacs-21.3 read(0, 0xbf7d1750, 0xfff) = 0
       ""


And so on ad infinitum.  Note that file descriptor #0 has been closed:
# fstat -p 19399
USER     CMD          PID   FD MOUNT       INUM MODE         SZ|DV R/W
zzz      emacs-21.3 19399   wd /net/u   6552785 drwx------    8192 r
zzz      emacs-21.3 19399    0 -         -        none    -
zzz      emacs-21.3 19399    1 -         -        none    -
zzz      emacs-21.3 19399    2 -         -        none    -

And here's the read fd_set passed to select(2); bit 0 (FD 0) is set:

(gdb) x/32 0x8211000
0x8211000:      0x00000001      0x00000000      0x00000000      0x00000000
0x8211010:      0x00000000      0x00000000      0x00000000      0x00000000
0x8211020:      0x1821cc34      0x00000000      0x00000000      0x00000000
0x8211030:      0x00000000      0x00000000      0x00000000      0x00000000
0x8211040:      0x00000001      0x00000000      0x00000000      0x00000000
0x8211050:      0x00000000      0x00000000      0x00000000      0x00000000
0x8211060:      0x00000000      0x00000000      0x00000000      0x00000000
0x8211070:      0x00000000      0x00000000      0x00000000      0x00000000

The version of lsof we have on this box does not seem to fully understand the
broken file descriptors:
root@panix2 ~: # lsof-NetBSD-i386-5.0_BETA -p 19399
lsof-NetBSD-i386-5.0_BETA: WARNING: compiled for NetBSD release 5.0_BETA; this is 5.0.1.
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
emacs-21. 19399  zzz  cwd   VDIR   11,3     8192 6552785 /net/u/1/k/zzz/News
emacs-21. 19399  zzz  txt   VREG  142,0  4561480  638894 /usr/local/bin/emacs-21.3
emacs-21. 19399  zzz  txt   VREG  142,0  1120316  249256 /lib/libc.so.12.164
emacs-21. 19399  zzz  txt   VREG  142,0   125014  249277 /lib/libm.so.0.6
emacs-21. 19399  zzz  txt   VREG  142,0     3790  249279 /lib/libm387.so.0.1
emacs-21. 19399  zzz  txt   VREG  142,0    12875  249268 /lib/libtermcap.so.0.6
emacs-21. 19399  zzz  txt   VREG  142,0    11263  636496 /usr/lib/libossaudio.so.0.0
emacs-21. 19399  zzz  txt   VREG  142,0    65173  635885 /libexec/ld.elf_so
emacs-21. 19399  zzz    0u                               unknown file system type: 0
emacs-21. 19399  zzz    1u                               unknown file system type: 0
emacs-21. 19399  zzz    2u                               unknown file system type: 0



Note that process 19399 has lost its telnetd or sshd and has only a controlling 
shell which is parented by init:

#  pstree -p 19399
-+= 00000 root [system]
 \-+= 00001 root init
   \-+= 07766 zzz -tcsh (tcsh-6.13.00)
     \--= 19399 zzz emacs (emacs-21.3)

Here's what I believe the scenario to be: when a user gets disconnected
abnormally from an ssh or telnet session, the process should receive a HUP
signal.  Perhaps select(2) or poll(2) is sleeping, waiting on input, at the
time, and something goes wrong.  In any case the HUP does not get processed
properly, and the process continues with its select/read loop, expecting
select(2) to sleep until input arrives.

However, select(2) keeps returning 1 (a success value, not an error), saying
that one FD is ready to read, even though the FD supplied to select(2) was
invalid.  The process tries to read and gets zero bytes (that doesn't sound
right either; shouldn't read(2) return EBADF here?), then goes back to
select(2) to try again.  Since the process expected select(2) to sleep until
I/O was available, and select(2) is now returning immediately, the process
goes into a tight loop and hogs the CPU.
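
For comparison, here is a minimal sketch of the behaviour I would expect from
select(2) on a descriptor that has simply been closed: a return of -1 with
errno set to EBADF.  The helper names probe_readable() and make_closed_fd()
are made up for this example, and an ordinary close(2) may not put the
descriptor into the same kernel state as the ones shown by fstat above, so
treat this as an illustration of the expected result, not a reproduction of
the bug.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask select(2) whether fd is readable, using a
 * zero timeout so the call never blocks.  Returns select's return value
 * and stores errno in *err when select fails (0 otherwise).
 */
static int
probe_readable(int fd, int *err)
{
	fd_set rfds;
	struct timeval tv = { 0, 0 };
	int n;

	/* FD_SET is undefined for descriptors outside [0, FD_SETSIZE). */
	assert(fd >= 0 && fd < FD_SETSIZE);
	FD_ZERO(&rfds);
	FD_SET(fd, &rfds);

	n = select(fd + 1, &rfds, NULL, NULL, &tv);
	*err = (n == -1) ? errno : 0;
	return n;
}

/*
 * Hypothetical helper: open a pipe, close both ends, and return the
 * number of the (now invalid) read end, or -1 if pipe(2) failed.
 */
static int
make_closed_fd(void)
{
	int p[2];

	if (pipe(p) == -1)
		return -1;
	close(p[0]);
	close(p[1]);
	return p[0];
}
```

Calling probe_readable(make_closed_fd(), &err) should return -1 with err set
to EBADF; a return of 1, as in the ktrace above, is what sends the caller
into the tight loop.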

Although it's clear that emacs in this case has a chance to see something's 
wrong (note the ioctl call that returns EBADF), I don't think the app is really 
at fault, since as previously stated this happens to multiple applications and 
they all exhibit the same symptoms.
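
As an aside, an application can protect itself from this kind of spin by
treating a zero-length read(2) as end-of-file instead of going back to
select(2).  The following is only a sketch of that idea; the function name
wait_for_input() and the input_status enum are made up for illustration and
are not taken from emacs or any of the other programs mentioned.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical result codes for the sketch below. */
enum input_status { INPUT_DATA, INPUT_EOF, INPUT_ERROR };

/*
 * Block in select(2) until fd is readable, then read(2) from it.
 * A zero-length read is reported as INPUT_EOF so the caller can stop
 * looping instead of calling select(2) again.
 */
static enum input_status
wait_for_input(int fd, char *buf, size_t len, ssize_t *nread)
{
	fd_set rfds;

	/* FD_SET is undefined for descriptors outside [0, FD_SETSIZE). */
	assert(fd >= 0 && fd < FD_SETSIZE);
	for (;;) {
		FD_ZERO(&rfds);
		FD_SET(fd, &rfds);
		if (select(fd + 1, &rfds, NULL, NULL, NULL) == -1) {
			if (errno == EINTR)
				continue;	/* interrupted by a signal: retry */
			return INPUT_ERROR;	/* e.g. EBADF */
		}

		*nread = read(fd, buf, len);
		if (*nread > 0)
			return INPUT_DATA;
		if (*nread == 0)
			return INPUT_EOF;	/* peer or terminal is gone */
		if (errno == EINTR || errno == EAGAIN)
			continue;		/* transient: wait again */
		return INPUT_ERROR;
	}
}
```

A caller would normally exit or clean up on INPUT_EOF and INPUT_ERROR.  This
does not excuse the kernel behaviour described here; it only bounds the
damage when it happens.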

We have also seen this with the poll(2) syscall.
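
For poll(2) the expected signal is slightly different: POSIX has poll report
an invalid descriptor by setting POLLNVAL in revents, not by failing with
EBADF.  A minimal sketch of checking for that follows; the helper name
poll_revents() is made up for this example, and as with select(2) a plainly
closed descriptor may not match the state of the descriptors in this report.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: poll a single fd for input with a zero timeout.
 * Returns poll's return value and stores the reported events in *revents.
 */
static int
poll_revents(int fd, short *revents)
{
	struct pollfd pfd;
	int n;

	/* poll(2) silently ignores negative descriptors, so reject them. */
	assert(fd >= 0);
	pfd.fd = fd;
	pfd.events = POLLIN;
	pfd.revents = 0;

	n = poll(&pfd, 1, 0);
	*revents = pfd.revents;
	return n;
}
```

On a closed descriptor this should return 1 with POLLNVAL set in *revents, so
a caller that tests only for POLLIN needs to check POLLNVAL (and POLLHUP) as
well.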



>How-To-Repeat:
Run a multi-user system with many shell users using interactive programs like
emacs, mutt, elm, pine, trn, and nn.

Wait for some of them to get accidentally disconnected.

Eventually, this will happen.  We usually see it once every few days.
>Fix:
Have select(2) return EBADF when it is given an invalid or closed FD in its list.

read(2) should also return EBADF when it is given an invalid or closed FD.
