NetBSD-Bugs archive
kern/42724: select(2) and poll(2) can return non-error status on bad file descriptors
>Number: 42724
>Category: kern
>Synopsis: select(2) and poll(2) can return non-error status on bad file descriptors
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Feb 03 00:10:00 +0000 2010
>Originator: Ed Ravin
>Release: 5.0.1
>Organization:
PANIX Public Access Networks Corp
>Environment:
NetBSD panix3.panix.com 5.0.1 NetBSD 5.0.1 (PANIX-USER) #0: Thu Nov 5 22:13:39 EST 2009 root%juggler.panix.com@localhost:/devel/netbsd/5.0.1/src/sys/arch/i386/compile/PANIX-USER i386
>Description:
We repeatedly see programs such as emacs, mutt, elm, pine, trn, and nn go into
infinite loops polling for input after the end user has lost their telnet or
ssh session.
Here's a sample ktrace:
19399 1 emacs-21.3 select(0x1, 0x8211000, 0, 0, 0xbf7fe7e8) = 1
19399 1 emacs-21.3 ioctl(0, FIONREAD, 0xbf7d1744) Err#9 EBADF
19399 1 emacs-21.3 getpid() = 19399, 7766
19399 1 emacs-21.3 kill(0x4bc7, 0x1) = 0
19399 1 emacs-21.3 read(0, 0xbf7d1748, 0xfff) = 0
""
19399 1 emacs-21.3 ioctl(0, FIONREAD, 0xbf7d174c) Err#9 EBADF
19399 1 emacs-21.3 getpid() = 19399, 7766
19399 1 emacs-21.3 kill(0x4bc7, 0x1) = 0
19399 1 emacs-21.3 read(0, 0xbf7d1750, 0xfff) = 0
""
19399 1 emacs-21.3 select(0x1, 0x8211000, 0, 0, 0xbf7fe7e8) = 1
19399 1 emacs-21.3 ioctl(0, FIONREAD, 0xbf7d1744) Err#9 EBADF
19399 1 emacs-21.3 getpid() = 19399, 7766
19399 1 emacs-21.3 kill(0x4bc7, 0x1) = 0
19399 1 emacs-21.3 read(0, 0xbf7d1748, 0xfff) = 0
""
19399 1 emacs-21.3 ioctl(0, FIONREAD, 0xbf7d174c) Err#9 EBADF
19399 1 emacs-21.3 getpid() = 19399, 7766
19399 1 emacs-21.3 kill(0x4bc7, 0x1) = 0
19399 1 emacs-21.3 read(0, 0xbf7d1750, 0xfff) = 0
""
And so on ad infinitum. Note that file descriptor #0 has been closed:
# fstat -p 19399
USER CMD PID FD MOUNT INUM MODE SZ|DV R/W
zzz emacs-21.3 19399 wd /net/u 6552785 drwx------ 8192 r
zzz emacs-21.3 19399 0 - - none -
zzz emacs-21.3 19399 1 - - none -
zzz emacs-21.3 19399 2 - - none -
And here's the FD list:
(gdb) x/32 0x8211000
0x8211000: 0x00000001 0x00000000 0x00000000 0x00000000
0x8211010: 0x00000000 0x00000000 0x00000000 0x00000000
0x8211020: 0x1821cc34 0x00000000 0x00000000 0x00000000
0x8211030: 0x00000000 0x00000000 0x00000000 0x00000000
0x8211040: 0x00000001 0x00000000 0x00000000 0x00000000
0x8211050: 0x00000000 0x00000000 0x00000000 0x00000000
0x8211060: 0x00000000 0x00000000 0x00000000 0x00000000
0x8211070: 0x00000000 0x00000000 0x00000000 0x00000000
The version of lsof we have on this box seems to not fully understand the
broken file descriptors:
root@panix2 ~: # lsof-NetBSD-i386-5.0_BETA -p 19399
lsof-NetBSD-i386-5.0_BETA: WARNING: compiled for NetBSD release 5.0_BETA; this is 5.0.1.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
emacs-21. 19399 zzz cwd VDIR 11,3 8192 6552785 /net/u/1/k/zzz/News
emacs-21. 19399 zzz txt VREG 142,0 4561480 638894 /usr/local/bin/emacs-21.3
emacs-21. 19399 zzz txt VREG 142,0 1120316 249256 /lib/libc.so.12.164
emacs-21. 19399 zzz txt VREG 142,0 125014 249277 /lib/libm.so.0.6
emacs-21. 19399 zzz txt VREG 142,0 3790 249279 /lib/libm387.so.0.1
emacs-21. 19399 zzz txt VREG 142,0 12875 249268 /lib/libtermcap.so.0.6
emacs-21. 19399 zzz txt VREG 142,0 11263 636496 /usr/lib/libossaudio.so.0.0
emacs-21. 19399 zzz txt VREG 142,0 65173 635885 /libexec/ld.elf_so
emacs-21. 19399 zzz 0u unknown file system type: 0
emacs-21. 19399 zzz 1u unknown file system type: 0
emacs-21. 19399 zzz 2u unknown file system type: 0
Note that process 19399 has lost its telnetd or sshd and has only a controlling
shell which is parented by init:
# pstree -p 19399
-+= 00000 root [system]
\-+= 00001 root init
\-+= 07766 zzz -tcsh (tcsh-6.13.00)
\--= 19399 zzz emacs (emacs-21.3)
Here's what I believe the scenario to be: when a user gets disconnected
abnormally from an ssh or telnet session, the process should receive a HUP
signal. Perhaps select(2) or poll(2) is sleeping, waiting on input, at the
time, and something goes wrong. Either way, the HUP does not get processed
properly, and the process continues with its select/read loop, assuming that
select will sleep until input arrives.
However, select keeps returning 1 (a success value, not an error), saying that
one FD is ready to read, even though the FD supplied to select(2) was invalid.
The process tries to read, gets zero bytes (that doesn't sound right either;
shouldn't read(2) return EBADF here?), and goes back to select(2) to try again.
Since the process expected select(2) to sleep until I/O was available, and
select(2) is now returning immediately, the process goes into a tight loop and
hogs the CPU.
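To make the loop described above concrete, here is a minimal sketch of one
select/read step that cannot spin. The helper name wait_and_read is
hypothetical and is not taken from emacs or any of the other programs; the
point is only that a zero-byte read is reported as end-of-file so the caller
can exit instead of going back to select(2).

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from any program named in this report).
 * Block until fd is readable, then read once.
 * Returns the byte count (> 0), 0 on end-of-file or hangup,
 * or -1 on error with errno set.
 */
ssize_t wait_and_read(int fd, char *buf, size_t len)
{
    fd_set rfds;
    ssize_t got;
    int n;

    /* fd_set can only hold descriptors below FD_SETSIZE. */
    if (fd < 0 || fd >= FD_SETSIZE) {
        errno = EBADF;
        return -1;
    }

    /* Sleep in select(2); restart if a signal handler interrupted it. */
    do {
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        n = select(fd + 1, &rfds, NULL, NULL, NULL);
    } while (n == -1 && errno == EINTR);
    if (n == -1)
        return -1;      /* e.g. EBADF on a conforming kernel */

    /* select(2) said "ready"; a result of 0 here means EOF, not "retry". */
    do {
        got = read(fd, buf, len);
    } while (got == -1 && errno == EINTR);
    return got;
}
```

A caller would loop while the result is positive and treat both 0 and -1 as a
reason to clean up and exit. With that rule, the trace above would end after
the first read(0, ...) = 0 rather than repeating.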
Although it's clear that emacs in this case has a chance to see something's
wrong (note the ioctl call that returns EBADF), I don't think the app is really
at fault, since as previously stated this happens to multiple applications and
they all exhibit the same symptoms.
We have also seen this with the poll(2) syscall.
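For the poll(2) case, note that poll normally reports an invalid descriptor
through the POLLNVAL bit in revents while still returning a positive count, so
an application has to look at revents, not just the return value. The sketch
below is an illustration under that assumption; the names poll_loop and
enum loop_end are hypothetical.

```c
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical names, for illustration only. */
enum loop_end { LOOP_EOF, LOOP_BAD_FD, LOOP_ERROR };

/*
 * Read from fd until it is exhausted, sleeping in poll(2) between reads.
 * Stores the number of bytes consumed in *total and returns why it stopped.
 */
enum loop_end poll_loop(int fd, size_t *total)
{
    char buf[256];
    struct pollfd pfd;
    ssize_t got;
    int n;

    *total = 0;
    /* poll(2) ignores negative descriptors, so reject them up front. */
    if (fd < 0)
        return LOOP_BAD_FD;

    for (;;) {
        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;

        n = poll(&pfd, 1, -1);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return LOOP_ERROR;
        }
        /* A closed descriptor shows up here, with n > 0. */
        if (pfd.revents & POLLNVAL)
            return LOOP_BAD_FD;

        got = read(fd, buf, sizeof buf);
        if (got > 0) {
            *total += (size_t)got;
            continue;
        }
        if (got == 0)
            return LOOP_EOF;    /* hangup: stop instead of polling again */
        if (errno == EINTR || errno == EAGAIN)
            continue;
        return LOOP_ERROR;
    }
}
```

The loop has two explicit exits for a dead terminal (POLLNVAL and a zero-byte
read), which is the check that would keep a poll-based program from spinning
in the situation described here.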
>How-To-Repeat:
Run a multi-user system with many shell users using interactive programs like
emacs, mutt, elm, pine, trn, and nn.
Wait for some of them to get accidentally disconnected.
Eventually, this will happen. We usually see it once every few days.
>Fix:
Have select(2) return EBADF when it is given an invalid or closed FD in its list.
read(2) should also return EBADF when it is given an invalid or closed FD.
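The requested select(2) behaviour can be expressed as a small probe. This is
only a sketch: the function name select_rejects_closed_fd is hypothetical, and
it covers just the plainly closed case. The descriptors in this report show up
as "none" in fstat and may be in a different state than one closed with
close(2), so passing this probe would not by itself show the problem is fixed.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical probe, for illustration only.
 * Returns 1 if select(2) fails with EBADF on a just-closed descriptor,
 * 0 if it reports anything else, -1 if the probe could not be set up.
 */
int select_rejects_closed_fd(void)
{
    struct timeval zero = { 0, 0 };   /* do not block */
    fd_set rfds;
    int p[2];
    int fd, n;

    if (pipe(p) == -1)
        return -1;

    /* Close both ends; fd now names a descriptor that is no longer open. */
    fd = p[0];
    close(p[1]);
    close(fd);
    if (fd >= FD_SETSIZE)
        return -1;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    errno = 0;
    n = select(fd + 1, &rfds, NULL, NULL, &zero);

    return (n == -1 && errno == EBADF) ? 1 : 0;
}
```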