Subject: kern/3842: Attempting to read past end of ccd may cause process to hang
To: None <gnats-bugs@gnats.netbsd.org>
From: Dave Huang <khym@bga.com>
List: netbsd-bugs
Date: 07/08/1997 17:20:54
>Number:         3842
>Category:       kern
>Synopsis:       Attempting to read past end of ccd may cause process to hang
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    kern-bug-people (Kernel Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Jul  8 15:35:01 1997
>Last-Modified:
>Originator:     Dave Huang
>Organization:
Name: Dave Huang     |   Mammal, mammal / their names are called /
INet: khym@bga.com   |   they raise a paw / the bat, the cat /
FurryMUCK: Dahan     |   dolphin and dog / koala bear and hog -- TMBG
Dahan: Hani G Y+C 21 Y++ L+++ W- C++ T++ A+ E+ S++ V++ F- Q+++ P+ B+ PA+ PL++
>Release:        NetBSD-current as of July 4, 1997
>Environment:
	
System: NetBSD dahan.metonymy.com 1.2G NetBSD 1.2G (SPIFF) #58: Sat Jul 5 05:00:30 CDT 1997 khym@dahan.metonymy.com:/usr/src.local/sys/arch/i386/compile/SPIFF i386


>Description:
I've got two partitions striped together with ccd, sd0e:
#        size   offset    fstype   [fsize bsize   cpg]
  e:  1553718  1091560    4.2BSD     1024  8192    16   # (Cyl. 1082*- 2624*)

and sd1e:
#        size   offset    fstype   [fsize bsize   cpg]
  e:  1553718  1190168    4.2BSD     1024  8192    16   # (Cyl. 1062*- 2449*)

I configured the ccd with the following /etc/ccd.conf:
ccd0		32	2	/dev/sd0e /dev/sd1e

then disklabeled it as:
#        size   offset    fstype   [fsize bsize   cpg]
  d:  3107436        0    4.2BSD     1024  8192    16   # (Cyl.    0 - 1517*)

However, since the partition size wasn't a multiple of the interleave
size, the ccd only had 3107392 sectors (according to ccdconfig -v).
Attempting to read from the partition with "dd if=/dev/rccd0d
of=/dev/null bs=1043968" works until it gets near the end of the
partition, at which point the process hangs with a WCHAN of "physio".
kill -9 won't kill the process, and "shutdown -r now" hangs after
printing "syncing disks...". Breaking into ddb and doing a ps shows
that the dd process is still there, waiting on "physio".
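
To make the failing read concrete, here is a small C sketch of the
block arithmetic. It assumes 512-byte sectors (so bs=1043968 is 2039
sectors); the function names are made up for illustration and are not
taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u /* assumption: 512-byte sectors */

/* 0-based index of the first bs-byte block that is not entirely
 * inside a device of dev_sectors sectors. */
uint64_t first_partial_block(uint64_t dev_sectors, uint64_t bs_bytes)
{
    assert(bs_bytes > 0);
    return (dev_sectors * SECTOR_SIZE) / bs_bytes;
}

/* Sectors by which block `index` runs past the end of the device
 * (0 if the block fits entirely inside it). */
uint64_t overrun_sectors(uint64_t dev_sectors, uint64_t bs_bytes,
                         uint64_t index)
{
    uint64_t dev_bytes = dev_sectors * SECTOR_SIZE;
    uint64_t end_byte = (index + 1) * bs_bytes;

    if (end_byte <= dev_bytes)
        return 0;
    /* round up to whole sectors */
    return (end_byte - dev_bytes + SECTOR_SIZE - 1) / SECTOR_SIZE;
}
```

With the numbers from this report (a 3107392-sector ccd and
bs=1043968), first_partial_block() gives 1523 and overrun_sectors()
gives 44 for that block: the 1524th dd read ends exactly at the
3107436-sector disklabel size, 44 sectors past the real end of the ccd.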

I did a "call cpu_reboot(104, 0)" to get a core dump, ran gdb -k on
it (using a kernel with debugging symbols), looked through the
"allproc" list for the dd process, switched to it with gdb's "proc"
command, and got the following stack trace:

#0  mi_switch () at ../../../../kern/kern_synch.c:615
#1  0xf8119c51 in bpendtsleep ()
#2  0xf8114ee7 in physio (strategy=0xf810b6f0 <ccdstrategy>, bp=0xf8dbc7f8, 
    dev=4611, flags=1048576, minphys=0xf8115118 <minphys>, uio=0xfcc93f20)
    at ../../../../kern/kern_physio.c:190
#3  0xf810bdb8 in ccdread (dev=0, uio=0xfcc93f20, flags=0)
    at ../../../../dev/ccd.c:1021
#4  0xf813c2e9 in spec_read (v=0xfcc93ed8)
    at ../../../../miscfs/specfs/spec_vnops.c:259
#5  0xf819b9f1 in ufsspec_read (v=0x0) at ../../../../ufs/ufs/ufs_vnops.c:1700
#6  0xf813732f in vn_read (fp=0xf883f6c0, uio=0xfcc93f20, cred=0xf87b4300)
    at ../../../../sys/vnode_if.h:269
#7  0xf811e6e3 in sys_read (p=0xf883b200, v=0xfcc93f88, retval=0xfcc93f80)
    at ../../../../kern/sys_generic.c:112
#8  0xf81aeb58 in syscall (frame={tf_es = 31, tf_ds = 31, tf_edi = -138420804, 
      tf_esi = -138420784, tf_ebp = -138420872, tf_ebx = 8452, tf_edx = 0, 
      tf_ecx = 1, tf_eax = 3, tf_trapno = 3, tf_err = 2, tf_eip = 43983, 
      tf_cs = 23, tf_eflags = 647, tf_esp = -138420896, tf_ss = 31, 
      tf_vm86_es = 0, tf_vm86_ds = 0, tf_vm86_fs = 0, tf_vm86_gs = 0})
    at ../../../../arch/i386/i386/trap.c:623

Changing the ccd's interleave size to 127, which is a factor of the
partition size, seems to fix the problem. ccdconfig -v shows the size
to be 3107436 sectors, which agrees with the disklabel, and
everything's happy.
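
The two sizes reported by ccdconfig -v are consistent with each
component being rounded down to a multiple of the interleave before
striping. A sketch of that arithmetic in C follows; it is an
assumption that matches the numbers here, not a copy of the ccd code,
and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Usable sectors of one component after rounding its size down to a
 * multiple of the interleave (assumed behavior, interleave > 0). */
uint64_t usable_component_sectors(uint64_t size, uint64_t ileave)
{
    assert(ileave > 0);
    return size - size % ileave;
}

/* Total size of a stripe built from ncomponents equal-sized parts. */
uint64_t striped_sectors(uint64_t component_size, unsigned ncomponents,
                         uint64_t ileave)
{
    return (uint64_t)ncomponents *
        usable_component_sectors(component_size, ileave);
}
```

striped_sectors(1553718, 2, 32) is 3107392 (22 sectors dropped per
component), while striped_sectors(1553718, 2, 127) is 3107436, since
127 * 12234 = 1553718 and nothing is dropped.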

Anyway, to get to the main point of this PR: I think the code should
return an error when a process tries to read beyond the end of the
ccd instead of hanging, especially since the process is unkillable
and prevents a clean shutdown and reboot. (I waited about 10 minutes
to see if it would time out, and it didn't. Was I too impatient? It
seems bad in general that a reboot might not work, leaving the
machine stuck until someone comes around to hit the reset button.)
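
As a rough illustration of the behavior requested here, the following
C sketch checks a transfer against the device size before any I/O is
started. It is a hypothetical user-space helper, not code from ccd.c
or physio(), and the policy (shorten a request that straddles the end,
return 0 at the exact end of the device, reject a start beyond the end
with EINVAL) is only one possible choice.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical bounds check, all values in sectors.
 * Returns the number of sectors that may be transferred (0 means the
 * request starts exactly at the end of the device), or -EINVAL if the
 * request starts beyond the end.
 */
int64_t clamp_transfer(uint64_t start, uint64_t count, uint64_t dev_size)
{
    assert(dev_size <= INT64_MAX); /* so the result fits in int64_t */

    if (start > dev_size)
        return -EINVAL;            /* entirely past the end: reject */
    if (count > dev_size - start)
        count = dev_size - start;  /* straddles the end: shorten */
    return (int64_t)count;
}
```

For the dd read that hangs in this report, clamp_transfer(3105397,
2039, 3107392) would allow 1995 sectors, and the next read, starting
at sector 3107392, would get 0 instead of sleeping forever.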

This also occurs with the block device. I first noticed it when I
apparently managed to create a file that extended past the end of
the ccd (how did that happen, anyway?). There didn't seem to be any
problems creating the file, but trying to read from it would cause
the process to hang with a WCHAN of "getblk".

>How-To-Repeat:
See above.

>Fix:

>Audit-Trail:
>Unformatted: