current-users: bug with securelevel >= 1 vs. read-only mounted "disks"

Subject: bug with securelevel >= 1 vs. read-only mounted "disks"
To: NetBSD-current Discussion List <current-users@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: current-users
Date: 03/16/2001 00:03:44

Today I spent most of the day recovering my router's root filesystem
after accidentally writing a floppy image over the beginning of it.  I
had needed to make a floppy this morning and tried making it on my
router, but I was confused by a number of syslog messages spewing on the
screen as I was typing and managed to hit return before I reviewed my
command line....  :-(

(And the worst part was I'd just finished rebuilding it almost from
scratch earlier this week after the motherboard had failed during an
attempt to upgrade it!)

At first I blamed myself for not setting the securelevel up high enough
to prevent such stupid mistakes.  Now that I've got the machine up and
running (mostly) fine again I've discovered that I did indeed have the
securelevel set to one!

If I understand this chunk of code from miscfs/specfs/spec_vnops.c
correctly, only opens of the actual block device node corresponding to
the character device currently being opened will be blocked, but other
potentially overlapping partitions are still left writable (or even
readable in the case of securelevel>=2).

        case VCHR:
                if ((u_int)maj >= nchrdev)
                        return (ENXIO);
                if (ap->a_cred != FSCRED && (ap->a_mode & FWRITE)) {
                        /*
                         * When running in very secure mode, do not allow
                         * opens for writing of any disk character devices.
                         */
                        if (securelevel >= 2 && cdevsw[maj].d_type == D_DISK)
                                return (EPERM);
                        /*
                         * When running in secure mode, do not allow opens
                         * for writing of /dev/mem, /dev/kmem, or character
                         * devices whose corresponding block devices are
                         * currently mounted.
                         */
                        if (securelevel >= 1) {
                                if ((bdev = chrtoblk(dev)) != (dev_t)NODEV &&
                                    vfinddev(bdev, VBLK, &bvp) &&
                                    bvp->v_usecount > 0 &&
                                    (error = vfs_mountedon(bvp)))
                                        return (error);
                                if (iskmemdev(dev))
                                        return (EPERM);
                        }
                }
                if (cdevsw[maj].d_type == D_TTY)
                        vp->v_flag |= VISTTY;
                VOP_UNLOCK(vp, 0);
                error = (*cdevsw[maj].d_open)(dev, ap->a_mode, S_IFCHR, p);
                vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
                return (error);

The manual [init(8)], however, uses the word "disks", not "devices", in
its discussion of securelevel values:

     1     Secure mode - system immutable and system append-only flags may not
           be turned off; disks for mounted filesystems, /dev/mem, and
           /dev/kmem are read-only.

Seems the manual is wrong, or at least very misleading....

The word "disk" to me implies the entire physical device, not just the
corresponding partition.  So either all device nodes corresponding to
the same physical spindle as any mounted filesystem must be blocked from
being opened, or at minimum all nodes of any partitions that could
overlap mounted partitions should be blocked.  The latter is of course
a more complex check, but both checks would seem to require assistance
from the device driver in one way or another, since that's where the
determination of which minors match any given spindle is made.
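
To make the "overlap" variant of the check concrete, here is a minimal
user-space sketch in C.  The struct and function names are hypothetical
(they are not taken from the kernel or from disklabel.h); the sketch only
assumes that each partition is described by a starting sector and a size
in sectors, as in the disklabel shown further down.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of one disklabel partition entry. */
struct part_extent {
	uint32_t offset;	/* first sector of the partition */
	uint32_t size;		/* length in sectors; 0 means unused */
};

/*
 * Return true if the two partitions share at least one sector.
 * Each partition covers the half-open range [offset, offset + size).
 */
static bool
parts_overlap(const struct part_extent *a, const struct part_extent *b)
{
	/* Widen to 64 bits so offset + size cannot wrap around. */
	uint64_t a_end = (uint64_t)a->offset + a->size;
	uint64_t b_end = (uint64_t)b->offset + b->size;

	if (a->size == 0 || b->size == 0)
		return false;
	return a->offset < b_end && b->offset < a_end;
}
```

With the numbers from my wd0 label, the mounted "a" partition (offset 51,
size 419240) overlaps both "c" and "d" but not "b", so under this policy
a write open of rwd0c or rwd0d would be refused while rwd0b would not.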

Unless I'm mistaken this is a very serious and wide open security hole
(even if it only protects against fat fingers), since every disk has a
fully overlapping partition by default (and i386 systems have two that
would meet most hackers' needs).  The current situation is an almost
totally false sense of security!  (It might be possible to block such
opens, and any attempts to update the partition table, without going to
a securelevel > 1 by removing all overlapping partitions, but that's
just as drastic and undesirable as upping the securelevel!)

I don't know whether this particular aspect of the issue has been
discussed before, but I certainly want to raise it again now after
having been bitten by it!

I'd very much prefer fixing the code rather than documenting the current
behaviour.  The securelevel feature is indeed far too coarse-grained,
but it's helpful none-the-less and I think it's important to make the
basic functionality useful for the most common usages.

I'm not sure just how the changes should be designed though -- it would
seem wrong to put them in vfs_mountedon(), and I'm more inclined to
think a call-back to the drivers would be best....
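
As a rough illustration of what such a call-back might look like, here
is a user-space sketch in C.  Everything in it is hypothetical: the
d_mounted_overlap hook, the toy_ names, and the assumption of 8
partitions per unit are made up for the example and do not exist in the
current cdevsw.  It returns EBUSY, matching the "Device busy" that
installboot reports further down.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_MAXPARTITIONS 8	/* assumption: 8 partitions per disk unit */

/*
 * Hypothetical driver hook: returns nonzero if a write open of this
 * minor could touch a currently mounted filesystem.
 */
typedef int (*mounted_overlap_fn)(int minor_num);

/* Hypothetical stand-in for a cdevsw entry; NULL hook means "not a disk". */
struct toy_cdevsw {
	mounted_overlap_fn d_mounted_overlap;
};

/*
 * Toy disk driver hook using the "whole spindle" policy: every minor
 * on a unit with a mounted filesystem is treated as overlapping.
 * Here we simply pretend that only unit 0 has something mounted.
 */
static int
toy_wd_mounted_overlap(int minor_num)
{
	int unit = minor_num / TOY_MAXPARTITIONS;

	return unit == 0;
}

/*
 * Generic check that the open path could make: 0 if the write open may
 * proceed, EBUSY if the driver says it overlaps something mounted.
 */
static int
toy_check_write_open(int securelevel, const struct toy_cdevsw *sw,
    int minor_num)
{
	if (securelevel < 1 || sw->d_mounted_overlap == NULL)
		return 0;
	return sw->d_mounted_overlap(minor_num) ? EBUSY : 0;
}
```

The point of the split is that the generic code never needs to know how
a driver maps minors onto spindles; a driver could just as well apply
the finer-grained overlap test inside its hook.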

FYI this router of mine is a little i386 (now a Pentium-S 150MHz w/16MB)
machine with a pair of IDE drives (just 245+153MB) and a few Ethernet
cards in it.  It's running a build from my current source tree, which is
now a rather dated version imported from NetBSD-current 2000/09/21.
Examination of more recent code shows no relevant changes that I can
spot though....


Because I've been having trouble making the primary disk boot again
since this episode (I'm booting wd0a:netbsd from floppy right now), I
decided to wipe out sector zero for certain and reset the MBR, etc.,
keeping a transcript to show the problem:

First off, note that we're running with securelevel=1 and the
filesystems are all mounted:

	# sysctl -a | fgrep secur
	kern.securelevel = 1
	# mount          
	/dev/wd0a on / type ffs (local)
	/dev/wd1a on /var type ffs (local)
	mfs:663 on /tmp type mfs (asynchronous, local, nosuid)
	kernfs on /kern type kernfs (local)
	procfs on /proc type procfs (local)

Now let's demonstrate the problem:

	# dd if=/dev/zero of=/dev/rwd0d count=1
	1+0 records in
	1+0 records out
	512 bytes transferred in 1 secs (512 bytes/sec)
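
For anyone who wants to check their own machine without actually
scribbling on a disk, a less destructive probe is simply to attempt the
write open and look at errno.  A minimal sketch in C follows; the
function name probe_write_open is my own invention, and it uses nothing
beyond open(2) and close(2).

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to open `path` for writing without writing anything to it.
 * Returns 0 if the open succeeded, otherwise the errno value from
 * open(2), e.g. EPERM or EBUSY when the kernel refuses the open.
 */
static int
probe_write_open(const char *path)
{
	int fd = open(path, O_WRONLY);

	if (fd == -1)
		return errno;
	(void)close(fd);
	return 0;
}
```

Going by the dd run above, probe_write_open("/dev/rwd0d") would be
expected to return 0 on this machine even at securelevel 1.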

And finally prove that the demonstration "worked":

	# fdisk -i wd0
	fdisk: invalid fdisk partition table found
	NetBSD disklabel disk geometry:
	cylinders: 723 heads: 13 sectors/track: 51 (663 sectors/cylinder)
	
	BIOS disk geometry:
	cylinders: 722 heads: 13 sectors/track: 51 (663 sectors/cylinder)
	
	Do you want to change our idea of what BIOS thinks? [n] n
	
	We haven't written the MBR back to disk yet.  This is your last chance.
	NetBSD disklabel disk geometry:
	cylinders: 723 heads: 13 sectors/track: 51 (663 sectors/cylinder)
	
	BIOS disk geometry:
	cylinders: 722 heads: 13 sectors/track: 51 (663 sectors/cylinder)
	
	Partition table:
	0: <UNUSED>
	1: <UNUSED>
	2: <UNUSED>
	3: <UNUSED>
	Should we write new partition table? [n] y

(and of course even the success of fdisk writing the MBR is another
indication of the danger here!)

After fixing the partition table, yet another demonstration of the
problem is possible with installboot(8), the manual page of which also
claims the raw disk cannot be written to with the "securelevel set to
one if the ``boot'' partition is mounted."

	# cd /usr/mdec
	# ./installboot -v biosboot.sym /dev/rwd0a
	biosboot.sym: entry point 0x805c000
	proto bootblock size 49152
	room for 10 filesystem blocks at 0x578
	renamed //boot -> //boot.bak
	Will load 81 blocks.
	dblk: 160560, num: 16
	dblk: 160576, num: 16
	dblk: 160592, num: 16
	dblk: 160608, num: 16
	dblk: 160624, num: 16
	dblk: 162008, num: 1
	installboot: open raw partition RW: Device busy
	renaming //boot.bak -> //boot

Even that works just fine despite my securelevel == 1....

(I do have a vague memory of not being able to do an installboot on
some machine in multi-user mode once before, though....  Maybe that was
FreeBSD -- their vfs_mountedon() function looks a *lot* different,
though I'm not sure it does anything much different, since they seem to
just have a pointer to the corresponding raw device in every struct
vnode, which of course makes its check very simple.)

Just for interest here's the disklabel:

	# disklabel wd0
	# /dev/rwd0d:
	type: unknown
	disk: router
	label: 
	flags:
	bytes/sector: 512
	sectors/track: 51
	tracks/cylinder: 13
	sectors/cylinder: 663
	cylinders: 723
	total sectors: 479349
	rpm: 3600
	interleave: 1
	trackskew: 0
	cylinderskew: 0
	headswitch: 0           # microseconds
	track-to-track seek: 0  # microseconds
	drivedata: 0 
	
	8 partitions:
	#        size   offset     fstype   [fsize bsize   cpg]
	  a:   419240       51     4.2BSD     1024  8192    16   # (Cyl.    0*- 632*)
	  b:    60058   419291       swap                        # (Cyl.  632*- 722)
	  c:   479298       51     unused        0     0         # (Cyl.    0*- 722)
	  d:   479349        0     unused        0     0         # (Cyl.    0 - 722)


(let me know if I should turn this message into a PR....  I did a text
search for "securelevel" in the PR database but came up effectively
empty handed with only closed entries for unrelated problems)

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>