Subject: discovering underlying drive names
To: None <tech-kern@NetBSD.org>
From: Robert Elz <kre@munnari.OZ.AU>
List: tech-kern
Date: 02/14/2006 22:10:27

There has been talk, at various stages, of a method that should
be developed to allow the actual drives (spindles) that underlie
a filesystem (device upon which a filesystem may reside) to
be located.

I've been wanting this for ages - my typical use of a pair of
large(ish) drives is to take approx 2/3 of each, and make that
a raid1 (ie: mirror), and take the other 1/3 of each, and make
that a raid0 (ie: stripe).

Then I put "junk" filesystems on the raid0 (/usr/obj, ...) and
"real" filesystems (/home, ...) on the raid1.

All this stuff works just fine - except, fsck sees filesystems on
raid0? and raid1?, decides they are different physical drives, and
proceeds to run parallel checks on those two devices, both
of which then hammer the underlying real drives.

This was all fine back in the days when one filesystem meant one
physical drive, but those days are long gone.

I have neither the time nor the immediate knowledge of all of the
filesystem/device/vnode/... interfaces to do this properly, but
below is a partial implementation of what I believe is needed
to add the necessary mechanism.

I offer this in the hope that others who know other parts of the
system will add the missing pieces.   If the system parts get finished,
I'll take the preen code in fsck and give it a major overhaul,
to actually use all of this stuff.

The ioctl() design (the arg usage in particular) may seem a little
odd - it was done this way in order to make stacked devices
(cgd, raidframe, ...) easy to implement.

The result will be a buffer (initially passed to the ioctl) that contains
a sequence of null terminated strings (one after the other), with the
string pointer left pointing at the next byte that would be written
(one past the last \0 that was added).    The length field initially
contains the buffer length; when all is done it contains the number of
bytes remaining in the buffer (if positive) or, if negative, the number
of additional bytes that would have been needed in the buffer to hold
everything that needs to be put there.
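
To make the bookkeeping concrete, here is a small userland sketch of
the append step each driver performs.  It is not part of the patch:
struct spindle_buf and spindle_append() are hypothetical names that
only mirror dk_spindleinfo, and memcpy() stands in for the kernel's
copyout.  The point is that the length is charged whether or not the
name fits, which is what makes the final count go negative by the
shortfall.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userland mirror of struct dk_spindleinfo from the patch. */
struct spindle_buf {
	int32_t	 remaining;	/* bytes left, or minus the shortfall */
	char	*next;		/* next byte to write */
};

/*
 * Append one NUL-terminated name.  The name is copied only if it fits;
 * the length is charged either way, so "remaining" ends up negative by
 * exactly the number of bytes the buffer was short.
 */
static void
spindle_append(struct spindle_buf *sb, const char *name)
{
	size_t used;

	assert(sb != NULL && name != NULL);
	used = strlen(name) + 1;	/* include the terminating NUL */

	if (sb->remaining > 0 && (size_t)sb->remaining >= used) {
		memcpy(sb->next, name, used);
		sb->next += used;
	}
	sb->remaining -= (int32_t)used;
}
```

A caller that gets back a negative count could presumably retry with a
buffer enlarged by that many bytes.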

There is no attempt to optimise the result - if a device appears twice,
then it simply appears twice (the caller needs to look after that kind
of thing) - that is, if you have two raid1's each using wd0 and wd1
(for some exotic reason) and a ccd joining the two raid devices, the
result should be something like
	ccd0 raid0 wd0 wd1 raid1 wd0 wd1
(with nulls instead of spaces as separators).
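
Since duplicates are left for the caller to deal with, something like
the following might do on the userland side.  This is only a sketch:
spindle_unique() is a hypothetical helper, not anything in the patch,
and it assumes the last name in the buffer is NUL terminated within
the given length.  It does not try to tell leaf drives from the
stacked devices above them; it just records the first occurrence of
each distinct name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Walk "len" bytes of NUL-separated names starting at "buf" and store a
 * pointer to the first occurrence of each distinct name in "out".
 * Returns the number of distinct names stored (at most "max").
 */
static size_t
spindle_unique(const char *buf, size_t len, const char **out, size_t max)
{
	const char *p, *end;
	size_t n = 0, i;

	assert(buf != NULL && out != NULL);
	p = buf;
	end = buf + len;

	while (p < end && n < max) {
		/* Linear search is fine for the handful of names expected. */
		for (i = 0; i < n; i++)
			if (strcmp(out[i], p) == 0)
				break;
		if (i == n)
			out[n++] = p;
		p += strlen(p) + 1;	/* step over the name and its NUL */
	}
	return n;
}
```

Fed the ccd0 example above, this would leave five names: ccd0, raid0,
wd0, wd1 and raid1.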

The code below is woefully incomplete - I have no idea how to find the
component devices of a raidframe, so that part is all missing.  I also
have no idea how to find the device that (eventually) backs the space
under a vnd, so that part is missing as well.  ccd should be more or less
OK however, as should wd and sd (other devices I didn't bother with,
that's just more copies of the same basic code).

Anyway, consider this a challenge - someone fill in one of the missing
pieces, then someone else add another...

The diff below is against today's current.  It includes no man page
diff, as I couldn't find an appropriate man page to update.

kre

ps: I don't care if the end result looks anything like what is below,
I just want something to provide this information.

And, truth in advertising - none of this has even seen a compiler yet.

Index: dev/ccd.c
===================================================================
RCS file: /cvsroot/NetBSD/src/sys/dev/ccd.c,v
retrieving revision 1.107
diff -u -r1.107 ccd.c
--- dev/ccd.c	11 Dec 2005 12:20:53 -0000	1.107
+++ dev/ccd.c	14 Feb 2006 14:17:01 -0000
@@ -1297,6 +1297,39 @@
 		break;
 #endif
 
+	case DIOCGSPINDLES:
+	    {
+		struct dk_spindleinfo *si = (void *)data;
+		ssize_t len;
+		size_t used;
+
+		error = 0;
+		len = si->si_remaining;
+		used = strlen(cs->sc_xname) + 1;
+		if (len > 0 && len >= used) {
+			error = ioctl_copyout(flag, &cs->sc_xname,
+			    si->si_spindles, used);
+			if (error != 0)
+				break;
+			si->si_spindles += used;
+		}
+		si->si_remaining -= used;
+
+		/*
+		 * Now add the spindle name for all the components.
+		 * Return the first error we encounter (should be none).
+		 */
+		uc = (p != NULL) ? p->p_ucred : NOCRED;
+		for (i = 0; i < cs->sc_nccdisks; i++) {
+			j = VOP_IOCTL(cs->sc_cinfo[i].ci_vp, cmd, data,
+				      flag, uc, l);
+			if (error == 0)
+				error = j;
+		}
+
+		break;
+	    }
+
 	default:
 		error = ENOTTY;
 	}
Index: dev/vnd.c
===================================================================
RCS file: /cvsroot/NetBSD/src/sys/dev/vnd.c,v
retrieving revision 1.140
diff -u -r1.140 vnd.c
--- dev/vnd.c	4 Feb 2006 13:40:38 -0000	1.140
+++ dev/vnd.c	14 Feb 2006 14:33:19 -0000
@@ -1277,6 +1277,40 @@
 		break;
 #endif
 
+	case DIOCGSPINDLES:
+	    {
+		struct dk_spindleinfo *si = (void *)data;
+		ssize_t len;
+		size_t used;
+
+		error = 0;
+		len = (ssize_t)si->si_remaining;
+		used = strlen(vnd->sc_dev.dv_xname) + 1;
+		if (len > 0 && len >= used) {
+			error = ioctl_copyout(flag, &vnd->sc_dev.dv_xname,
+			    si->si_spindles, used);
+			if (error != 0)
+				break;
+			si->si_spindles += used;
+		}
+		si->si_remaining -= used;
+
+		/*
+		 * here need to find the device that holds the file
+		 * that is backing the vnd device.   Then we need to
+		 * call its ioctl (via VOP_IOCTL()) passing through
+		 * the (updated) parameters to vndioctl().
+		 *
+		 * For any NFS mount, just copyout "nfs"; for tmpfs
+		 * or mfs, call VOP_IOCTL() for  all devices with active
+		 * swap partitions (recursing for swap on file)
+		 * for other filesystems, recurse until we hit a
+		 * device (md is OK), or nfs.
+		 */
+		/* XXX Code Missing XXX */
+		break;
+	    }
+
 	default:
 		return (ENOTTY);
 	}
Index: dev/ata/wd.c
===================================================================
RCS file: /cvsroot/NetBSD/src/sys/dev/ata/wd.c,v
retrieving revision 1.318
diff -u -r1.318 wd.c
--- dev/ata/wd.c	15 Jan 2006 19:51:06 -0000	1.318
+++ dev/ata/wd.c	14 Feb 2006 14:27:35 -0000
@@ -1480,6 +1480,26 @@
 		return 0;
 	    }
 
+	case DIOCGSPINDLES:
+	    {
+		struct dk_spindleinfo *si = (void *)addr;
+		ssize_t len;
+		size_t used;
+
+		len = (ssize_t)si->si_remaining;
+		used = strlen(wd->sc_dev.dv_xname) + 1;
+		if (len > 0 && len >= used) {
+			error = ioctl_copyout(flag, &wd->sc_dev.dv_xname,
+			    si->si_spindles, used);
+			if (error != 0)
+				return error;
+			si->si_spindles += used;
+		}
+		si->si_remaining -= used;
+
+		return 0;
+	    }
+
 	default:
 		return ENOTTY;
 	}
Index: dev/raidframe/rf_netbsdkintf.c
===================================================================
RCS file: /cvsroot/NetBSD/src/sys/dev/raidframe/rf_netbsdkintf.c,v
retrieving revision 1.199
diff -u -r1.199 rf_netbsdkintf.c
--- dev/raidframe/rf_netbsdkintf.c	8 Jan 2006 22:26:30 -0000	1.199
+++ dev/raidframe/rf_netbsdkintf.c	14 Feb 2006 14:28:46 -0000
@@ -1581,6 +1581,41 @@
 		break;
 #endif
 
+	case DIOCGSPINDLES:
+	    {
+		struct dk_spindleinfo *si = (void *)addr;
+		ssize_t len;
+		size_t used;
+
+		retcode = 0;
+		len = (ssize_t)si->si_remaining;
+		used = strlen(rs->sc_xname) + 1;
+		if (len > 0 && len >= used) {
+			retcode = ioctl_copyout(flag, &rs->sc_xname,
+			    si->si_spindles, used);
+			if (retcode != 0)
+				break;
+			si->si_spindles += used;
+		}
+		si->si_remaining -= used;
+
+		/*
+		 * Now add the spindle name for all the components
+		 * return the first error we encounter (should be none).
+		 */
+		for (column = 0; column < raidPtr->numCol; column++) {
+			/*
+			 * do something to call VOP_IOCTL()
+			 * for each component drive, simply passing
+			 * parameters from this ioctl through.
+			 * (component order unimportant, no extra
+			 * ioctl related housekeeping needed)
+			 */
+			/* XXX XXX XXX */
+		}
+		break;
+	    }
+
 	default:
 		retcode = ENOTTY;
 	}
Index: dev/scsipi/sd.c
===================================================================
RCS file: /cvsroot/NetBSD/src/sys/dev/scsipi/sd.c,v
retrieving revision 1.244
diff -u -r1.244 sd.c
--- dev/scsipi/sd.c	11 Dec 2005 12:23:50 -0000	1.244
+++ dev/scsipi/sd.c	14 Feb 2006 14:24:17 -0000
@@ -1022,6 +1022,7 @@
 		case OSCIOCIDENTIFY:
 		case SCIOCCOMMAND:
 		case SCIOCDEBUG:
+		case DIOCGSPINDLES:
 			if (part == RAW_PART)
 				break;
 		/* FALLTHROUGH */
@@ -1233,6 +1234,26 @@
 		return (dkwedge_list(&sd->sc_dk, dkwl, l));
 	    }
 
+	case DIOCGSPINDLES:
+	    {
+		struct dk_spindleinfo *si = (void *)addr;
+		ssize_t len;
+		size_t used;
+
+		len = (ssize_t)si->si_remaining;
+		used = strlen(sd->sc_dev.dv_xname) + 1;
+		if (len > 0 && len >= used) {
+			error = ioctl_copyout(flag, &sd->sc_dev.dv_xname,
+			    si->si_spindles, used);
+			if (error != 0)
+				return error;
+			si->si_spindles += used;
+		}
+		si->si_remaining -= used;
+
+		return 0;
+	    }
+
 	default:
 		if (part != RAW_PART)
 			return (ENOTTY);
Index: sys/dkio.h
===================================================================
RCS file: /cvsroot/NetBSD/src/sys/sys/dkio.h,v
retrieving revision 1.12
diff -u -r1.12 dkio.h
--- sys/dkio.h	26 Dec 2005 10:36:47 -0000	1.12
+++ sys/dkio.h	14 Feb 2006 14:35:14 -0000
@@ -98,4 +98,11 @@
 #define	DIOCGSTRATEGY	_IOR('d', 125, struct disk_strategy)
 #define	DIOCSSTRATEGY	_IOW('d', 126, struct disk_strategy)
 
+struct dk_spindleinfo {
+	int32_t		si_remaining;	/* bytes available in spindles */
+	caddr_t		si_spindles;	/* next byte to write */
+};
+
+#define	DIOCGSPINDLES	_IOWR('d', 127, struct dk_spindleinfo)	/* spindles */
+
 #endif /* _SYS_DKIO_H_ */