Subject: Re: Supporting sector size != DEV_BSIZE
To: None <tech-kern@netbsd.org>
From: Trevin Beattie <trevin@xmission.com>
List: tech-kern
Date: 05/31/2002 16:30:16
At 03:04 PM 5/31/2002 -0700, Bill Studenmund wrote:
>On Fri, 31 May 2002, Trevin Beattie wrote:
>
>> I was looking through the bounds_check_with_label function
>> (arch/i386/i386/disksubr.c, which, BTW, is not called for partition 'd'),
>
>Partition 'd' is the whole disk, which isn't checked. That's part of the
>design.

Right.  That was in reference to my previous note to Chuck, when I found
that I could access all sectors on the 'd' partition, but only 1/4 of the
sectors on partitions 'a' and 'c'.  I figured the bounds check must skip
'd' in case the partition table is incorrect.

>> and almost right away noticed a bad inconsistency.  The function starts by
>> converting bytes to sectors:
>>
>> 481:	sz = howmany(bp->b_bcount, lp->d_secsize);
>>
>> It then adds the sector count to the _block number_ and checks the result
>> against the partition size:
>>
>> 483:	if (bp->b_blkno + sz > p->p_size) {
>> 484:		sz = p->p_size - bp->b_blkno;
>>
>> If the request runs past the end of the partition, it truncates the size
>> and converts the number of *DEV_BSIZE blocks* to bytes:
>>
>> 496:		bp->b_bcount = sz << DEV_BSHIFT;
>>
>> Obviously, one or more of the above lines is incorrect and must be changed,
>> because line 496 is inconsistent with line 481.  But I have no idea what
>> the correct change should be, because I can't find any formal definition
>> for the units used by b_blkno.  <sys/buf.h> defines this parameter as the
>> "underlying physical block number"; but is that a physical sector or a
>> DEV_BSIZE block?
>
>b_blkno is in units of DEV_BSIZE.

Then there are no exceptions?  This should help in auditing the file system
code.

>> Can anybody refer me to the design specs for the file system?
>
>You mean the buffer cache? :-)

Not just that, but the block device drivers (e.g. sd.c), ffs, ufs, layered
file systems, I/O-related system calls, etc.  I read Chuck's Usenix paper
on the UBC, and it's very good, but it isn't a design spec.  I'm looking
more for something that encompasses section 9 of the man pages, with
additional detail on the parameters / structures passed to and from each
function, and especially (for solving this particular problem) which
variables have units of physical sectors (which vary between devices),
logical blocks (which vary by file system), and the DEV_BSIZE constant.
And maybe some rationale too, so I know *why* b_blkno is neither the
physical sector nor the file system block number.

I've made some patches to i386/disksubr.c and dev/vnd.c (shown at the end)
which fixed the boundary check problem, and got newfs to work.  Now I'm
looking into why the mount procedure can't find the superblock, and I have
a couple of design-related questions:

Where should the disk label be?  <i386/disklabel.h> defines LABELSECTOR as
1, but when I write the disk label on my 2048-Bps disk, the label is
written at offset 0x200 on sector 0.  In i386/disksubr.c, we have:
	215:	dospartoff = dp->mbrp_start;
and
	247:	bp->b_blkno = dospartoff + LABELSECTOR;
.  In the MBR partition table, all offsets and sizes are in units of
physical sectors.  So the right side of the equation would seem to be the
sector number.  But b_blkno on the left side is supposed to be in units of
DEV_BSIZE, so the assignment is invalid.  I can correct the first part of
the assignment with:
	247:	bp->b_blkno = dospartoff * (lp->d_secsize / DEV_BSIZE)
, but should LABELSECTOR be scaled as well or not?  In other words, should
the disk label be a constant number of sectors from the start of a
partition, or a constant number of bytes from the start?  It can't be both.

Likewise, where should the superblock be?  At a constant
(partition-relative) sector offset or a constant byte offset?

-- Trevin

Here are my patches so far; feel free to comment:

--- /usr/src/sys/arch/i386/i386/disksubr.c	Wed Feb 20 04:10:35 2002
+++ /home/trevin/src/sys/arch/i386/i386/disksubr.c	Fri May 31 12:31:00 2002
@@ -140,6 +140,7 @@
 	struct disklabel *dlp;
 	char *msg = NULL;
 	int dospartoff, cyl, i, *ip;
+	int sectodbs = lp->d_secsize / DEV_BSIZE;
 
 	/* minimal requirements for archtypal disk label */
 	if (lp->d_secsize == 0)
@@ -170,13 +171,13 @@
 
 	/* do dos partitions in the process of getting disklabel? */
 	dospartoff = 0;
-	cyl = LABELSECTOR / lp->d_secpercyl;
+	cyl = LABELSECTOR / (lp->d_secpercyl * sectodbs);
 	if (!osdep)
 		goto nombrpart;
 	dp = osdep->dosparts;
 
 	/* read master boot record */
-	bp->b_blkno = MBR_BBSECTOR;
+	bp->b_blkno = MBR_BBSECTOR * sectodbs;
 	bp->b_bcount = lp->d_secsize;
 	bp->b_flags |= B_READ;
 	bp->b_cylinder = MBR_BBSECTOR / lp->d_secpercyl;
@@ -244,7 +245,7 @@
 
 nombrpart:
 	/* next, dig out disk label */
-	bp->b_blkno = dospartoff + LABELSECTOR;
+	bp->b_blkno = dospartoff * sectodbs + LABELSECTOR;
 	bp->b_cylinder = cyl;
 	bp->b_bcount = lp->d_secsize;
 	bp->b_flags &= ~(B_DONE);
@@ -388,6 +389,7 @@
 	struct buf *bp;
 	struct disklabel *dlp;
 	int error, dospartoff, cyl;
+	int sectodbs = lp->d_secsize / DEV_BSIZE;
 
 	/* get a buffer and initialize it */
 	bp = geteblk((int)lp->d_secsize);
@@ -395,13 +397,13 @@
 
 	/* do dos partitions in the process of getting disklabel? */
 	dospartoff = 0;
-	cyl = LABELSECTOR / lp->d_secpercyl;
+	cyl = LABELSECTOR / (lp->d_secpercyl * sectodbs);
 	if (!osdep)
 		goto nombrpart;
 	dp = osdep->dosparts;
 
 	/* read master boot record */
-	bp->b_blkno = MBR_BBSECTOR;
+	bp->b_blkno = MBR_BBSECTOR * sectodbs;
 	bp->b_bcount = lp->d_secsize;
 	bp->b_flags |= B_READ;
 	bp->b_cylinder = MBR_BBSECTOR / lp->d_secpercyl;
@@ -433,7 +435,7 @@
 #endif
 
 	/* next, dig out disk label */
-	bp->b_blkno = dospartoff + LABELSECTOR;
+	bp->b_blkno = dospartoff * sectodbs + LABELSECTOR;
 	bp->b_cylinder = cyl;
 	bp->b_bcount = lp->d_secsize;
 	bp->b_flags &= ~(B_DONE);
@@ -475,13 +477,14 @@
 	int wlabel;
 {
 	struct partition *p = lp->d_partitions + DISKPART(bp->b_dev);
-	int labelsector = lp->d_partitions[2].p_offset + LABELSECTOR;
+	int sectodbs = lp->d_secsize / DEV_BSIZE;
+	int labelblock = lp->d_partitions[2].p_offset * sectodbs + LABELSECTOR;
 	int sz;
 
-	sz = howmany(bp->b_bcount, lp->d_secsize);
+	sz = howmany(bp->b_bcount, DEV_BSIZE);
 
-	if (bp->b_blkno + sz > p->p_size) {
-		sz = p->p_size - bp->b_blkno;
+	if (bp->b_blkno + sz > p->p_size * sectodbs) {
+		sz = p->p_size * sectodbs - bp->b_blkno;
 		if (sz == 0) {
 			/* If exactly at end of disk, return EOF. */
 			bp->b_resid = bp->b_bcount;
@@ -497,9 +500,9 @@
 	}
 
 	/* Overwriting disk label? */
-	if (bp->b_blkno + p->p_offset <= labelsector &&
+	if (bp->b_blkno + p->p_offset * sectodbs <= labelblock &&
 #if LABELSECTOR != 0
-	    bp->b_blkno + p->p_offset + sz > labelsector &&
+	    bp->b_blkno + p->p_offset * sectodbs + sz > labelblock &&
 #endif
 	    (bp->b_flags & B_READ) == 0 && !wlabel) {
 		bp->b_error = EROFS;
@@ -507,8 +510,8 @@
 	}
 
 	/* calculate cylinder for disksort to order transfers with */
-	bp->b_cylinder = (bp->b_blkno + p->p_offset) /
-	    (lp->d_secsize / DEV_BSIZE) / lp->d_secpercyl;
+	bp->b_cylinder = (bp->b_blkno / sectodbs + p->p_offset)
+		/ lp->d_secpercyl;
 	return (1);
 
 bad: