Subject: Re: Filesystems vs. device sector sizes
To: Pavel Cahyna <pavel@netbsd.org>
From: Bill Stouder-Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 07/26/2007 12:05:53

On Thu, Jul 26, 2007 at 11:36:36AM +0200, Pavel Cahyna wrote:
> On Wed, Jul 25, 2007 at 10:46:25PM -0700, Bill Stouder-Studenmund wrote:
> >
> > If EFS really really needs files smaller than device blocks, you need to
> > use something like vnd. I don't envision us ever hacking the buffer cache
> > to handle sub-device-block entities.
>
> Could the driver (cd) be taught to support 512b requests from upper layers
> by splitting sectors itself? That is, effectively pretend that the device
> block is 512b?
>
> Actually this is apparently already implemented. cd.c contains:
> -----
> 	/*
> 	 * If the disklabel sector size does not match the device
> 	 * sector size we may need to do some extra work.
> 	 */
> 	if (lp->d_secsize != cd->params.blksize) {
>
> 		/*
> 		 * If the xfer is not a multiple of the device block size
> 		 * or it is not block aligned, we need to bounce it.
> -----
>
> But apparently you need to set sector size in the disklabel to 512b
> otherwise cdstrategy will reject such requests.

If the disk isn't labeled to need 512-byte sectors, cdstrategy certainly
should reject such i/o. Given that these discs were created for SGI
systems that had 512b i/o, they probably have 512b-sector disklabels.
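
A rough userland sketch of the bounce idea, for illustration only. It is
not the cd.c code: struct fake_cd, dev_read_blocks() and label_read() are
hypothetical names, and the 2048-byte device block and 512-byte label
sector are assumed values. Requests that are not a multiple of the label
sector size are rejected; requests that do not cover a whole aligned
device block go through a bounce buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEVICE_BLKSIZE 2048u	/* assumed physical block size */
#define LABEL_SECSIZE   512u	/* assumed disklabel sector size */

/* Hypothetical stand-in for the medium: a flat array of device blocks. */
struct fake_cd {
	const uint8_t *media;
	size_t nblocks;		/* number of DEVICE_BLKSIZE blocks */
};

/* Read whole device blocks only, as the hardware would. 0 or -1. */
static int
dev_read_blocks(const struct fake_cd *cd, size_t blk, size_t cnt,
    uint8_t *dst)
{
	if (blk > cd->nblocks || cnt > cd->nblocks - blk)
		return -1;
	memcpy(dst, cd->media + blk * DEVICE_BLKSIZE, cnt * DEVICE_BLKSIZE);
	return 0;
}

/*
 * Read len bytes starting at label sector sec.  Transfers that are
 * not a multiple of the label sector size are rejected outright.
 */
static int
label_read(const struct fake_cd *cd, size_t sec, size_t len, uint8_t *dst)
{
	uint8_t bounce[DEVICE_BLKSIZE];
	size_t off = sec * LABEL_SECSIZE;

	assert(dst != NULL);
	if (len == 0 || len % LABEL_SECSIZE != 0)
		return -1;

	while (len > 0) {
		size_t blk = off / DEVICE_BLKSIZE;
		size_t skip = off % DEVICE_BLKSIZE;
		size_t n = DEVICE_BLKSIZE - skip;

		if (n > len)
			n = len;
		if (skip == 0 && n == DEVICE_BLKSIZE) {
			/* Aligned whole block: read straight into dst. */
			if (dev_read_blocks(cd, blk, 1, dst) != 0)
				return -1;
		} else {
			/* Partial block: read it all, copy the wanted part. */
			if (dev_read_blocks(cd, blk, 1, bounce) != 0)
				return -1;
			memcpy(dst, bounce + skip, n);
		}
		dst += n;
		off += n;
		len -= n;
	}
	return 0;
}
```

The write side is the harder half, since a partial-block write needs a
read-modify-write of the containing device block; that does not arise on
read-only media.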

Otherwise, I think we should use vnd. There are other disc and disk
technologies that use large sectors, so we'll need to solve this problem
more than once. Or fix it in a way that's reusable.

> Maybe it is easier than teaching other parts of the kernel that
> DEV_BSIZE is not a constant.

That's not the problem, though. As of now, DEV_BSIZE just happens to be a
constant that we use to label struct buf offsets and block counts.
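
To make the "label" point concrete, here is a small sketch of turning an
offset counted in DEV_BSIZE units into device blocks. The SK_ names,
struct dev_extent and blkno_to_extent() are hypothetical stand-ins made up
for this example; 512 bytes is assumed for the unit size, and the device
block size is assumed to be a multiple of it.

```c
#include <assert.h>
#include <stdint.h>

#define SK_DEV_BSHIFT 9				/* assumed: 512-byte units */
#define SK_DEV_BSIZE  (1u << SK_DEV_BSHIFT)

/* Hypothetical result: which device blocks a request touches. */
struct dev_extent {
	uint64_t first_blk;	/* first device block touched */
	uint64_t nblks;		/* number of device blocks touched */
	uint32_t skip;		/* bytes to skip in the first block */
};

/*
 * blkno is in SK_DEV_BSIZE units, bcount is in bytes, and
 * dev_blksize is the device's real block size in bytes.
 */
static struct dev_extent
blkno_to_extent(uint64_t blkno, uint32_t bcount, uint32_t dev_blksize)
{
	struct dev_extent e;
	uint64_t start, end;

	assert(dev_blksize >= SK_DEV_BSIZE);
	assert(dev_blksize % SK_DEV_BSIZE == 0);

	start = blkno << SK_DEV_BSHIFT;		/* byte offset of request */
	end = start + bcount;			/* one past the last byte */
	e.first_blk = start / dev_blksize;
	e.skip = (uint32_t)(start % dev_blksize);
	e.nblks = (bcount == 0) ? 0 :
	    (end - 1) / dev_blksize - e.first_blk + 1;
	return e;
}
```

For example, a 1024-byte request at block number 3 on a device with
2048-byte blocks starts 1536 bytes into device block 0 and spans two
device blocks.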

The problem is that the file system in question here, EFS, wants to use
i/o transfer sizes that are smaller than the smallest the device will do.
My recollection of the discussions back when Koji was working on this was
that this problem was considered a subcase of the DEV_BSIZE issues. It was
also considered more of a specialized case, and as such would/could/should
have a separate solution. Like vnd.

> What is the "device block" for devices that can perform reads in smaller
> chunks than writes, anyway? The write unit size or read unit size? (iirc
> DVD+RW and RAID 5 arrays are examples of this.)

I'm not sure about DVD+RW, but RAID 5 can write in the same size units it
can read. RAID arrays show up as disk drives, and they support
sector-sized i/o. So you can write just one sector on a RAID 5.

You're right that you don't WANT to do this often (and that writing a
whole stripe at once is MUCH better), but you can do it. :-)
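
As a sketch of why the single-sector case is slow: the textbook RAID 5
small write updates parity as old parity XOR old data XOR new data, which
costs two reads and two writes for one sector. This is the generic
technique, not code from any particular RAID implementation, and
raid5_small_write() is a hypothetical helper working on in-memory buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Overwrite one data sector and patch the parity sector to match.
 * data and parity stand for sectors already read from disk; after
 * the call both would have to be written back.
 */
static void
raid5_small_write(uint8_t *data, uint8_t *parity, const uint8_t *new_data,
    size_t len)
{
	assert(data != NULL && parity != NULL && new_data != NULL);

	for (size_t i = 0; i < len; i++) {
		/* new parity = old parity ^ old data ^ new data */
		parity[i] ^= data[i] ^ new_data[i];
		data[i] = new_data[i];
	}
}
```

A full-stripe write avoids the two reads, because parity can be computed
from the new data alone.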

Take care,

Bill
