Subject: [wrstuden@netbsd.org: Re: DEV_BSIZE]
To: None <tech-kern@netbsd.org>
From: Reinoud Zandijk <reinoud@netbsd.org>
List: tech-kern
Date: 09/08/2005 03:16:36

forwarding discussion

----- Forwarded message from Bill Studenmund <wrstuden@netbsd.org> -----

Date: Wed, 7 Sep 2005 16:20:14 -0700
From: Bill Studenmund <wrstuden@netbsd.org>
To: Reinoud Zandijk <reinoud@netbsd.org>
Cc: Bill Studenmund <wrstuden@netbsd.org>
Subject: Re: DEV_BSIZE
In-Reply-To: <20050907221559.GB20409@netbsd.org>
User-Agent: Mutt/1.4.1i

On Wed, Sep 07, 2005 at 03:15:59PM -0700, Bill Studenmund wrote:
> On Wed, Sep 07, 2005 at 11:05:00PM +0200, Reinoud Zandijk wrote:
> > Any ideas how to fix? It would be a pity if I would have to lie in the
> > disklabel about sector sizes. Could it be in "sys/dev/scsipi/cd.c"? Its
> > code is so ugly that maybe I need to reimplement its block-interface.
> > But would that solve the problem?
>
> Well, what exactly is the problem you want to fix?
>
> To be honest, I don't see a reason we need to get rid of DEV_BSIZE, if we
> permit working with disks that are multiples of its size. And that's kinda
> why the work has stalled. :-) That and the fact with UBC we will face all
> sorts of pain if we ever use a disk whose block size doesn't go into a
> VM page evenly.

I'm sorry if that was a bit gruff, but the main thing I am wondering is
why do you want to get rid of it. We wanted to get rid of it before
because every device had to pretend it had DEV_BSIZE blocks. We have since
softened that, and we should be able to deal with non-512-byte block
devices with a normal kernel. I think. Also, we now transfer most of the
data to and from disk via UVM (due to UBC), so things being VM-page
aligned is a bigger deal than it was for 1.3/1.4 when we last dived into
this hard.

Also, we have the problem that the disklabel is not only the kernel's idea
about how the disk is laid out but it is also the label on the disk
describing how the disk is laid out. So lying in struct disklabel is not
at all acceptable. I think the best direction for us to go is to solve
whatever problems we see here as we move towards wedges. They are new and
touch on most all of the issues in this area, so let's make them right
from the beginning.

Take care,

Bill



----- End forwarded message -----
----- Forwarded message from Bill Studenmund <wrstuden@netbsd.org> -----

Date: Wed, 7 Sep 2005 18:03:39 -0700
From: Bill Studenmund <wrstuden@netbsd.org>
To: Reinoud Zandijk <reinoud@netbsd.org>
Cc: Bill Studenmund <wrstuden@netbsd.org>
Subject: Re: DEV_BSIZE
In-Reply-To: <20050908001019.GA23259@rangerover.13thmonkey.org>
User-Agent: Mutt/1.4.1i

On Thu, Sep 08, 2005 at 02:10:19AM +0200, Reinoud Zandijk wrote:
> On Wed, Sep 07, 2005 at 03:15:59PM -0700, Bill Studenmund wrote:
> > On Wed, Sep 07, 2005 at 11:05:00PM +0200, Reinoud Zandijk wrote:
> > I'm confused by your question in the tech-kern note, "Does this mean that
> > archs that do have an MD bounds_check_with_label never have MBR
> > support/access to e.g. msdosfs formatted USB keys? Is this relevant or
> > will msdosfs work fine without an MBR?" What does that line have to do
> > with MBR access?
>
> I meant that the MBR code in "sys/kern/subr_disk_mbr.c" can't be included
> in other architectures since it defines the "bounds_check_with_label()"
> function for i386, amd64 etc. Other architectures need to write their own
> MBR support code too.

Ok. What happens there is that they include a local MBR reader. See
read_dos_label() in sys/arch/macppc/macppc/disksubr.c.

> > DEV_BSIZE is a compile-time constant. Almost all NetBSD installations have
> > it at 512. Other OSs have defaulted to other values, though. NeXTStep
> > (sp?) used 1024, and I worked with some 4.3BSD systems that used 2048.
> >
> > So I guess my main question is why do you want to change it?
>
> I don't want to change it! I want to abolish it! No more references to
> DEV_BSIZE for disc-like devices instead of their own sector sizes. That
> would also greatly reduce complexity.

I doubt that it will _greatly_ reduce complexity. Perhaps it will reduce
confusion, but I think we will run into the same fundamental issues in a
different guise. I'm not at all certain we can't reduce the confusion in
other ways.

> > One of the considerations we have in place in our kernel is that we want a
> > disklabel generated on a system with DEV_BSIZE 1k or 2k to just work on a
> > DEV_BSIZE 512 system. So we will throw in factors of 2x or 4x as needed.
>
> True, that's why I would like to only see references to *sector size units*
> on the specified device and NOT to some compile time variable size units.

Why? You then have to smush the sector size into a lot of other places.

While I agree it has an artistic beauty, I went down that path and found
out it doesn't really make much difference.

All of this was hashed out in three PRs, back in the 1.2 time frame. 3790,
3791, and 3792. Unfortunately the developer who was working on it, Koji
Imada, died in a motorcycle accident before he was able to complete it.

One of the options is what you describe and what I started doing:
dev_bsize is a per-device parameter. The other was that everything is in
DEV_BSIZE (== 512) units but also has to line up with the underlying
device. That's what we've done.

> > I believe the second line, the one with (lp->d_secsize / DEV_BSIZE) in it,
> > is correct. I believe it is correct as an undocumented feature of the
> > block cache is that it uses DEV_BSIZE blocks. lp->d_secpercyl is the number
> > of lp->d_secsize sectors per cylinder, so the (lp->d_secsize / DEV_BSIZE)
> > factor compensates.
>
> eeuuwwww... wouldn't it be better to buffer them on their own sizes? It's
> specified for filing systems to only be presented buffers with multiples of
> their own filing system sector size. I was made aware of that on tech-kern.

No, it would not be "better." Perhaps artistically cleaner, but not better.

One of the main things is that we then have to propagate dev_bsize to each
file system, so that they can correctly translate between fs blocks and
disk blocks. If we leave things as they are, they don't need to, and we
only have to concentrate the changes in the disk drivers.

I personally think it's easier to keep the sector-size mapping code in the
disk drivers. There aren't that many of them commonly used (sd, cd, and wd
are the main ones), and they will all have identical-looking code. If we
push this into file systems, we then have to teach each one of them about
it. And since they are all different, we have to know each one of them and
adjust accordingly.
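
To make the driver-side mapping concrete, here is a minimal C sketch. It is
not code from the tree. It assumes the device sector size is a multiple of
DEV_BSIZE, and the helper names (blocks_per_sector, devblk_to_sector,
sector_to_devblk, xfer_is_sector_aligned) as well as the secsize parameter,
which stands in for lp->d_secsize, are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Compile-time block unit; 512 on almost all NetBSD installations. */
#ifndef DEV_BSIZE
#define DEV_BSIZE 512
#endif

/* Number of DEV_BSIZE blocks in one device sector (hypothetical helper). */
static inline uint32_t
blocks_per_sector(uint32_t secsize)
{
	/* Assumption: secsize is a non-zero multiple of DEV_BSIZE. */
	assert(secsize >= DEV_BSIZE && secsize % DEV_BSIZE == 0);
	return secsize / DEV_BSIZE;
}

/* Map a block number in DEV_BSIZE units to a device sector number. */
static inline int64_t
devblk_to_sector(int64_t blkno, uint32_t secsize)
{
	return blkno / blocks_per_sector(secsize);
}

/* Map a device sector number back to DEV_BSIZE units. */
static inline int64_t
sector_to_devblk(int64_t sector, uint32_t secsize)
{
	return sector * blocks_per_sector(secsize);
}

/*
 * A transfer "lines up with the underlying device" when it starts on a
 * sector boundary and its length is a whole number of sectors.
 */
static inline int
xfer_is_sector_aligned(int64_t blkno, uint32_t len, uint32_t secsize)
{
	return (blkno % blocks_per_sector(secsize)) == 0 &&
	    (len % secsize) == 0;
}
```

With 2048-byte sectors, block 1000 in DEV_BSIZE units maps to sector 250,
which is the 1000 x 512 = 250 x 2048 equivalence Reinoud mentions. The point
of the sketch is only that this arithmetic looks the same in every driver.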

> > > Only now I'm running into the problem that if I `dd' from /dev/cd0a it
> > > stops at the disklabel's num sectors * 512 bytes, whereas if I `dd'
> > > from the raw /dev/cd0d partition it stops at the correct place.
> > >
> > > Any ideas how to fix? It would be a pity if I would have to lie in the
> > > disklabel about sector sizes. Could it be in "sys/dev/scsipi/cd.c"? Its
> > > code is so ugly that maybe I need to reimplement its block-interface.
> > > But would that solve the problem?
> >
> > Well, what exactly is the problem you want to fix?
>
> Well, that it at least would result in the same file :-D i.e. 1000 sectors
> times 512 bytes/sector being equal to 250 sectors of 2048 bytes/sector
> read.

I'm sorry. My original question was what overall problem do you want to
fix, not what consequence of your partial change do you need help with.

> > To be honest, I don't see a reason we need to get rid of DEV_BSIZE, if we
> > permit working with disks that are multiples of its size. And that's kinda
> > why the work has stalled. :-) That and the fact with UBC we will face all
> > sorts of pain if we ever use a disk whose block size doesn't go into a
> > VM page evenly.
>
> That could mean that only *parts* of a sector could be kept in cache
> resulting in lots of RMW actions if they are written back. Now that's quite
> a big hit on performance.
>
> As far as I understand UBC, the whole idea is to always have everything
> memory mapped and demand paged in etc. This means that the buffering is
> most likely to be done in VM page sizes and not in DEV_BSIZE anyway.

Yes. The main thing is that my original changes, which are lying
abandoned in a branch, supported non 2^n block sizes. That was the big
win I saw of doing things in terms of a device's dev_bsize. Turns out
that's going to be a HUGE pain, so it doesn't matter.
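
A small sketch of why sector sizes that don't divide a VM page hurt. This is
illustrative C only, not UBC code; the function names are hypothetical, and
the 4096-byte page and the 2336-byte sector used in the note after the code
are example values picked for the arithmetic, not taken from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* True if n is a power of two (2^n block sizes are the easy case). */
static inline int
is_pow2(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* True if a whole number of sectors fits exactly into one VM page. */
static inline int
sector_divides_page(uint32_t secsize, uint32_t pagesize)
{
	assert(secsize != 0);
	return pagesize % secsize == 0;
}

/*
 * Count the sectors that a write of len bytes at byte offset off covers
 * only partially. Each such sector needs a read-modify-write cycle.
 * The result is 0, 1 or 2.
 */
static inline int
partial_sectors(uint64_t off, uint64_t len, uint32_t secsize)
{
	uint64_t end = off + len;
	int head, tail;

	assert(len > 0 && secsize != 0);
	head = (off % secsize) != 0;	/* write starts mid-sector */
	tail = (end % secsize) != 0;	/* write ends mid-sector */

	/* Head and tail inside the same sector count only once. */
	if (head && tail && off / secsize == (end - 1) / secsize)
		return 1;
	return head + tail;
}
```

Writing back one 4096-byte page at offset 4096 touches no partial sectors
with 2048-byte sectors, but two with 2336-byte sectors, so every page
write-back would pay for read-modify-write at both ends.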

> Communication has to be done in filing system sector size
> (mountpoint->statfs->(f_bsize, f_iosize)) units anyway so filing systems
> ought to deal with the translation to device sector sizes since they are
> the only ones to know how a filing system ought to behave on different
> device sector sizes. You see this e.g. in the patch for MSDOSFS on
> different sector sizes (one of PR# 22529, 17398, 18482, 20934, 2896; esp.
> 17398).

Yes, but "filing system sector size" may not be either DEV_BSIZE or the
disk's block size. Consider taking a file system from a 2k MO drive and
putting it on a disk w/ 1k sectors in a kernel with DEV_BSIZE 512.

While I'd love to clean things up, I think having file systems deal with
DEV_BSIZE is the easiest. I was mostly done with a change that did what
you describe, and I abandoned it; it's not that essential. :-)
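
For the 2k file system on a 1k-sector disk example, the two translation
steps can be sketched as below. This is a hedged illustration, assuming both
sizes are multiples of DEV_BSIZE; struct blk_geom and both function names
are invented for this sketch and do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

#ifndef DEV_BSIZE
#define DEV_BSIZE 512	/* kernel's compile-time block unit */
#endif

/* Hypothetical description of the sizes involved in one mount. */
struct blk_geom {
	uint32_t fs_bsize;	/* file system block size in bytes */
	uint32_t dev_secsize;	/* device sector size in bytes */
};

/* Step 1, in the file system: fs block -> DEV_BSIZE block. */
static inline int64_t
fsblk_to_devblk(const struct blk_geom *g, int64_t fsblk)
{
	assert(g->fs_bsize >= DEV_BSIZE && g->fs_bsize % DEV_BSIZE == 0);
	return fsblk * (g->fs_bsize / DEV_BSIZE);
}

/* Step 2, in the disk driver: DEV_BSIZE block -> device sector. */
static inline int64_t
devblk_to_devsector(const struct blk_geom *g, int64_t devblk)
{
	uint32_t ratio;

	assert(g->dev_secsize >= DEV_BSIZE &&
	    g->dev_secsize % DEV_BSIZE == 0);
	ratio = g->dev_secsize / DEV_BSIZE;
	/* The request must start on a sector boundary. */
	assert(devblk % ratio == 0);
	return devblk / ratio;
}
```

With fs_bsize 2048 and dev_secsize 1024, file system block 10 becomes
DEV_BSIZE block 40 and then device sector 20. The file system only ever
needs to know DEV_BSIZE; only the driver needs the device sector size.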

> P.S. Is it OK if I forward this mail to tech-kern? (not yet done, of course)

Yes. Thank you for asking.

Take care,

Bill



----- End forwarded message -----
