tech-kern archive


Re: vnode_has_large_blocks() (vnd.c rev 1.255)



On Thu, Sep 06, 2018 at 06:46:12AM -0000, Michael van Elst wrote:
> bouyer%antioche.eu.org@localhost (Manuel Bouyer) writes:
> 
> >thinking about it more, I think this is the wrong approach. For example,
> >if the vnd is for a linux domU, getdisksize() won't return anything
> >useful, but the I/O will probably be 4k-compatible. Even for a NetBSD domU,
> >as the xbd protocol uses 512-byte sectors, the disklabel will always contain
> >512, even if the filesystem is properly aligned and uses 4k fragments.
> 
> The xbd protocol is a different problem. I know that the current code
> always tells the DomU that the disk has 512-byte sectors which means
> that the Dom0 must provide a device that allows 512-byte I/O. It would
> really be better if xbd wasn't lying about the geometry.

But that's not going to change. If you move a virtual machine from a 512b
to a 4k sector disk, you expect the virtual machine to still run.
If you change the virtual disk's sector size, its filesystems will
probably be unusable.
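
To illustrate why: on-disk structures such as disklabel partition offsets
are stored as sector numbers, so the same number points at a different byte
offset once the sector size changes. A minimal sketch of the arithmetic
(sector_to_byte() is a hypothetical helper made up for this example, and
the start sector 2048 is just an illustrative value):

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: convert a sector number
 * recorded on disk into a byte offset for a given sector size.
 */
static uint64_t
sector_to_byte(uint64_t sector, uint32_t secsize)
{
	return sector * (uint64_t)secsize;
}
```

A partition recorded as starting at sector 2048 begins at byte 1048576
with 512-byte sectors, but would be looked up at byte 8388608 if the same
label were interpreted with 4096-byte sectors.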

> 
> >Worse, even if the disklabel contains 4k sectors, the xbd protocol may split
> >an I/O in 2 parts that are not 4k aligned.
> >I think the check should be made at I/O time. It will be cheaper than an
> >unneeded read/modify/write I/O anyway.
> 
> If the underlying device only does 4k I/O then something smaller either fails
> or needs read/modify/write on some layer.

Sure. My point is that it's probably better to decide at run time than
at init time. If the virtual machine is Windows or Linux, the disklabel
won't be usable and we'll default to the slow path.
And even if the disklabel says it's 512b sectors, most I/O will
probably be done in a 4k-compatible way. All my virtual machines have
4k-aligned partitions.
That's also true for the sparse file check: it's bad to always use the
slow path because there's a hole in the file, at a place that may
never be used by the domU. But this one may be harder to deal with at
runtime.
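
A rough sketch of what such a per-request check could look like. This is
not code from vnd.c; io_is_aligned() and its parameters are hypothetical
names, and it assumes the backing store's sector size is a power of two:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-request check, not taken from vnd.c.
 * Returns true when a request can go to the backing store as-is, i.e.
 * both its starting byte offset and its length are multiples of the
 * backing store's sector size. Assumes backing_secsize is a power of two.
 */
static bool
io_is_aligned(uint64_t offset, uint32_t length, uint32_t backing_secsize)
{
	uint64_t mask = (uint64_t)backing_secsize - 1;

	if (length == 0)
		return false;
	return (offset & mask) == 0 && (length & mask) == 0;
}
```

With a 4096-byte backing sector, a 4096-byte request at offset 0 passes,
while the same request at offset 512 (e.g. one half of an I/O split on a
512-byte boundary) does not and would have to take the slow path. The test
is two mask operations per request, so deciding at I/O time should cost
little compared with a read/modify/write cycle.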

On Thu, Sep 06, 2018 at 06:16:43AM -0000, Michael van Elst wrote:
> bouyer%antioche.eu.org@localhost (Manuel Bouyer) writes:
> 
> >> The backing store and the geometry are initialized before vndthread
> >> is started, getdisksize() shouldn't fail and I'm sure it didn't
> >> at that time.
> 
> >AFAIK getdisksize() returns the parameters of the vnd device, not the
> >backing store.
> 
> It's called on sc_vp, that's the vnode opened for the backing store.
> 
> The vnd geometry is read directly from sc_geom which is initialized
> in the VNDIOCSET code before the thread is started.
> 
> So I don't think the check is done too early.

Hmm, in this case sc_geom contains what was set at VNDIOCSET time, doesn't it?
Unless a geometry was provided at vnconfig time, it'll always have 512b
sectors. This is not read from the vnd's disklabel.


> 
> 
> >> But if the I/O request is smaller, it would fail. The vnode_has_large_blocks
> >> validates that this may happen and configures vnd to fall back to the
> >> slower method. This happens when the backing store is e.g. on a 4k/sector
> >> disk and the vnd device simulates a 512byte/sector disk.
> 
> >OK, I thought this would be the filesystem block or fragment size.
> 
> Yes, the I/O size of the backing store in case of FFS is the
> fragment size. The minimum fragment size is the sector size. So on
> the 4k/sector disk the fragments are always 4k or larger.
> 
> Saying this, I'm not sure if the large blocks check isn't too aggressive.
> 
> Obviously it is needed to prevent I/O requests smaller than the sector
> size, and that's why the check was added. I think that anything smaller
> than the fragment size is forbidden too to avoid buffer corruption,
> but is that really true? The backing filesystem will never do I/O smaller
> than the fragment size.

Sure, but that doesn't seem to be a problem. The backing filesystem
is 64k/8k, yet I can use filesystems with smaller fragments in the domUs
without problems. It looks like VOP_BMAP/VOP_STRATEGY deals with it
(actually I think it writes only the relevant physical sectors, even if
that's not a full fragment, because that's how nbp is set up).
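
To put numbers on that, here is a small sketch of the sector arithmetic
only. It is not the vnd.c code path; struct sector_range and
sectors_for_write() are names made up for the example, and 512-byte
physical sectors are an assumption:

```c
#include <stdint.h>

/* Hypothetical type for illustration: a run of physical sectors. */
struct sector_range {
	uint64_t first;	/* first sector touched */
	uint64_t count;	/* number of sectors touched */
};

/*
 * Compute which physical sectors a write of 'length' bytes at byte
 * 'offset' touches, for sectors of 'secsize' bytes.
 */
static struct sector_range
sectors_for_write(uint64_t offset, uint64_t length, uint32_t secsize)
{
	struct sector_range r = { 0, 0 };
	uint64_t last;

	if (length == 0 || secsize == 0)
		return r;
	r.first = offset / secsize;
	last = (offset + length - 1) / secsize;
	r.count = last - r.first + 1;
	return r;
}
```

A 2k write at byte offset 8192 covers sectors 16 to 19, i.e. 4 sectors,
whereas the full 8k fragment containing it would be 16 sectors.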

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 ans d'experience feront toujours la difference
--

