tech-kern archive


Re: vnode_has_large_blocks() (vnd.c rev 1.255)



On Sun, Sep 09, 2018 at 11:49:20AM -0000, Michael van Elst wrote:
> bouyer%antioche.eu.org@localhost (Manuel Bouyer) writes:
> 
> >I strongly disagree. I have 512-byte-sector domUs running for years on a
> >dom0 with 64k blocks/8k fragments. This works! And I'm probably not the
> >only one, as it has been this way since we have had dom0 support.
> >It's very unlikely that other users have requested a 512-byte-fragment FFS
> >for their domUs' backing store.
> 
> Still, it fails when the dom0 has disks with large sectors, so we need
> some kind of check.

I agree with this.
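
For illustration, here is a minimal sketch of the kind of check that could
catch this case. It is not the actual vnode_has_large_blocks() from rev
1.255; the idea of bmapping to the underlying device vnode and querying it
with getdisksize() is my assumption of one way to do it:

#include <sys/param.h>
#include <sys/disk.h>
#include <sys/vnode.h>

/*
 * Sketch only: find the device vnode backing the file via VOP_BMAP,
 * then ask for its sector size. If the device uses sectors larger
 * than the sector size the vnd unit exposes, the sub-fragment
 * b_blkno arithmetic cannot work and we should refuse (or fall
 * back to VOP_READ/VOP_WRITE). Locking is elided.
 */
static bool
backing_device_has_large_sectors(struct vnode *vp, unsigned int vnd_secsize)
{
	struct vnode *devvp;
	daddr_t bn;
	uint64_t numsec;
	unsigned int secsize;

	if (VOP_BMAP(vp, 0, &devvp, &bn, NULL) != 0)
		return true;	/* can't tell: err on the safe side */
	if (getdisksize(devvp, &numsec, &secsize) != 0)
		return true;

	return secsize > vnd_secsize;
}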

> 
> The (traditional) buffer cache definitely cannot handle varying block sizes.
> Fortunately it is not used by FFS, and the UVM pager obviously handles only
> complete pages. Maybe that makes it work accidentally.
> 
> The filesystem code, however, makes sure that filesystem I/O is done in
> multiples of the fragment size, and VOP_BMAP/VOP_STRATEGY were originally
> defined to work on filesystem fragments (and multiples thereof). vnd
> shouldn't do anything else.

Actually I don't think so. VOP_BMAP returns the disk sector number
where the fragment starts, and VOP_STRATEGY does a contiguous I/O of
sectors, with the start and size expressed in disk sectors.
Obviously VOP_BMAP will return the start of a fragment, but then
we can adjust the buffer's b_blkno and b_bcount to cover just the part
of the fragment we want to read (or write). This is what we do in vnd.c.
This is direct I/O, so the buffer cache is not an issue here.
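
As a concrete sketch of that adjustment (modeled on, but not copied from,
vnd.c; read_subfragment() is a hypothetical helper, the caller is assumed
to have initialized bp with B_READ and a data buffer, and the requested
range must not cross a block boundary):

#include <sys/param.h>
#include <sys/buf.h>
#include <sys/vnode.h>

static int
read_subfragment(struct vnode *vp, struct buf *bp, off_t offset,
    size_t len, int bsize)
{
	struct vnode *devvp;
	daddr_t nbn;
	int error, off_in_blk;

	off_in_blk = offset % bsize;	/* byte offset within the block */

	/* VOP_BMAP gives the disk address where the block starts... */
	error = VOP_BMAP(vp, offset / bsize, &devvp, &nbn, NULL);
	if (error)
		return error;

	/*
	 * ...and b_blkno/b_bcount are adjusted to cover just the part
	 * we want. btodb() counts in DEV_BSIZE units, which is exactly
	 * what breaks when the device's real sectors are larger.
	 */
	bp->b_blkno = nbn + btodb(off_in_blk);
	bp->b_bcount = len;
	VOP_STRATEGY(devvp, bp);
	return biowait(bp);
}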

> 
> >The performance penalty of VOP_READ/VOP_WRITE is just unacceptable
> >for a Xen setup.
> 
> I'm doing some benchmarks now (just for vnd, not Xen yet). I can see
> a penalty of about a factor of 2 for linear I/O, much less for random I/O.

On my setup it's between 5 and 10: enough to make the disks 100% busy
with less than 1MB/s of real I/O (the disk itself does about 3MB/s reads
and 5MB/s writes, but it seems to really dislike doing a read followed
by a write to the same sector(s)).

One problem is that handle_with_rdwr() does a 64k read-modify-write
regardless of the size of the original I/O (i.e. it works at the
filesystem block size, not the filesystem fragment size), as sketched below.
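
Concretely, here is my illustration of that pattern (not the actual
handle_with_rdwr() code; rmw_write() and its interface are hypothetical):
a write of a few sectors turns into a full-block read plus a full-block
write.

#include <sys/param.h>
#include <sys/kauth.h>
#include <sys/kmem.h>
#include <sys/proc.h>
#include <sys/systm.h>
#include <sys/vnode.h>

static int
rmw_write(struct vnode *vp, off_t offset, const void *data, size_t len,
    int fs_bsize, kauth_cred_t cred)
{
	char *blk;
	off_t blkoff;
	int error;

	blk = kmem_alloc(fs_bsize, KM_SLEEP);
	blkoff = offset & ~((off_t)fs_bsize - 1);

	/* Read the whole filesystem block (e.g. 64k), ... */
	error = vn_rdwr(UIO_READ, vp, blk, fs_bsize, blkoff,
	    UIO_SYSSPACE, 0, cred, NULL, curlwp);
	if (error == 0) {
		/* ...patch in the caller's few bytes, ... */
		memcpy(blk + (offset - blkoff), data, len);
		/* ...and write the whole block back. */
		error = vn_rdwr(UIO_WRITE, vp, blk, fs_bsize, blkoff,
		    UIO_SYSSPACE, 0, cred, NULL, curlwp);
	}
	kmem_free(blk, fs_bsize);
	return error;
}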

But anyway, even a factor of 2 is bad; at some point we were a better
dom0 than Linux, performance-wise. I don't think that's true any more,
but we should not make it worse.

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 years of experience will always make the difference

