NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
kern/41889: st(4) driver splits larger IO request into peaces - compability problem with other OS's
>Number: 41889
>Category: kern
>Synopsis: st(4) driver splits larger IO request into peaces -
>compability problem with other OS's
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: change-request
>Submitter-Id: net
>Arrival-Date: Fri Aug 14 11:50:00 +0000 2009
>Originator: Wolfgang Stukenbrock
>Release: NetBSD 4.0
>Organization:
Dr. Nagler & Company GmbH
>Environment:
System: NetBSD s012 4.0 NetBSD 4.0 (NSW-S012) #9: Fri Mar 13 12:31:52 CET 2009
wgstuken@s012:/usr/src/sys/arch/amd64/compile/NSW-S012 amd64
Architecture: x86_64
Machine: amd64
>Description:
The current implementation of physio() (in kern_physio.c) determines
the maximum possible transfer size of the underlying HW
be fist clamping to MAXPHYS and then asking the driver for further
restrictions.
This is fine for disk-IO, but leads to unexpected behaviour when
exchangeing tapes with other operating systems.
The "normal" semantic of a write(2) call on a tape device in
variable-block-mode is, that exactly one tape block of that size
is written to the media. (e.g. Solaris does it like this ...)
If the tape device is not able to support that size, an error is
returned.
The current NetBSD implementation does not return an error in "all"
cases - it will write multiple tape-blocks in some cases.
This may break applications on other systems or other NetBSD-kernels if
the MAXPHYS kernel options is used with a value other than the default.
The default value for MAXPHYS is currently equal to MAXBSIZE - both are
64k - for nearly all kernel configurations.
Some HW-driver (e.g. the ata/sata stuff) allow only may 64k transfers
too, but the scsi drivers allows larger blocksizes.
By the way: the 64k-limit for the max tape-blocksize is a very hard
restriction and does not make realy sense.
In order to be able to read tapes written on Solaris with 1MB blocks,
you need to set MAYPHYS to 1024k (and MAXBSIZE to 64k - otherwise you need to
fix some other parts in the kernel ...) in order read the data from the
tape.
OK so far, but the normal NetBSD kernel is not able to read anything
from such a tape - and you get a misleading error message ... ugly ..
You may be also out of luck, if you try to write e.g. 64k+1byte. At
least our VXA-tape requires a blocksize of at least 4 bytes and the splitted
transfer will be only 1 byte -> error ... This is not what I expect
from a variable block size device if the device supports blocks up to 240k!
The next compability problem occures if you have an application that
tries to write e.g. 128k (or variable sized) blocks and the
application knows the number of blocks on the tape. If you try to read
the tape on another OS, you suddently get only 64k blocks but much more ...
For TAPE read and write operations, physio() should be changed in an
manner, that it does not split transfers for tape devices.
The following patch will do this for the read() and write() system call.
It returns EINVAL and print a message on the console.
The idea of printing the messages comes from st.c, where a messages is
printed if the tape does not support the requested block size.
The readv() and writev() calls are affected in the following way:
- each single vector results in a sigle write operation as before
- no single vector is splitted into peaces anymore
I'm realy not shure about the correct behaviour of readv/writev() on
tape devices - I've never used it there ...
From my understanding the iovec structure should "only" allow the user
to split the memory space in user-level of one read() or write() operation into
several peaces. But for tape devices it should result in an operation
that act on exactly on tape-block - at least this is my understanding of it.
This not the case in the current NetBSD implemantion and also not the
case with this patch!
But I'm not shure about the correct behaviour for readv() and writev()
- so this patch only affects the single transfers of it...
The only way to archive this behaviour would be to copy anything into
one block, because some comments in physio() seems to point out some
problems when several user-level memeory peaches resides inside the
same physical page. At least as far as I understand it.
remark: this fix will return an arror after transfering some data if
readv() or writev() is used! If this is not acceptable, the size-check must be
done
prior the main transfer loop.
In this patch the exiting flag B_TAPE is (re)used. It looks a relict of
some days, because it is only defined in the header file and ask for in
specfs-code
at exactly one place. It is not used in the whole NetBSD-sources at
any other location.
The name sounds good, so I took it. Feel free to define an other name
if you prefer this - some bits are sill unused.
If the default value for MAXPHYS should be changed to something more
common for tape devices is a religion question.
There will be a limit in all cases. I would prefere to change it to at
least 256k, but you may have a different point.
remark: the default value of MAXPHYS is not affected by this patch. You
need to set it in your kernel config if you need larger block sizes.
But keep in mind to set MAXBSIZE back to 64k for the filesystem
- otherwise the kernel will not compile ....
>How-To-Repeat:
Write 10 128k-blocks on a tape on NetBSD and try to read 10 blocks on
Solaris. You will only read 640k from the tape.
Try to write a 65537 bytes block on a variable-block tape device with a
mininum block size (e.g. VXA-320) - it will fail with the message,
that the transfer size should be between 4 and 240k bytes. And 65537 is
...
>Fix:
Apply the folloing patch to /usr/src/sys/kern/kern_physio.c and
/usr/src/sys/dev/scsipi/st.c.
read() and write() call will never be splitted into several transfers
anymore.
--- kern_physio.c 2009/07/29 17:45:45 1.2
+++ kern_physio.c 2009/08/14 11:21:19
@@ -285,7 +285,7 @@
DPRINTF(("%s: called: off=%" PRIu64 ", resid=%zu\n",
__func__, uio->uio_offset, uio->uio_resid));
- flags &= B_READ | B_WRITE;
+ flags &= B_READ | B_WRITE | B_TAPE;
/* Make sure we have a buffer, creating one if necessary. */
if (obp != NULL) {
@@ -375,6 +375,15 @@
* for later comparison.
*/
(*min_phys)(bp);
+ if (flags & B_TAPE) {
+ if (bp->b_bcount != iovp->iov_len) {
+ printf("TAPE physio - request %zu, but
supported by HW max %u\n",
+ iovp->iov_len, bp->b_bcount);
+ error = EINVAL; /* do no split
IO-request on tape devices ... */
+ goto done;
+
+ }
+ }
todo = bp->b_bufsize = bp->b_bcount;
#if defined(DIAGNOSTIC)
if (todo > MAXPHYS)
--- st.c 2009/08/14 09:06:59 1.1
+++ st.c 2009/08/14 09:09:13
@@ -1380,8 +1380,13 @@
{
struct st_softc *st = st_cd.cd_devs[STUNIT(dev)];
- return (physio(ststrategy, NULL, dev, B_READ,
- st->sc_periph->periph_channel->chan_adapter->adapt_minphys, uio));
+ if (st->flags & ST_FIXEDBLOCKS) {
+ return (physio(ststrategy, NULL, dev, B_READ,
+ st->sc_periph->periph_channel->chan_adapter->adapt_minphys,
uio));
+ } else {
+ return (physio(ststrategy, NULL, dev, B_READ | B_TAPE,
+ st->sc_periph->periph_channel->chan_adapter->adapt_minphys,
uio));
+ }
}
static int
@@ -1389,8 +1394,13 @@
{
struct st_softc *st = st_cd.cd_devs[STUNIT(dev)];
- return (physio(ststrategy, NULL, dev, B_WRITE,
- st->sc_periph->periph_channel->chan_adapter->adapt_minphys, uio));
+ if (st->flags & ST_FIXEDBLOCKS) {
+ return (physio(ststrategy, NULL, dev, B_WRITE,
+ st->sc_periph->periph_channel->chan_adapter->adapt_minphys,
uio));
+ } else {
+ return (physio(ststrategy, NULL, dev, B_WRITE | B_TAPE,
+ st->sc_periph->periph_channel->chan_adapter->adapt_minphys,
uio));
+ }
}
/*
>Unformatted:
Home |
Main Index |
Thread Index |
Old Index