NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

kern/41889: st(4) driver splits larger IO request into peaces - compability problem with other OS's



>Number:         41889
>Category:       kern
>Synopsis:       st(4) driver splits larger IO request into peaces - 
>compability problem with other OS's
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          change-request
>Submitter-Id:   net
>Arrival-Date:   Fri Aug 14 11:50:00 +0000 2009
>Originator:     Wolfgang Stukenbrock
>Release:        NetBSD 4.0
>Organization:
Dr. Nagler & Company GmbH
        
>Environment:
        
        
System: NetBSD s012 4.0 NetBSD 4.0 (NSW-S012) #9: Fri Mar 13 12:31:52 CET 2009 
wgstuken@s012:/usr/src/sys/arch/amd64/compile/NSW-S012 amd64
Architecture: x86_64
Machine: amd64
>Description:
        The current implementation of physio() (in kern_physio.c) determines 
the maximum possible transfer size of the underlying HW
        be fist clamping to MAXPHYS and then asking the driver for further 
restrictions.
        This is fine for disk-IO, but leads to unexpected behaviour when 
exchangeing tapes with other operating systems.
        The "normal" semantic of a write(2) call on a tape device in 
variable-block-mode is, that exactly one tape block of that size
        is written to the media. (e.g. Solaris does it like this ...)
        If the tape device is not able to support that size, an error is 
returned.
        The current NetBSD implementation does not return an error in "all" 
cases - it will write multiple tape-blocks in some cases.
        This may break applications on other systems or other NetBSD-kernels if 
the MAXPHYS kernel options is used with a value other than the default.

        The default value for MAXPHYS is currently equal to MAXBSIZE - both are 
64k - for nearly all kernel configurations.
        Some HW-driver (e.g. the ata/sata stuff) allow only may 64k transfers 
too, but the scsi drivers allows larger blocksizes.
        By the way: the 64k-limit for the max tape-blocksize is a very hard 
restriction and does not make realy sense.
        In order to be able to read tapes written on Solaris with 1MB blocks, 
you need to set MAYPHYS to 1024k (and MAXBSIZE to 64k - otherwise you need to
        fix some other parts in the kernel ...) in order read the data from the 
tape.
        OK so far, but the normal NetBSD kernel is not able to read anything 
from such a tape - and you get a misleading error message ... ugly ..

        You may be also out of luck, if you try to write e.g. 64k+1byte. At 
least our VXA-tape requires a blocksize of at least 4 bytes and the splitted
        transfer will be only 1 byte -> error ... This is not what I expect 
from a variable block size device if the device supports blocks up to 240k!

        The next compability problem occures if you have an application that 
tries to write e.g. 128k (or variable sized) blocks and the
        application knows the number of blocks on the tape. If you try to read 
the tape on another OS, you suddently get only 64k blocks but much more ...


        For TAPE read and write operations, physio() should be changed in an 
manner, that it does not split transfers for tape devices.

        The following patch will do this for the read() and write() system call.
        It returns EINVAL and print a message on the console.
        The idea of printing the messages comes from st.c, where a messages is 
printed if the tape does not support the requested block size.

        The readv() and writev() calls are affected in the following way:
        - each single vector results in a sigle write operation as before
        - no single vector is splitted into peaces anymore
        I'm realy not shure about the correct behaviour of readv/writev() on 
tape devices - I've never used it there ...
        From my understanding the iovec structure should "only" allow the user 
to split the memory space in user-level of one read() or write() operation into 
        several peaces. But for tape devices it should result in an operation 
that act on exactly on tape-block - at least this is my understanding of it.
        This not the case in the current NetBSD implemantion and also not the 
case with this patch!
        But I'm not shure about the correct behaviour for readv() and writev() 
- so this patch only affects the single transfers of it...
        The only way to archive this behaviour would be to copy anything into 
one block, because some comments in physio() seems to point out some
        problems when several user-level memeory peaches resides inside the 
same physical page. At least as far as I understand it.

        remark: this fix will return an arror after transfering some data if 
readv() or writev() is used! If this is not acceptable, the size-check must be 
done
                prior the main transfer loop.

        In this patch the exiting flag B_TAPE is (re)used. It looks a relict of 
some days, because it is only defined in the header file and ask for in 
specfs-code
        at exactly one place.  It is not used in the whole NetBSD-sources at 
any other location.
        The name sounds good, so I took it. Feel free to define an other name 
if you prefer this - some bits are sill unused.

        If the default value for MAXPHYS should be changed to something more 
common for tape devices is a religion question.
        There will be a limit in all cases. I would prefere to change it to at 
least 256k, but you may have a different point.
        remark: the default value of MAXPHYS is not affected by this patch. You 
need to set it in your kernel config if you need larger block sizes.
                But keep in mind to set MAXBSIZE back to 64k for the filesystem 
- otherwise the kernel will not compile ....
>How-To-Repeat:
        Write 10 128k-blocks on a tape on NetBSD and try to read 10 blocks on 
Solaris. You will only read 640k from the tape.
        Try to write a 65537 bytes block on a variable-block tape device with a 
mininum block size (e.g. VXA-320) - it will fail with the message,
        that the transfer size should be between 4 and 240k bytes. And 65537 is 
...
>Fix:
        Apply the folloing patch to /usr/src/sys/kern/kern_physio.c and 
/usr/src/sys/dev/scsipi/st.c.
        read() and write() call will never be splitted into several transfers 
anymore.


--- kern_physio.c       2009/07/29 17:45:45     1.2
+++ kern_physio.c       2009/08/14 11:21:19
@@ -285,7 +285,7 @@
        DPRINTF(("%s: called: off=%" PRIu64 ", resid=%zu\n",
            __func__, uio->uio_offset, uio->uio_resid));
 
-       flags &= B_READ | B_WRITE;
+       flags &= B_READ | B_WRITE | B_TAPE;
 
        /* Make sure we have a buffer, creating one if necessary. */
        if (obp != NULL) {
@@ -375,6 +375,15 @@
                         * for later comparison.
                         */
                        (*min_phys)(bp);
+                       if (flags & B_TAPE) {
+                               if (bp->b_bcount != iovp->iov_len) {
+                                       printf("TAPE physio - request %zu, but 
supported by HW max %u\n",
+                                               iovp->iov_len, bp->b_bcount);
+                                       error = EINVAL; /* do no split 
IO-request on tape devices ... */
+                                       goto done;
+
+                               }
+                       }
                        todo = bp->b_bufsize = bp->b_bcount;
 #if defined(DIAGNOSTIC)
                        if (todo > MAXPHYS)
--- st.c        2009/08/14 09:06:59     1.1
+++ st.c        2009/08/14 09:09:13
@@ -1380,8 +1380,13 @@
 {
        struct st_softc *st = st_cd.cd_devs[STUNIT(dev)];
 
-       return (physio(ststrategy, NULL, dev, B_READ,
-           st->sc_periph->periph_channel->chan_adapter->adapt_minphys, uio));
+       if (st->flags & ST_FIXEDBLOCKS) {
+               return (physio(ststrategy, NULL, dev, B_READ,
+                   st->sc_periph->periph_channel->chan_adapter->adapt_minphys, 
uio));
+       } else {
+               return (physio(ststrategy, NULL, dev, B_READ | B_TAPE,
+                   st->sc_periph->periph_channel->chan_adapter->adapt_minphys, 
uio));
+       }
 }
 
 static int
@@ -1389,8 +1394,13 @@
 {
        struct st_softc *st = st_cd.cd_devs[STUNIT(dev)];
 
-       return (physio(ststrategy, NULL, dev, B_WRITE,
-           st->sc_periph->periph_channel->chan_adapter->adapt_minphys, uio));
+       if (st->flags & ST_FIXEDBLOCKS) {
+               return (physio(ststrategy, NULL, dev, B_WRITE,
+                   st->sc_periph->periph_channel->chan_adapter->adapt_minphys, 
uio));
+       } else {
+               return (physio(ststrategy, NULL, dev, B_WRITE | B_TAPE,
+                   st->sc_periph->periph_channel->chan_adapter->adapt_minphys, 
uio));
+       }
 }
 
 /*

>Unformatted:
        
        


Home | Main Index | Thread Index | Old Index