NetBSD-Bugs archive


bin/42844: esiop(4)/siop(4) can lose cmd entries under resource shortage conditions



>Number:         42844
>Category:       bin
>Synopsis:       esiop(4)/siop(4) can lose cmd entries under resource shortage conditions
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Feb 18 22:20:00 +0000 2010
>Originator:     Michael L. Hitch
>Release:        NetBSD 5.99.24 and 5.0
>Organization:
        Montana State University
>Environment:
        
        
System: NetBSD netbsd0.msu.montana.edu 5.99.24 NetBSD 5.99.24 
(GENERIC-$Revision: 1.330 $) #7: Thu Feb 18 10:24:23 MST 2010 
mhitch%net4.msu.montana.edu@localhost:/home/mhitch/NetBSD-current/OBJ/alphaev56/home/mhitch/NetBSD-current/src/sys/arch/alpha/compile/LOCKDEBUG
 alpha
Architecture: alpha
Machine: alpha
>Description:
The siop(4) and esiop(4) drivers dynamically grow "resources" as needed, and
actually start out with none.  The resources in this case are the cmd
structures used for each disk I/O transaction.  The driver allocates a page
of memory, sets up cmd structures in that page, and adds them to the free
list.  The adapter openings are adjusted accordingly.
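
The growth step described above can be sketched in userland C with the
TAILQ macros from <sys/queue.h>.  This is only a model, not the driver
code: the names model_cmd, model_softc, model_grow and MODEL_PAGE_SIZE are
made up for illustration, malloc() stands in for the driver's page
allocation, and the openings counter is a plain integer.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

#define MODEL_PAGE_SIZE 4096	/* assumed page size, for the model only */

/* Hypothetical stand-in for the driver's per-transaction cmd structure. */
struct model_cmd {
	TAILQ_ENTRY(model_cmd) next;
	char payload[232];	/* placeholder for the per-command state */
};

TAILQ_HEAD(model_cmd_list, model_cmd);

/* Hypothetical stand-in for the driver softc. */
struct model_softc {
	struct model_cmd_list free_list;
	int nfree;		/* entries currently on free_list */
	int openings;		/* what the upper layer believes is available */
};

/* The model starts out with no resources at all. */
static void
model_init(struct model_softc *sc)
{
	TAILQ_INIT(&sc->free_list);
	sc->nfree = 0;
	sc->openings = 0;
}

/*
 * Allocate one "page", carve it into cmd structures, put them on the
 * free list, and raise the openings by the same amount.  Returns the
 * number of cmds added, or -1 if the allocation failed.  The page is
 * never freed; the model keeps it for the life of the process.
 */
static int
model_grow(struct model_softc *sc)
{
	struct model_cmd *page;
	int i, n;

	n = MODEL_PAGE_SIZE / (int)sizeof(struct model_cmd);
	page = malloc(MODEL_PAGE_SIZE);
	if (page == NULL)
		return -1;
	for (i = 0; i < n; i++) {
		struct model_cmd *cmd = &page[i];

		TAILQ_INSERT_TAIL(&sc->free_list, cmd, next);
		sc->nfree++;
	}
	sc->openings += n;
	return n;
}
```

After each call to model_grow(), nfree and openings rise together.  That
equality is the invariant the rest of this report is about.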

The problem starts when the driver tries to process an adapter request.  It
removes a cmd structure from the free list and proceeds to fill in the cmd to
perform the requested operation.  There are several places where the cmd setup
can fail, in which case the driver will indicate a resource failure or a driver
"stuffup" and terminate the request.  However, the driver fails to return the
now unused cmd to the free list, so the number of free items no longer
matches the adapter openings available.  The free list can then run out of
entries, at which point the driver returns a resource failure, and the upper
layer sleeps for a short time and retries the request later.  A particularly
bad failure can occur if bus_dmamap_load() fails due to another resource
failure (bus_dmamap_load() returns EAGAIN) and the driver reports
XS_DRIVER_STUFFUP.  This fails the I/O request, but doesn't appear to report
the failure back to the user program, and seems to result in corrupted data
in the file.  This can be hard to detect sometimes, since accessing the file
seems to use the data in the memory buffers, so it appears all right.  Once
the buffers have been flushed (either over time as other data replaces the
buffer, or by a system reboot), accessing the file returns the corrupted data.
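
The leak can be reproduced in a small userland model.  The sketch below is
not the driver code: model_cmd, model_softc and model_request_buggy are
hypothetical names, and the openings accounting is simplified on the
assumption that the upper layer takes an opening when it issues a request
and gets it back when the request is terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for the driver's cmd structure. */
struct model_cmd {
	TAILQ_ENTRY(model_cmd) next;
};

TAILQ_HEAD(model_cmd_list, model_cmd);

struct model_softc {
	struct model_cmd_list free_list;
	int nfree;		/* entries currently on free_list */
	int openings;		/* what the upper layer believes is available */
};

enum model_result { MODEL_OK, MODEL_SETUP_FAILED, MODEL_NO_CMD };

/* Put n caller-provided cmds on the free list and set openings to n. */
static void
model_init(struct model_softc *sc, struct model_cmd *cmds, int n)
{
	int i;

	TAILQ_INIT(&sc->free_list);
	for (i = 0; i < n; i++)
		TAILQ_INSERT_TAIL(&sc->free_list, &cmds[i], next);
	sc->nfree = n;
	sc->openings = n;
}

/*
 * Model of the request path before the fix.  A successful request is
 * treated as completing immediately, so its cmd goes back on the list.
 */
static enum model_result
model_request_buggy(struct model_softc *sc, int setup_fails)
{
	struct model_cmd *cmd = TAILQ_FIRST(&sc->free_list);

	if (cmd == NULL)
		return MODEL_NO_CMD;	/* free list exhausted */
	TAILQ_REMOVE(&sc->free_list, cmd, next);
	sc->nfree--;
	sc->openings--;			/* request is now in progress */

	if (setup_fails) {
		sc->openings++;		/* request terminated */
		/* BUG being modelled: cmd is never put back on free_list */
		return MODEL_SETUP_FAILED;
	}

	TAILQ_INSERT_TAIL(&sc->free_list, cmd, next);
	sc->nfree++;
	sc->openings++;
	return MODEL_OK;
}
```

In this model every failed setup leaves openings unchanged but shrinks the
free list by one.  After as many failures as there were cmd structures, the
free list is empty while the openings still say the adapter can take work.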
>How-To-Repeat:
Perform operations on files that write large amounts of data.  In my particular
case, the files are db(1) files with a few *very* large records (it's a perl
script that keeps large amounts of information in hashes that are all written
as a single record in the db database).  It will also occur when doing a cp
of the database file to another file (I was trying to get backup copies of the
files).  After a reboot, or time enough for the buffer cache to be flushed,
the perl script would fail trying to update the db files.  An easy way to
verify the corrupted data was to use dump to back up the files, restore
them at another location (I did it to a different system), and compare md5
sums of the original and restored files.
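
Where both copies are reachable from one machine, a byte-for-byte
comparison gives the same answer as comparing checksums.  The helper below
is a minimal sketch of that check and is not part of the original
procedure; streams_identical is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Compare two open streams from their current positions.
 * Returns 1 if they hold identical bytes up to end of file,
 * 0 if they differ or if a read error occurred on either stream.
 */
static int
streams_identical(FILE *a, FILE *b)
{
	int ca, cb;

	do {
		ca = getc(a);
		cb = getc(b);
		if (ca != cb)
			return 0;
	} while (ca != EOF);

	/* Both returned EOF together; make sure neither was an error. */
	return !ferror(a) && !ferror(b);
}
```

The streams would normally come from fopen(path, "rb") on the original and
the restored file.  As with the md5 comparison, the check is only
meaningful once the original is read back from disk rather than from the
buffer cache.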
>Fix:
When terminating the adapter request after the cmd has been removed from
the free list, put that cmd back on the free list before returning.  This
keeps the cmd structures from being lost, and keeps the number of free entries
in sync with the adapter openings.  Check the return from the bus_dmamap_load()
operations, and return XS_RESOURCE_SHORTAGE if the error return was EAGAIN.
One other minor change is to adjust one of the error messages that occurs
when bus_dmamap_load() fails.  One bus_dmamap_load() sets up the DMA
information for the cmd structure itself, and a second bus_dmamap_load() is
done for the actual data buffer.  Both messages indicated a 'cmd' failure;
change the second to indicate 'data'.  Then exactly which bus_dmamap_load()
failed can be determined.
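
The shape of the fix can be shown in a userland model.  This is a sketch
under stated assumptions, not the patch itself: every MODEL_ and model_
name is a hypothetical stand-in (for the XS_* codes, the CMDST_* status and
the request function), and the load_error argument takes the place of the
value returned by the DMA map load.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-ins for the XS_* error codes. */
enum model_xs_error {
	MODEL_XS_NOERROR,
	MODEL_XS_RESOURCE_SHORTAGE,
	MODEL_XS_DRIVER_STUFFUP
};

/* Hypothetical stand-ins for the cmd status values. */
enum model_cmd_status { MODEL_CMDST_FREE, MODEL_CMDST_READY };

struct model_cmd {
	TAILQ_ENTRY(model_cmd) next;
	enum model_cmd_status status;
};

TAILQ_HEAD(model_cmd_list, model_cmd);

struct model_softc {
	struct model_cmd_list free_list;
	int nfree;		/* entries currently on free_list */
};

/* Put n caller-provided cmds on the free list. */
static void
model_init(struct model_softc *sc, struct model_cmd *cmds, int n)
{
	int i;

	TAILQ_INIT(&sc->free_list);
	for (i = 0; i < n; i++) {
		cmds[i].status = MODEL_CMDST_FREE;
		TAILQ_INSERT_TAIL(&sc->free_list, &cmds[i], next);
	}
	sc->nfree = n;
}

/* EAGAIN is a temporary shortage; anything else is a hard failure. */
static enum model_xs_error
model_map_load_error(int error)
{
	return (error == EAGAIN) ? MODEL_XS_RESOURCE_SHORTAGE :
	    MODEL_XS_DRIVER_STUFFUP;
}

/*
 * Model of the request path with the fix applied.  load_error is 0 for
 * success or an errno value for a failed map load.  A successful request
 * is treated as completing immediately.
 */
static enum model_xs_error
model_request_fixed(struct model_softc *sc, int load_error)
{
	struct model_cmd *cmd = TAILQ_FIRST(&sc->free_list);
	enum model_xs_error result = MODEL_XS_NOERROR;

	if (cmd == NULL)
		return MODEL_XS_RESOURCE_SHORTAGE;
	TAILQ_REMOVE(&sc->free_list, cmd, next);
	sc->nfree--;
	cmd->status = MODEL_CMDST_READY;

	if (load_error != 0)
		result = model_map_load_error(load_error);

	/* On every exit path the cmd goes back on the free list. */
	cmd->status = MODEL_CMDST_FREE;
	TAILQ_INSERT_TAIL(&sc->free_list, cmd, next);
	sc->nfree++;
	return result;
}
```

With the cmd returned on every exit path, repeated failures leave nfree
unchanged, and an EAGAIN from the load is reported as a shortage that the
upper layer can retry instead of as a failed I/O.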

The patches I am now using have made my CS20 much more reliable:  I no longer
get large numbers of adapter resource shortage messages, and the failures to
load the DMA map no longer appear to corrupt my files.


Index: sys/dev/ic/esiop.c
===================================================================
RCS file: /cvsroot/src/sys/dev/ic/esiop.c,v
retrieving revision 1.42
diff -u -p -r1.42 esiop.c
--- sys/dev/ic/esiop.c  8 Apr 2008 12:07:26 -0000       1.42
+++ sys/dev/ic/esiop.c  16 Feb 2010 03:36:18 -0000
@@ -1555,6 +1555,7 @@ esiop_scsipi_request(chan, req, arg)
                                aprint_error_dev(&sc->sc_c.sc_dev, "can't malloc memory for "
                                    "target %d\n", target);
                                xs->error = XS_RESOURCE_SHORTAGE;
+                               TAILQ_INSERT_TAIL(&sc->free_list, esiop_cmd, next);
                                scsipi_done(xs);
                                splx(s);
                                return;
@@ -1581,6 +1582,7 @@ esiop_scsipi_request(chan, req, arg)
                                    "target %d lun %d\n",
                                    target, lun);
                                xs->error = XS_RESOURCE_SHORTAGE;
+                               TAILQ_INSERT_TAIL(&sc->free_list, esiop_cmd, next);
                                scsipi_done(xs);
                                splx(s);
                                return;
@@ -1598,7 +1600,10 @@ esiop_scsipi_request(chan, req, arg)
                if (error) {
                        aprint_error_dev(&sc->sc_c.sc_dev, "unable to load cmd DMA map: %d\n",
                            error);
-                       xs->error = XS_DRIVER_STUFFUP;
+                       xs->error = (error == EAGAIN) ? XS_RESOURCE_SHORTAGE :
+                                                       XS_DRIVER_STUFFUP;
+                       esiop_cmd->cmd_c.status = CMDST_FREE;
+                       TAILQ_INSERT_TAIL(&sc->free_list, esiop_cmd, next);
                        scsipi_done(xs);
                        splx(s);
                        return;
@@ -1610,9 +1615,12 @@ esiop_scsipi_request(chan, req, arg)
                            ((xs->xs_control & XS_CTL_DATA_IN) ?
                             BUS_DMA_READ : BUS_DMA_WRITE));
                        if (error) {
-                               aprint_error_dev(&sc->sc_c.sc_dev, "unable to load cmd DMA map: %d",
+                               aprint_error_dev(&sc->sc_c.sc_dev, "unable to load data DMA map: %d",
                                    error);
-                               xs->error = XS_DRIVER_STUFFUP;
+                               xs->error = (error == EAGAIN) ? XS_RESOURCE_SHORTAGE :
+                                                               XS_DRIVER_STUFFUP;
+                               esiop_cmd->cmd_c.status = CMDST_FREE;
+                               TAILQ_INSERT_TAIL(&sc->free_list, esiop_cmd, next);
                                scsipi_done(xs);
                                bus_dmamap_unload(sc->sc_c.sc_dmat,
                                    esiop_cmd->cmd_c.dmamap_cmd);
Index: sys/dev/ic/siop.c
===================================================================
RCS file: /cvsroot/src/sys/dev/ic/siop.c,v
retrieving revision 1.87
diff -u -p -r1.87 siop.c
--- sys/dev/ic/siop.c   8 Apr 2008 12:07:27 -0000       1.87
+++ sys/dev/ic/siop.c   16 Feb 2010 03:36:18 -0000
@@ -1281,6 +1281,7 @@ siop_scsipi_request(chan, req, arg)
                                aprint_error_dev(&sc->sc_c.sc_dev, "can't malloc memory for "
                                    "target %d\n", target);
                                xs->error = XS_RESOURCE_SHORTAGE;
+                               TAILQ_INSERT_TAIL(&sc->free_list, siop_cmd, next);
                                scsipi_done(xs);
                                splx(s);
                                return;
@@ -1300,6 +1301,7 @@ siop_scsipi_request(chan, req, arg)
                                aprint_error_dev(&sc->sc_c.sc_dev, "can't alloc lunsw for target %d\n",
                                    target);
                                xs->error = XS_RESOURCE_SHORTAGE;
+                               TAILQ_INSERT_TAIL(&sc->free_list, siop_cmd, next);
                                scsipi_done(xs);
                                splx(s);
                                return;
@@ -1317,6 +1319,7 @@ siop_scsipi_request(chan, req, arg)
                                    "target %d lun %d\n",
                                    target, lun);
                                xs->error = XS_RESOURCE_SHORTAGE;
+                               TAILQ_INSERT_TAIL(&sc->free_list, siop_cmd, next);
                                scsipi_done(xs);
                                splx(s);
                                return;
@@ -1334,7 +1337,10 @@ siop_scsipi_request(chan, req, arg)
                if (error) {
                        aprint_error_dev(&sc->sc_c.sc_dev, "unable to load cmd DMA map: %d\n",
                            error);
-                       xs->error = XS_DRIVER_STUFFUP;
+                       xs->error = (error == EAGAIN) ? XS_RESOURCE_SHORTAGE :
+                                                       XS_DRIVER_STUFFUP;
+                       siop_cmd->cmd_c.status = CMDST_FREE;
+                       TAILQ_INSERT_TAIL(&sc->free_list, siop_cmd, next);
                        scsipi_done(xs);
                        splx(s);
                        return;
@@ -1346,9 +1352,12 @@ siop_scsipi_request(chan, req, arg)
                            ((xs->xs_control & XS_CTL_DATA_IN) ?
                             BUS_DMA_READ : BUS_DMA_WRITE));
                        if (error) {
-                               aprint_error_dev(&sc->sc_c.sc_dev, "unable to load cmd DMA map: %d",
+                               aprint_error_dev(&sc->sc_c.sc_dev, "unable to load data DMA map: %d",
                                    error);
-                               xs->error = XS_DRIVER_STUFFUP;
+                               xs->error = (error == EAGAIN) ? XS_RESOURCE_SHORTAGE :
+                                                               XS_DRIVER_STUFFUP;
+                               siop_cmd->cmd_c.status = CMDST_FREE;
+                               TAILQ_INSERT_TAIL(&sc->free_list, siop_cmd, next);
                                scsipi_done(xs);
                                bus_dmamap_unload(sc->sc_c.sc_dmat,
                                    siop_cmd->cmd_c.dmamap_cmd);

>Unformatted:
        
        

