Re: Help with issue with mpt(4) driver

To: Brian Buhrow <buhrow%nfbcal.org@localhost>
Subject: Re: Help with issue with mpt(4) driver
From: Eduardo Horvath <eeh%NetBSD.org@localhost>
Date: Tue, 15 Jan 2013 16:37:12 +0000 (UTC)

On Mon, 14 Jan 2013, Brian Buhrow wrote:

>       Hello.  I'm working on some patches to make the LSI Fusion SCSI driver
> (mpt(4)) more robust.  I'm making good progress, but I've run into an
> issue that has momentarily baffled me.  If I get a bunch of concurrent jobs
> running on a filesystem mounted on a raid set using disks across two
> mpt(4) instances, they get into a state where they become deadlocked and
> all but one of the processes are stuck in tstile, and the remaining
> process is in uvn_fp2.  All the processes are trying to read the same file
> in the filesystem, not write it, but read it.  I have a debug version of
> the kernel, and the machine is running, and other operations against the
> filesystem work fine and complete successfully. I'm assuming the problem is
> something I've introduced into the mpt(4) driver, though I'm not sure how
> at the moment, since I've not been able to reproduce it in an alternative
> environment.
>       When a process gets into uvn_fp2 state, it's waiting for something to
> find its pages.  Is there a way to figure out what it's waiting for and
> which underlying kernel process the uvn_fp2 call is expecting to wake it
> up?
> 
> Any help on this issue would be greatly appreciated.  I can give a lot more
> details if someone is interested.

If you take a look in uvn_findpage() you'll see that the wait address for 
uvn_fp2 should be the page structure itself.  You can dump the page 
structure and look at the flags and the lock structure to figure out what 
state it's in.

Given that you're fiddling around with mpt, the most likely reason for 
this sort of behavior is that a disk transaction has been lost.  The 
operation may have been lost because of some locking issue in the 
completion callback, but most likely the firmware lost track of the 
operation.  

If you're writing a SCSI driver properly, you should have a list of all 
outstanding operations, and each should have a timeout associated with it 
so the driver can determine it's been dropped somewhere and can be aborted 
and retried.  The NetBSD mpt driver does not appear to do that.  This 
tends to be a problem with LSI's drivers.  They like to assume that the 
firmware is faultless, something that is usually not the case.
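
As a rough illustration of the bookkeeping described above, here is a
minimal user-space sketch, not code from the mpt driver.  All names
(cmd_table, cmd_issue, cmd_scan_timeouts, MAX_CMDS) are hypothetical, and
time is just an integer tick count supplied by the caller.  A real driver
would also need locking and an actual abort/retry path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CMDS 8              /* hypothetical limit, for illustration only */

struct pending_cmd {
    bool     in_use;            /* slot holds a command the firmware owns */
    uint64_t deadline;          /* absolute tick at which we give up waiting */
    int      retries;           /* how many times this command was reissued */
};

struct cmd_table {
    struct pending_cmd cmds[MAX_CMDS];
};

/* Record a newly issued command; returns its slot index, or -1 if full. */
static int cmd_issue(struct cmd_table *t, uint64_t now, uint64_t timeout)
{
    for (int i = 0; i < MAX_CMDS; i++) {
        if (!t->cmds[i].in_use) {
            t->cmds[i].in_use = true;
            t->cmds[i].deadline = now + timeout;
            t->cmds[i].retries = 0;
            return i;
        }
    }
    return -1;
}

/* Called from the completion path when the firmware returns the command. */
static void cmd_complete(struct cmd_table *t, int id)
{
    assert(id >= 0 && id < MAX_CMDS);
    t->cmds[id].in_use = false;
}

/*
 * Called periodically.  Stores the indices of overdue commands in
 * expired[] and returns how many were found; the caller would then
 * abort and retry each of them.
 */
static int cmd_scan_timeouts(const struct cmd_table *t, uint64_t now,
                             int expired[MAX_CMDS])
{
    int n = 0;

    for (int i = 0; i < MAX_CMDS; i++) {
        if (t->cmds[i].in_use && now >= t->cmds[i].deadline)
            expired[n++] = i;
    }
    return n;
}
```

With a table like this, a command the firmware has silently dropped shows
up in the periodic scan instead of leaving a process asleep forever.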

I generally allocate an array for outstanding commands and use the array 
index for the identifier I give to the firmware.  Of course, this does 
put a hard limit on the number of outstanding commands at any one time.  
But if the array fills up it can be reallocated on the fly without losing 
outstanding command IDs.  
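
The array-index-as-identifier idea might look something like the sketch
below.  This is a user-space approximation under stated assumptions: the
names (id_table, id_alloc, id_lookup, id_free) are made up, realloc()
stands in for whatever allocator a kernel driver would use, and locking is
omitted.  The point it shows is that growing the array only appends slots,
so an ID already handed to the firmware still refers to the same command.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct slot {
    void *ctx;                  /* per-command context; NULL means free */
};

struct id_table {
    struct slot *slots;
    size_t       cap;
};

/* Double the table (or create it).  Existing indices keep their contents. */
static int id_table_grow(struct id_table *t)
{
    size_t ncap = t->cap ? t->cap * 2 : 4;
    struct slot *ns = realloc(t->slots, ncap * sizeof(*ns));

    if (ns == NULL)
        return -1;              /* old table is untouched and still valid */
    for (size_t i = t->cap; i < ncap; i++)
        ns[i].ctx = NULL;       /* mark only the new tail as free */
    t->slots = ns;
    t->cap = ncap;
    return 0;
}

/* Store ctx in a free slot and return its index (the command ID), or -1. */
static long id_alloc(struct id_table *t, void *ctx)
{
    size_t first_new;

    assert(ctx != NULL);
    for (size_t i = 0; i < t->cap; i++) {
        if (t->slots[i].ctx == NULL) {
            t->slots[i].ctx = ctx;
            return (long)i;
        }
    }
    first_new = t->cap;         /* table full: grow and use the first new slot */
    if (id_table_grow(t) != 0)
        return -1;
    t->slots[first_new].ctx = ctx;
    return (long)first_new;
}

/* Map an ID reported by the firmware back to its context (NULL if unknown). */
static void *id_lookup(const struct id_table *t, long id)
{
    if (id < 0 || (size_t)id >= t->cap)
        return NULL;
    return t->slots[id].ctx;
}

/* Release an ID once its command has completed or been aborted. */
static void id_free(struct id_table *t, long id)
{
    assert(id >= 0 && (size_t)id < t->cap);
    t->slots[id].ctx = NULL;
}
```

One caveat with this layout: because realloc() may move the array, code
should hold on to indices rather than pointers to individual slots across
any call that can grow the table.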

You also need to be careful with command timeouts on certain devices.  
While a one- or two-minute timeout should be plenty for a disk-type device, 
some operations on SCSI tape drives can take hours to complete.
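
One way to express this is to pick the timeout from the device type rather
than using a single constant.  The sketch below is only an illustration:
the function name and the concrete values are assumptions (two minutes for
disks, following the figure above, and an arbitrary four hours for tapes).
The type codes 0x00 (direct access) and 0x01 (sequential access) are the
standard SCSI peripheral device types.

```c
#include <assert.h>

/* Standard SCSI peripheral device type codes. */
#define DEVTYPE_DIRECT_ACCESS     0x00   /* disks */
#define DEVTYPE_SEQUENTIAL_ACCESS 0x01   /* tapes */

/* Illustrative values only; a real driver would tune these per command. */
#define DISK_TIMEOUT_SECS (2u * 60u)
#define TAPE_TIMEOUT_SECS (4u * 60u * 60u)

/* Return the command timeout, in seconds, for a given device type. */
static unsigned int cmd_timeout_secs(int devtype)
{
    assert(devtype >= 0);
    switch (devtype) {
    case DEVTYPE_SEQUENTIAL_ACCESS:
        return TAPE_TIMEOUT_SECS;   /* rewinds, erases etc. can run for hours */
    case DEVTYPE_DIRECT_ACCESS:
    default:
        return DISK_TIMEOUT_SECS;
    }
}
```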

Eduardo

Follow-Ups:
- Re: Help with issue with mpt(4) driver
  - From: Manuel Bouyer

References:
- Help with issue with mpt(4) driver
  - From: Brian Buhrow
