mpt(4) timeout recovery improvements

To: tech-kern%netbsd.org@localhost
Subject: mpt(4) timeout recovery improvements
From: Edgar Fuß <ef%math.uni-bonn.de@localhost>
Date: Sat, 23 Nov 2013 19:07:13 +0100

A while ago I asked

> Since my mpt(4) controller loses one of its attached discs every few weeks, 
> needing a reboot and a twenty-hour RAID reconstruction, I'm thinking about 
> switching to some mpii(4)-based SAS controller.
> 
> Does someone use mpii(4) in production? Is this ready to put 250 people's 
> home and mail dirs on?
And since no-one answered, I guess it's not ready yet and I had better stick to mpt.

In the meantime, Brian Buhrow has kindly provided me with some patches that 
apparently improve things for him. However, I'm not sure whether these patches 
do the right thing; at places, I'm quite sure they don't do the right thing for 
me.

So I'm trying to tackle this myself (based on Brian's patch) and am making some 
progress.
I need some advice from people with a better knowledge of LSI's MPI/MPT (does 
anyone have docs on that?) and a better understanding of the scsipi/driver 
interaction (is there documentation on this?)

The sequence of events on a failure in my case is

1. I get SAS link down/link up events
2. I get mpt timeouts on that disc
3. RAIDframe fails the disc
4. Occasionally, I have silent FFS corruption

After that, I need a reboot to get the SAS channel working again, followed by 
a twenty-hour RAID reconstruction.

I suspect the root cause for the link loss is a hardware issue. Unfortunately, 
I'm unsuccessful in tracking it down, so I currently have to live with it and 
work around it as best I can.

The timeout presumably originates from the link loss.
The inability to recover must be either an MPT firmware bug or a deficiency in 
mpt(4) or both.

The FFS corruption may be connected to improper timeout handling or be a 
separate issue.


I'm trying to simulate the failure by (on an identical machine) running dd 
and un-plugging and re-inserting the disc. With an unpatched kernel, I get 
the same symptoms: timeout and a stuck SAS channel.

It seems to be possible to recover by, on the timeout, resetting 
(mpt_soft_reset()) and re-initializing (mpt_init()) the IOC and returning all 
current commands to the scsipi layer.
Is there a less intrusive way to reset just the one affected SAS channel?

Now, what's the correct way to reset/init the IOC and return everything 
to scsipi? I guess the correct order is to reset (which leaves the IOC in 
the stopped state), then to set xs->error and call scsipi_done(xs) on all 
pending operations, and then to init the IOC (which empties the request queue).
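
The order guessed at above can be written down as a small userland model. 
This is only a sketch of the sequence, not driver code: every type and 
function below (struct ioc, struct xfer, model_soft_reset(), model_done(), 
model_init(), model_recover()) is a hypothetical stand-in, and XFER_TIMEOUT 
merely plays the role of whatever xs->error value turns out to be appropriate.

```c
/* Userland model only: all names here are hypothetical stand-ins,
 * not the real mpt(4) or scsipi interfaces. */
#include <assert.h>
#include <stddef.h>

enum ioc_state { IOC_RUNNING, IOC_STOPPED };
enum xfer_error { XFER_OK, XFER_TIMEOUT };  /* stand-in for xs->error values */

struct xfer {                   /* stand-in for a scsipi transfer */
	enum xfer_error error;
	int done;               /* set once handed back to the upper layer */
};

#define MAX_PENDING 8

struct ioc {                    /* stand-in for the controller state */
	enum ioc_state state;
	struct xfer *pending[MAX_PENDING];
	size_t npending;
};

/* Stand-in for mpt_soft_reset(): leaves the IOC stopped. */
static void model_soft_reset(struct ioc *ioc)
{
	ioc->state = IOC_STOPPED;
}

/* Stand-in for scsipi_done(): returns the transfer to the upper layer. */
static void model_done(struct xfer *xs)
{
	xs->done = 1;
}

/* Stand-in for mpt_init(): empties the request queue, restarts the IOC. */
static void model_init(struct ioc *ioc)
{
	ioc->npending = 0;
	ioc->state = IOC_RUNNING;
}

/* Reset, hand every pending transfer back with an error, then init.
 * Returns the number of transfers handed back. */
static size_t model_recover(struct ioc *ioc)
{
	size_t i, returned = 0;

	model_soft_reset(ioc);
	for (i = 0; i < ioc->npending; i++) {
		ioc->pending[i]->error = XFER_TIMEOUT;
		model_done(ioc->pending[i]);
		returned++;
	}
	/* Nothing above may have restarted the IOC behind our back. */
	assert(ioc->state == IOC_STOPPED);
	model_init(ioc);
	return returned;
}
```

The model deliberately keeps the IOC stopped while transfers are handed 
back, so that nothing can reach the hardware before it is re-initialized; 
whether the real driver can guarantee that is exactly what is in question.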

First question: what's the appropriate xs->error? XS_TIMEOUT seems to work, 
but doesn't seem correct (save for the original timed-out request, of course). 
Is there some XS_NEVER_MIND_JUST_TRY_AGAIN code?

Second question: When repeatedly calling scsipi_done(), can it happen that 
scsipi tries to re-queue these requests before I return? I would then lose 
them when re-initializing the IOC.
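
To make the worry concrete, here is a toy model of the hazard. It assumes, 
purely for illustration, that the completion callback re-submits the request 
synchronously; whether scsipi actually behaves like this is the open 
question. All names (struct queue, struct req, model_submit(), 
model_done_and_retry(), model_reinit(), model_lossy_recover()) are 
hypothetical and do not correspond to real kernel interfaces.

```c
/* Toy model of the re-queue hazard; every name is a hypothetical stand-in. */
#include <assert.h>
#include <stddef.h>

#define QUEUE_SLOTS 4

struct req {
	int id;
	int resubmitted;        /* set when the upper layer retried it */
};

struct queue {                  /* stand-in for the IOC request queue */
	struct req *slot[QUEUE_SLOTS];
	size_t n;
};

/* Put a request on the queue; returns -1 if the queue is full. */
static int model_submit(struct queue *q, struct req *r)
{
	if (q->n == QUEUE_SLOTS)
		return -1;
	q->slot[q->n++] = r;
	return 0;
}

/* ASSUMPTION: completion immediately retries the request. */
static void model_done_and_retry(struct queue *q, struct req *r)
{
	r->resubmitted = 1;
	(void)model_submit(q, r);
}

/* Stand-in for re-initialization: drops whatever is on the queue
 * and reports how many requests were dropped. */
static size_t model_reinit(struct queue *q)
{
	size_t dropped = q->n;

	q->n = 0;
	return dropped;
}

/* Complete all pending requests, then re-initialize.
 * Returns the number of retried requests that re-init silently dropped. */
static size_t model_lossy_recover(struct queue *q)
{
	struct req *snap[QUEUE_SLOTS];
	size_t i, n = q->n;

	for (i = 0; i < n; i++)         /* detach the pending requests */
		snap[i] = q->slot[i];
	q->n = 0;
	for (i = 0; i < n; i++)         /* each completion re-queues */
		model_done_and_retry(q, snap[i]);
	return model_reinit(q);         /* the retries are lost here */
}
```

Under that assumption every retried request is dropped by the 
re-initialization, which is the loss described above.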

Third question: Do I need to care about xs->xs_callout?

Or is returning everything to scsipi simply the wrong approach?

Any comments or better ideas to recover?
