Re: kern/56669: crash at MegaRAID SAS 9341-8i

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,6bone%6bone.informatik.uni-leipzig.de@localhost
Subject: Re: kern/56669: crash at MegaRAID SAS 9341-8i
From: mlelstv%serpens.de@localhost (Michael van Elst)
Date: Fri, 11 Mar 2022 10:10:02 +0000 (UTC)

The following reply was made to PR kern/56669; it has been noted by GNATS.

From: mlelstv%serpens.de@localhost (Michael van Elst)
To: gnats-bugs%netbsd.org@localhost
Cc: 
Subject: Re: kern/56669: crash at MegaRAID SAS 9341-8i
Date: Fri, 11 Mar 2022 10:05:16 -0000 (UTC)

 6bone%6bone.informatik.uni-leipzig.de@localhost writes:

 > [  4039.509027] mfii0: cmd timeout ccb 0xffffdb010bf67400
 > [  4039.509027] mfii0: cmd timeout ccb 0xffffdb010bf67320

 The RAID controller doesn't complete these requests, so the driver sees a timeout.
 This can be a firmware bug or just a slow drive (maybe retrying a bad sector?).

 > [  4039.509027] uvm_fault(0xffffffff8196bfa0, 0x0, 2) -> e
 > [  4039.509027] fatal page fault in supervisor mode
 > [  4039.509027] trap type 6 code 0x2 rip 0xffffffff8029ddd5 cs 0x8 rflags 0x10246 cr2 0xa8 ilevel 0 rsp 0xffffdb09104d4f58
 > [  4039.509027] curlwp 0xfffffcfc2c6e7280 pid 0.487 lowest kstack 0xffffdb09104d02c0

 > [  4039.529104] vpanic() at netbsd:vpanic+0x156
 > [  4039.549169] panic() at netbsd:panic+0x3c
 > [  4039.549169] trap() at netbsd:trap+0xb27
 > [  4039.549169] --- trap (number 6) ---
 > [  4039.559027] mfii_scrub_ccb() at netbsd:mfii_scrub_ccb+0x3
 > [  4039.559027] workqueue_worker() at netbsd:workqueue_worker+0xd7
 > [  4039.559027] cpu2: End traceback...

 The panic happens in mfii_scrub_ccb, which is called with a NULL ccb pointer:
 the faulting address (cr2) is 0xa8, and offset 0xa8 is ccb_cookie.
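
 To illustrate why a small fault address points at a NULL structure pointer,
 here is a minimal user-space sketch. The struct layout is hypothetical and
 padded only so that the member lands at 0xa8; it is not the real
 struct mfii_ccb, and member_address is a made-up helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout, padded only so that "cookie" lands at 0xa8.
 * This is NOT the real struct mfii_ccb.
 */
struct demo_ccb {
	unsigned char other_fields[0xa8];
	void *cookie;		/* stands in for ccb_cookie */
};

/* Compile-time check that the illustrative layout matches the offset. */
static_assert(offsetof(struct demo_ccb, cookie) == 0xa8,
    "cookie must sit at offset 0xa8 in this sketch");

/*
 * Address the CPU would touch for base->cookie, computed arithmetically
 * so that nothing is actually dereferenced.
 */
static uintptr_t
member_address(uintptr_t base)
{
	return base + offsetof(struct demo_ccb, cookie);
}
```

 With a NULL base the computed address is just the member offset, which is
 what shows up in cr2.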

 In mfii_abort_task (called by workqueue_worker):

                 accb = mfii_get_ccb(sc);
                 mfii_scrub_ccb(accb);
                 mfii_abort(sc, accb, periph->periph_target, ccb->ccb_smid,
                     MPII_SCSI_TASK_ABORT_TASK,
                     htole32(MFII_TASK_MGMT_FLAGS_PD));

                 accb->ccb_cookie = ccb;
                 accb->ccb_done = mfii_scsi_cmd_abort_done;

 This attempts to abort the timed-out request (ccb) by sending an ABORT_TASK request,
 but mfii_get_ccb may fail, and accb is then NULL when it is passed to mfii_scrub_ccb.

 In case of failure, the attempt needs to be retried later by scheduling another
 work item on sc_abort_wq. One CCB also needs to be reserved somewhere
 for this operation (the code reserves 4 CCBs, probably enough).
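
 The check-and-retry idea can be sketched in user space as follows. This is
 only a model under stated assumptions, not a patch for the driver: struct
 softc, struct ccb, NCCB, abort_pending and all functions here are
 hypothetical stand-ins, and the abort_pending flag merely models "another
 work item has been scheduled".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCCB 2			/* tiny pool so exhaustion is easy to trigger */

struct ccb {
	struct ccb *next;	/* free-list link */
	void *cookie;		/* stands in for ccb_cookie */
};

struct softc {
	struct ccb pool[NCCB];
	struct ccb *freelist;
	bool abort_pending;	/* models a re-scheduled abort work item */
	int aborts_sent;
};

static void
put_ccb(struct softc *sc, struct ccb *ccb)
{
	assert(ccb != NULL);
	ccb->next = sc->freelist;
	sc->freelist = ccb;
}

static void
sc_init(struct softc *sc)
{
	sc->freelist = NULL;
	sc->abort_pending = false;
	sc->aborts_sent = 0;
	for (size_t i = 0; i < NCCB; i++)
		put_ccb(sc, &sc->pool[i]);
}

/* Like the allocator described above, this may return NULL. */
static struct ccb *
get_ccb(struct softc *sc)
{
	struct ccb *ccb = sc->freelist;

	if (ccb != NULL)
		sc->freelist = ccb->next;
	return ccb;
}

/* Returns true if the abort was issued, false if it was deferred. */
static bool
abort_task(struct softc *sc, struct ccb *timed_out)
{
	struct ccb *accb = get_ccb(sc);

	if (accb == NULL) {
		/* No CCB free: remember to retry instead of using NULL. */
		sc->abort_pending = true;
		return false;
	}
	accb->cookie = timed_out;
	sc->aborts_sent++;
	return true;
}

/* Models the re-run work item: retry a previously deferred abort. */
static bool
retry_abort(struct softc *sc, struct ccb *timed_out)
{
	if (!sc->abort_pending)
		return false;
	sc->abort_pending = false;
	return abort_task(sc, timed_out);
}
```

 In this model a failed allocation only sets the pending flag, and the retry
 succeeds once a CCB has been returned to the pool. A dedicated reserved CCB
 would guarantee that the retry eventually finds one.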

 If that is fixed, the crash should go away. But if the timeout is caused by
 a firmware problem, it's also possible that even the ABORT_TASK request cannot
 stop the stalled request and disk I/O just stops.
