NetBSD-Bugs archive


Re: kern/56669: crash at MegaRAID SAS 9341-8i



The following reply was made to PR kern/56669; it has been noted by GNATS.

From: mlelstv%serpens.de@localhost (Michael van Elst)
To: gnats-bugs%netbsd.org@localhost
Cc: 
Subject: Re: kern/56669: crash at MegaRAID SAS 9341-8i
Date: Fri, 11 Mar 2022 10:05:16 -0000 (UTC)

 6bone%6bone.informatik.uni-leipzig.de@localhost writes:
 
 > [  4039.509027] mfii0: cmd timeout ccb 0xffffdb010bf67400
 > [  4039.509027] mfii0: cmd timeout ccb 0xffffdb010bf67320
 
 The RAID controller doesn't finish requests and the driver sees a timeout.
 This can be a firmware bug or just a slow drive (maybe retrying a bad sector?).
 
 > [  4039.509027] uvm_fault(0xffffffff8196bfa0, 0x0, 2) -> e
 > [  4039.509027] fatal page fault in supervisor mode
 > [  4039.509027] trap type 6 code 0x2 rip 0xffffffff8029ddd5 cs 0x8 rflags 0x10246 cr2 0xa8 ilevel 0 rsp 0xffffdb09104d4f58
 > [  4039.509027] curlwp 0xfffffcfc2c6e7280 pid 0.487 lowest kstack 0xffffdb09104d02c0
 
 > [  4039.529104] vpanic() at netbsd:vpanic+0x156
 > [  4039.549169] panic() at netbsd:panic+0x3c
 > [  4039.549169] trap() at netbsd:trap+0xb27
 > [  4039.549169] --- trap (number 6) ---
 > [  4039.559027] mfii_scrub_ccb() at netbsd:mfii_scrub_ccb+0x3
 > [  4039.559027] workqueue_worker() at netbsd:workqueue_worker+0xd7
 > [  4039.559027] cpu2: End traceback...
 
 The panic happens in mfii_scrub_ccb, which is called with a NULL ccb pointer
 (the faulting address in cr2, 0xa8, is the offset of ccb_cookie).
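 
 As a rough illustration of that reasoning: a member access through a pointer
 touches the address "base + member offset", so a NULL base faults at exactly
 the member's offset. The structure below is a hypothetical stand-in (the
 padding is made up, only the 0xa8 offset comes from the report), not the real
 struct mfii_ccb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the driver's ccb structure. Only the
 * position of ccb_cookie matters here; the padding is invented.
 */
struct fake_ccb {
	unsigned char	 pad[0xa8];	/* placeholder for earlier members */
	void		*ccb_cookie;	/* assumed to sit at offset 0xa8 */
};

/* Check the assumed layout at compile time. */
static_assert(offsetof(struct fake_ccb, ccb_cookie) == 0xa8,
    "ccb_cookie is expected at offset 0xa8 in this sketch");

/* Address the CPU would touch when accessing ccb_cookie through `base`. */
uintptr_t
cookie_fault_addr(uintptr_t base)
{
	return base + offsetof(struct fake_ccb, ccb_cookie);
}
```

 With a NULL base, cookie_fault_addr(0) yields 0xa8, which matches the cr2
 value in the trap message.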
 
 In mfii_abort_task (called by workqueue_worker):
 
                 accb = mfii_get_ccb(sc);
                 mfii_scrub_ccb(accb);
                 mfii_abort(sc, accb, periph->periph_target, ccb->ccb_smid,
                     MPII_SCSI_TASK_ABORT_TASK,
                     htole32(MFII_TASK_MGMT_FLAGS_PD));
                 
                 accb->ccb_cookie = ccb;
                 accb->ccb_done = mfii_scsi_cmd_abort_done;
 
 This attempts to abort the timed-out request (ccb) by sending an ABORT_TASK request,
 but mfii_get_ccb may fail, in which case accb is NULL.
 
 In case of failure, the attempt needs to be retried later by scheduling another
 work item on sc_abort_wq. One CCB also needs to be reserved somewhere
 for this operation (the code reserves 4 CCBs, probably enough).
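 
 A minimal userland sketch of that idea, not the actual driver fix: the abort
 path checks for a NULL CCB and records a retry instead of dereferencing it,
 and ordinary I/O is not allowed to take the last reserved CCB. All names here
 (struct softc, get_ccb, put_ccb, abort_task, NCCB, NCCB_RESERVED) are
 hypothetical, and the pool sizes are toy values. In the real driver the retry
 would be another work item on sc_abort_wq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCCB		8	/* total CCBs in this toy pool */
#define NCCB_RESERVED	1	/* held back for abort requests */

struct ccb {
	bool	in_use;
};

struct softc {
	struct ccb	ccbs[NCCB];
	int		nfree;		/* number of unused CCBs */
	int		abort_retries;	/* aborts deferred for lack of a CCB */
};

void
pool_init(struct softc *sc)
{
	for (int i = 0; i < NCCB; i++)
		sc->ccbs[i].in_use = false;
	sc->nfree = NCCB;
	sc->abort_retries = 0;
}

/* Take a CCB; normal I/O may not dip into the reserved ones. */
struct ccb *
get_ccb(struct softc *sc, bool for_abort)
{
	int floor = for_abort ? 0 : NCCB_RESERVED;

	if (sc->nfree <= floor)
		return NULL;
	for (int i = 0; i < NCCB; i++) {
		if (!sc->ccbs[i].in_use) {
			sc->ccbs[i].in_use = true;
			sc->nfree--;
			return &sc->ccbs[i];
		}
	}
	return NULL;
}

void
put_ccb(struct softc *sc, struct ccb *ccb)
{
	assert(ccb->in_use);
	ccb->in_use = false;
	sc->nfree++;
}

/* Returns true if the abort was issued, false if it must be retried. */
bool
abort_task(struct softc *sc)
{
	struct ccb *accb = get_ccb(sc, true);

	if (accb == NULL) {
		/* Real driver: schedule another work item on sc_abort_wq. */
		sc->abort_retries++;
		return false;
	}
	/* Real driver: scrub accb and send ABORT_TASK here. */
	put_ccb(sc, accb);	/* the toy model completes immediately */
	return true;
}
```

 The key point is the ordering: the NULL check comes before anything touches
 accb, so a pool that is exhausted defers the abort rather than panicking.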
 
 If that is fixed, the crash should go away. But if the timeout is caused by
 a firmware problem, it's also possible that even the ABORT_TASK request cannot
 stop the stalled request and disk I/O just stops.
 
 

