NetBSD-Bugs archive
Re: kern/56669: crash at MegaRAID SAS 9341-8i
The following reply was made to PR kern/56669; it has been noted by GNATS.
From: mlelstv%serpens.de@localhost (Michael van Elst)
To: gnats-bugs%netbsd.org@localhost
Cc:
Subject: Re: kern/56669: crash at MegaRAID SAS 9341-8i
Date: Fri, 11 Mar 2022 10:05:16 -0000 (UTC)
6bone%6bone.informatik.uni-leipzig.de@localhost writes:
> [ 4039.509027] mfii0: cmd timeout ccb 0xffffdb010bf67400
> [ 4039.509027] mfii0: cmd timeout ccb 0xffffdb010bf67320
The RAID controller doesn't finish requests and the driver sees a timeout.
This can be a firmware bug or just a slow drive (maybe retrying a bad sector?).
> [ 4039.509027] uvm_fault(0xffffffff8196bfa0, 0x0, 2) -> e
> [ 4039.509027] fatal page fault in supervisor mode
> [ 4039.509027] trap type 6 code 0x2 rip 0xffffffff8029ddd5 cs 0x8 rflags 0x10246 cr2 0xa8 ilevel 0 rsp 0xffffdb09104d4f58
> [ 4039.509027] curlwp 0xfffffcfc2c6e7280 pid 0.487 lowest kstack 0xffffdb09104d02c0
> [ 4039.529104] vpanic() at netbsd:vpanic+0x156
> [ 4039.549169] panic() at netbsd:panic+0x3c
> [ 4039.549169] trap() at netbsd:trap+0xb27
> [ 4039.549169] --- trap (number 6) ---
> [ 4039.559027] mfii_scrub_ccb() at netbsd:mfii_scrub_ccb+0x3
> [ 4039.559027] workqueue_worker() at netbsd:workqueue_worker+0xd7
> [ 4039.559027] cpu2: End traceback...
The panic happens in mfii_scrub_ccb, which is called with a NULL ccb pointer.
The faulting address (cr2 0xa8) is the offset of ccb_cookie within the ccb.
In mfii_abort_task (called by workqueue_worker):
    accb = mfii_get_ccb(sc);
    mfii_scrub_ccb(accb);
    mfii_abort(sc, accb, periph->periph_target, ccb->ccb_smid,
        MPII_SCSI_TASK_ABORT_TASK,
        htole32(MFII_TASK_MGMT_FLAGS_PD));
    accb->ccb_cookie = ccb;
    accb->ccb_done = mfii_scsi_cmd_abort_done;
This attempts to abort the timed-out request (ccb) by sending an ABORT_TASK request,
but mfii_get_ccb may fail, and accb is then NULL when it is passed to mfii_scrub_ccb.
In case of failure, the attempt needs to be retried later by scheduling another
work item on sc_abort_wq. One CCB also needs to be reserved somewhere
for this operation (the code reserves 4 CCBs, which is probably enough).
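As a rough illustration of the retry idea (not the actual driver change), the
following standalone sketch models "check the allocation, and defer if it
fails". Every name in it (struct ccb, struct softc, get_ccb, put_ccb,
abort_task, the requeued counter) is a hypothetical stand-in rather than the
real mfii code, and the counter only marks the spot where a driver would
presumably schedule another work item on its abort workqueue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real mfii structures. */
struct ccb {
	struct ccb *next;	/* free-list link */
	void *cookie;		/* timed-out request this abort refers to */
};

struct softc {
	struct ccb *free_list;	/* pool of free CCBs */
	int requeued;		/* number of deferred abort attempts */
};

/* Take a CCB from the pool; returns NULL when the pool is empty. */
struct ccb *
get_ccb(struct softc *sc)
{
	struct ccb *c = sc->free_list;

	if (c != NULL)
		sc->free_list = c->next;
	return c;
}

/* Return a CCB to the pool. */
void
put_ccb(struct softc *sc, struct ccb *c)
{
	assert(c != NULL);
	c->next = sc->free_list;
	sc->free_list = c;
}

/*
 * Try to start an abort for a timed-out request.
 * Returns the CCB carrying the abort, or NULL if no CCB was available
 * and the attempt was deferred.
 */
struct ccb *
abort_task(struct softc *sc, struct ccb *timed_out)
{
	struct ccb *accb = get_ccb(sc);

	if (accb == NULL) {
		/* Assumption: a real driver would re-enqueue the work here. */
		sc->requeued++;
		return NULL;
	}

	/* Only touch accb after the NULL check. */
	accb->next = NULL;
	accb->cookie = timed_out;
	return accb;
}
```

With an empty pool, abort_task returns NULL and bumps the counter instead of
dereferencing a NULL pointer; once a CCB has been returned with put_ccb, the
same call succeeds and records the timed-out request in the cookie field.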
If that is fixed, the crash should go away. But if the timeout is caused by
a firmware problem, it's also possible that even the ABORT_TASK request cannot
stop the stalled request and disk I/O just stops.