NetBSD-Bugs archive
Re: kern/52769: hang with an ffs stored in an nvme device
The following reply was made to PR kern/52769; it has been noted by GNATS.
From: Paul Goyette <paul%whooppee.com@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: chs%NetBSD.org@localhost, netbsd-bugs%netbsd.org@localhost, martin%NetBSD.org@localhost
Subject: Re: kern/52769: hang with an ffs stored in an nvme device
Date: Fri, 16 Mar 2018 17:07:46 +0800 (+08)
On Wed, 14 Mar 2018, Jaromír Doleček wrote:
> > q_nccbs = 0x20,
> > q_nccbs_avail = 0x21,
>
> This is highly suspicious, q_nccbs_avail should always be <= q_nccbs. Good
> that the driver deadlocked, it would panic in nvme_ccb_get() once it would
> try to get the nonexistent 33rd ccb from the queue :)
>
> This got me thinking though. If the completion queue is being processed, we
> currently don't reset q_nccbs_avail until after all finished ccbs are
> processed. While this is running, any further I/O would be skipped with
> EAGAIN, if all ccbs were taken and q_nccbs_avail was 0. When the ccb
> counter is reset at the end of nvme_q_complete(), there is no outstanding
> I/O any more which would trigger another lddone() and do the queue drain,
> so the driver ceases to process anything. This scenario matches the
> described symptoms quite well.
>
> Can you please try the patch from
> http://www.netbsd.org/~jdolecek/nvme_avail_put.diff ?
Initial testing with this patch is looking good. I'm currently running
a 'cvs update' against the same tree in which I'm running a "build.sh
-j24 release" and so far no hang.
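
For readers following the thread, the stall scenario quoted above can be modelled in a few lines of C. This is only a toy sketch and not the nvme(4) driver code: struct toy_queue and the toy_* functions are hypothetical names, there is no locking, and "retry one deferred request after each completion" merely stands in for what lddone() is described as doing. The eager flag chooses whether the availability counter is restored once after the whole completion pass, or as each ccb is returned.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model only; every name here is hypothetical, not the driver's. */
struct toy_queue {
    unsigned nccbs;        /* total ccbs owned by the queue */
    unsigned nccbs_avail;  /* ccbs that may be handed out right now */
    unsigned inflight;     /* submitted, not yet completed */
    unsigned deferred;     /* rejected with EAGAIN, waiting for a retry */
};

static void toy_init(struct toy_queue *q, unsigned n)
{
    q->nccbs = n;
    q->nccbs_avail = n;
    q->inflight = 0;
    q->deferred = 0;
}

/* Try to start one I/O; a rejected request is remembered as deferred. */
static int toy_submit(struct toy_queue *q)
{
    if (q->nccbs_avail == 0) {
        q->deferred++;
        return EAGAIN;
    }
    q->nccbs_avail--;
    q->inflight++;
    return 0;
}

/* Stand-in for the upper layer retrying queued I/O after a completion. */
static void toy_retry_deferred(struct toy_queue *q)
{
    if (q->deferred > 0) {
        q->deferred--;
        (void)toy_submit(q);   /* re-defers itself on EAGAIN */
    }
}

/*
 * Process the completions that were in flight on entry.
 * eager == false: counter restored once, after the whole pass.
 * eager == true:  counter restored as each ccb is returned.
 */
static void toy_complete_all(struct toy_queue *q, bool eager)
{
    unsigned n = q->inflight;
    unsigned done = 0;

    for (unsigned i = 0; i < n; i++) {
        q->inflight--;
        done++;
        if (eager)
            q->nccbs_avail++;
        toy_retry_deferred(q);
    }
    if (!eager)
        q->nccbs_avail += done;
    assert(q->nccbs_avail <= q->nccbs);
}

/* Stalled: work is waiting but nothing in flight will ever wake it. */
static bool toy_stalled(const struct toy_queue *q)
{
    return q->deferred > 0 && q->inflight == 0;
}
```

With two ccbs and three requests, the non-eager variant ends the completion pass with the third request still deferred, nothing in flight, and the counter back at 2, so nothing will ever retry it. The eager variant ends with the third request in flight.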
> It's compile tested only, so might need some tweaks. The idea is to reset
> the ccb counter immediately, so lddone() would be able to queue another
> I/O while the completion queue is still being processed. This should also
> fix the ccb leak on errors - e.g. nvme_ns_dobio() calls just nvme_ccb_put()
> when bus_dmamap_load() fails, so q_nccbs_avail stays decremented from
> nvme_ccb_get().
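
The error-path leak mentioned in the quoted text can be illustrated with a similar toy model. Again this is a sketch under assumptions, not the driver or the patch: struct toy_q, toy_ccb_get(), toy_ccb_put() and toy_failed_submit() are hypothetical, and the put_restores_count flag simply selects between "put only returns the ccb to the free list" and "put also restores the counter".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NCCBS 4   /* arbitrary size for the illustration */

/* Toy model only; every name here is hypothetical, not the driver's. */
struct toy_ccb {
    struct toy_ccb *next;
};

struct toy_q {
    struct toy_ccb ccbs[TOY_NCCBS];
    struct toy_ccb *free_list;
    unsigned nccbs_avail;
    bool put_restores_count;   /* true models the idea of the patch */
};

static void toy_q_init(struct toy_q *q, bool put_restores_count)
{
    q->free_list = NULL;
    for (size_t i = 0; i < TOY_NCCBS; i++) {
        q->ccbs[i].next = q->free_list;
        q->free_list = &q->ccbs[i];
    }
    q->nccbs_avail = TOY_NCCBS;
    q->put_restores_count = put_restores_count;
}

/* Hand out a ccb and decrement the counter, or return NULL. */
static struct toy_ccb *toy_ccb_get(struct toy_q *q)
{
    if (q->nccbs_avail == 0 || q->free_list == NULL)
        return NULL;
    struct toy_ccb *ccb = q->free_list;
    q->free_list = ccb->next;
    q->nccbs_avail--;
    return ccb;
}

/* Return a ccb to the free list; optionally restore the counter too. */
static void toy_ccb_put(struct toy_q *q, struct toy_ccb *ccb)
{
    ccb->next = q->free_list;
    q->free_list = ccb;
    if (q->put_restores_count) {
        q->nccbs_avail++;
        assert(q->nccbs_avail <= TOY_NCCBS);
    }
}

/*
 * A submission whose setup fails (standing in for a failed DMA map
 * load): the ccb goes straight back and never reaches the completion
 * path. Returns false if no ccb could be obtained at all.
 */
static bool toy_failed_submit(struct toy_q *q)
{
    struct toy_ccb *ccb = toy_ccb_get(q);
    if (ccb == NULL)
        return false;
    toy_ccb_put(q, ccb);
    return true;
}

/* Number of ccbs physically sitting on the free list. */
static unsigned toy_free_count(const struct toy_q *q)
{
    unsigned n = 0;
    for (const struct toy_ccb *c = q->free_list; c != NULL; c = c->next)
        n++;
    return n;
}
```

In the variant where put does not touch the counter, TOY_NCCBS failed submissions leave every ccb on the free list while the counter reads 0, so further gets fail. In the other variant the counter stays in step with the free list.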
Just based on reading the patch, it would appear to make sense to commit
even if it doesn't completely fix the hang. The patch might not be
"sufficient" but it would appear to be "necessary". :)
+------------------+--------------------------+----------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:          |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
+------------------+--------------------------+----------------------------+