Re: kern/52769: hang with an ffs stored in an nvme device

To: chs%NetBSD.org@localhost, gnats-admin%netbsd.org@localhost, netbsd-bugs%netbsd.org@localhost, martin%NetBSD.org@localhost
Subject: Re: kern/52769: hang with an ffs stored in an nvme device
From: Jaromír Doleček <jaromir.dolecek%gmail.com@localhost>
Date: Wed, 14 Mar 2018 21:50:01 +0000 (UTC)

The following reply was made to PR kern/52769; it has been noted by GNATS.

From: Jaromír Doleček <jaromir.dolecek%gmail.com@localhost>
To: Chuck Silvers <chuq%chuq.com@localhost>
Cc: gnats-bugs%netbsd.org@localhost, kern-bug-people%netbsd.org@localhost, gnats-admin%netbsd.org@localhost, 
	netbsd-bugs%netbsd.org@localhost
Subject: Re: kern/52769: hang with an ffs stored in an nvme device
Date: Wed, 14 Mar 2018 22:49:39 +0100

 --001a11444b4424f9f10567665ac6
 Content-Type: text/plain; charset="UTF-8"

 2017-12-01 19:45 GMT+01:00 Chuck Silvers <chuq%chuq.com@localhost>:
 >   q_nccbs = 0x20,
 >   q_nccbs_avail = 0x21,

 This is highly suspicious: q_nccbs_avail should always be <= q_nccbs. Good
 that the driver deadlocked; it would have panicked in nvme_ccb_get() as soon
 as it tried to get the nonexistent 33rd ccb from the queue :)

 This got me thinking though. While the completion queue is being processed, we
 currently don't reset q_nccbs_avail until after all finished ccbs have been
 processed. While this is running, any further I/O would be refused with
 EAGAIN if all ccbs were taken and q_nccbs_avail was 0. By the time the ccb
 counter is reset at the end of nvme_q_complete(), there is no outstanding
 I/O left that would trigger another lddone() and drain the queue,
 so the driver ceases to process anything. This scenario matches the
 described symptoms quite well.

 Can you please try the patch from
 http://www.netbsd.org/~jdolecek/nvme_avail_put.diff ?

 It's compile tested only, so it might need some tweaks. The idea is to reset
 the ccb counter immediately, so lddone() would be able to queue another
 I/O while the completion queue is still being processed. This should also
 fix the ccb leak on errors - e.g. nvme_ns_dobio() just calls nvme_ccb_put()
 when bus_dmamap_load() fails, so q_nccbs_avail stays decremented from
 nvme_ccb_get().

 Jaromir

 --001a11444b4424f9f10567665ac6--

Follow-Ups:
- Re: kern/52769: hang with an ffs stored in an nvme device
  - From: Paul Goyette
