NetBSD-Bugs archive


Re: kern/52769: hang with an ffs stored in an nvme device



The following reply was made to PR kern/52769; it has been noted by GNATS.

From: Paul Goyette <paul%whooppee.com@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: chs%NetBSD.org@localhost, netbsd-bugs%netbsd.org@localhost, martin%NetBSD.org@localhost
Subject: Re: kern/52769: hang with an ffs stored in an nvme device
Date: Fri, 16 Mar 2018 17:07:46 +0800 (+08)

 On Wed, 14 Mar 2018, Jaromír Doleček wrote:
 
 > >   q_nccbs = 0x20,
 > >   q_nccbs_avail = 0x21,
 >
 > This is highly suspicious; q_nccbs_avail should always be <= q_nccbs. It
 > is good that the driver deadlocked, since otherwise it would panic in
 > nvme_ccb_get() once it tried to get the nonexistent 33rd ccb from the
 > queue :)
 >
 > This got me thinking though. While the completion queue is being
 > processed, we currently don't reset q_nccbs_avail until after all finished
 > ccbs are processed. While this is running, any further I/O is rejected
 > with EAGAIN if all ccbs were taken and q_nccbs_avail was 0. By the time
 > the ccb counter is reset at the end of nvme_q_complete(), there is no
 > outstanding I/O left to trigger another lddone() and drain the queue, so
 > the driver ceases to process anything. This scenario matches the
 > described symptoms quite well.
 >
 > Can you please try the patch from
 > http://www.netbsd.org/~jdolecek/nvme_avail_put.diff ?
 
 Initial testing with this patch is looking good.  I'm currently running
 a 'cvs update' against the same tree in which I'm running a "build.sh
 -j24 release" and so far no hang.
 
 > It's compile tested only, so it might need some tweaks. The idea is to
 > reset the ccb counter immediately, so lddone() would be able to queue
 > another I/O while the completion queue is still being processed. This
 > should also fix the ccb leak on errors - e.g. nvme_ns_dobio() calls just
 > nvme_ccb_put() when bus_dmamap_load() fails, so q_nccbs_avail stays
 > decremented from nvme_ccb_get().
 
 Just based on reading the patch, it would appear to make sense to commit
 even if it doesn't completely fix the hang.  The patch might not be
 "sufficient" but it would appear to be "necessary".  :)
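
As a rough illustration of the error-path leak described in the quoted
text, here is a toy user-space model in C. It is not the nvme(4) driver
code and not the patch itself: struct toy_queue, toy_ccb_get(),
toy_ccb_put_old(), toy_ccb_put_new() and toy_fail_n() are hypothetical
names invented for this sketch, and the only assumption carried over from
the thread is that "get" decrements an availability counter while the old
"put" leaves it alone.

```c
#include <stdbool.h>

/* Toy user-space model; NOT the nvme(4) driver code. */
struct toy_queue {
	unsigned nccbs;		/* total ccbs, loosely like q_nccbs */
	unsigned nccbs_avail;	/* ccbs believed free, like q_nccbs_avail */
};

/* Take a ccb; fails (think EAGAIN) when the counter says none are free. */
static bool
toy_ccb_get(struct toy_queue *q)
{
	if (q->nccbs_avail == 0)
		return false;
	q->nccbs_avail--;
	return true;
}

/* Assumed old behaviour: the ccb is returned, the counter is not touched. */
static void
toy_ccb_put_old(struct toy_queue *q)
{
	(void)q;
}

/* Assumed new behaviour: the slot is handed back immediately on put. */
static void
toy_ccb_put_new(struct toy_queue *q)
{
	if (q->nccbs_avail < q->nccbs)
		q->nccbs_avail++;
}

/*
 * Simulate n submissions that fail right after the get (for example a
 * failed DMA map load), so each one does get followed directly by put.
 * Returns how many of the gets succeeded.
 */
static unsigned
toy_fail_n(struct toy_queue *q, unsigned n, void (*put)(struct toy_queue *))
{
	unsigned ok = 0;

	for (unsigned i = 0; i < n; i++) {
		if (toy_ccb_get(q)) {
			put(q);
			ok++;
		}
	}
	return ok;
}
```

With a queue of 32 ccbs, 40 failed submissions through toy_ccb_put_old()
leave nccbs_avail at 0 even though no I/O is outstanding, so every later
get fails. The same sequence through toy_ccb_put_new() leaves the counter
at 32. The upper-bound check in toy_ccb_put_new() is only a guard in this
model against the counter exceeding the total.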
 
 
 +------------------+--------------------------+----------------------------+
 | Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:          |
 | (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
 | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
 +------------------+--------------------------+----------------------------+
 

