Re: kern/52769: hang with an ffs stored in an nvme device

2017-12-01 19:45 GMT+01:00 Chuck Silvers <chuq%chuq.com@localhost>:
> q_nccbs = 0x20,
> q_nccbs_avail = 0x21,

This is highly suspicious: q_nccbs_avail should always be <= q_nccbs. It's good that the driver deadlocked; otherwise it would have panicked in nvme_ccb_get() once it tried to get the nonexistent 33rd ccb from the queue :)
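
To make the invariant concrete, here is a minimal userland sketch using the two values from the dump. The struct and the dump_* helpers are hypothetical stand-ins, not the driver's real struct nvme_queue; in the kernel, a check like this would more likely be a KASSERT().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy copy of the two counters from the dump; not the real struct nvme_queue. */
struct dump_q {
	unsigned nccbs;		/* ccbs allocated for the queue */
	unsigned nccbs_avail;	/* ccbs the counter claims are free */
};

/* The invariant: the counter may never exceed the number of ccbs. */
static bool
dump_invariant_ok(const struct dump_q *q)
{
	return q->nccbs_avail <= q->nccbs;
}

/* How many ccbs the counter promises that cannot actually exist. */
static unsigned
dump_phantom_ccbs(const struct dump_q *q)
{
	return dump_invariant_ok(q) ? 0 : q->nccbs_avail - q->nccbs;
}

/* Userland stand-in for a kernel assertion; aborts on a corrupted counter. */
static void
dump_check(const struct dump_q *q)
{
	assert(dump_invariant_ok(q));
}
```

With q_nccbs = 0x20 and q_nccbs_avail = 0x21, the invariant fails and there is exactly one phantom ccb, the 33rd one.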

This got me thinking though. While the completion queue is being processed, we currently don't reset q_nccbs_avail until after all finished ccbs have been processed. While this is running, any further I/O is rejected with EAGAIN if all ccbs were taken and q_nccbs_avail was 0. By the time the ccb counter is reset at the end of nvme_q_complete(), there is no outstanding I/O left that would trigger another lddone() and drain the queue, so the driver ceases to process anything. This scenario matches the described symptoms quite well.
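
The stall can be reproduced in a toy model. This is only a sketch of the scenario as described above, not the driver code: struct stall_q and all stall_* functions are made up for illustration, and all pending I/O is assumed to be queued up front.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the queue state; not the real struct nvme_queue. */
struct stall_q {
	int nccbs;		/* total ccbs in the queue */
	int nccbs_avail;	/* counter checked before handing out a ccb */
	int inflight;		/* commands submitted and not yet completed */
	int pending;		/* I/O still waiting in the upper layer */
};

/* Models the counter check: fails (think EAGAIN) when the counter is 0. */
static bool
stall_get(struct stall_q *q)
{
	if (q->nccbs_avail == 0)
		return false;
	q->nccbs_avail--;
	return true;
}

/* Models the drain done from lddone(): start as much pending I/O as possible. */
static void
stall_drain(struct stall_q *q)
{
	while (q->pending > 0 && stall_get(q)) {
		q->pending--;
		q->inflight++;
	}
}

/* Models one completion pass; the counter is reset only after the loop. */
static void
stall_complete(struct stall_q *q)
{
	int finished = q->inflight;

	q->inflight = 0;
	for (int i = 0; i < finished; i++)
		stall_drain(q);		/* if all ccbs were taken, counter is 0: nothing starts */
	q->nccbs_avail = q->nccbs;	/* too late: no completion is left to drain */
}

/* Runs the model until nothing is in flight; returns the I/O left stranded. */
static int
stall_run(int nccbs, int requests)
{
	struct stall_q q = { nccbs, nccbs, 0, requests };

	assert(nccbs > 0 && requests >= 0);
	stall_drain(&q);
	while (q.inflight > 0)
		stall_complete(&q);
	return q.pending;
}
```

In this model, stall_run(32, 40) leaves 8 requests stranded: the counter is back at 32, but nothing is in flight any more to trigger another drain.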

Can you please try the patch from http://www.netbsd.org/~jdolecek/nvme_avail_put.diff ?

It's compile-tested only, so it might need some tweaks. The idea is to reset the ccb counter immediately, so lddone() is able to queue another I/O while the completion queue is still being processed. This should also fix a ccb leak on errors - e.g. nvme_ns_dobio() just calls nvme_ccb_put() when bus_dmamap_load() fails, so q_nccbs_avail stays decremented from nvme_ccb_get().
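
As a rough illustration of the idea (not the patch itself), here is a toy model in which the put side restores the counter right away. Everything named fix_* is hypothetical, and the assumption that the counter is bumped on every put is mine for the sake of the sketch; the diff may do it differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the queue state; not the real struct nvme_queue. */
struct fix_q {
	int nccbs;		/* total ccbs in the queue */
	int nccbs_avail;	/* counter checked before handing out a ccb */
	int inflight;		/* commands submitted and not yet completed */
	int pending;		/* I/O still waiting in the upper layer */
};

/* Models the counter check: fails (think EAGAIN) when the counter is 0. */
static bool
fix_get(struct fix_q *q)
{
	if (q->nccbs_avail == 0)
		return false;
	q->nccbs_avail--;
	return true;
}

/* Assumed fix: returning a ccb restores the counter immediately. */
static void
fix_put(struct fix_q *q)
{
	assert(q->nccbs_avail < q->nccbs);
	q->nccbs_avail++;
}

/* Models the drain done from lddone(): start as much pending I/O as possible. */
static void
fix_drain(struct fix_q *q)
{
	while (q->pending > 0 && fix_get(q)) {
		q->pending--;
		q->inflight++;
	}
}

/* One completion pass: each finished ccb is put back, then the drain runs. */
static void
fix_complete(struct fix_q *q)
{
	int finished = q->inflight;

	q->inflight = 0;
	for (int i = 0; i < finished; i++) {
		fix_put(q);
		fix_drain(q);	/* can start new I/O mid-pass now */
	}
}

/* Runs the model until nothing is in flight; returns the I/O left stranded. */
static int
fix_run(int nccbs, int requests)
{
	struct fix_q q = { nccbs, nccbs, 0, requests };

	fix_drain(&q);
	while (q.inflight > 0)
		fix_complete(&q);
	return q.pending;
}

/* Error path: get a ccb, fail before submitting, put it back. */
static int
fix_avail_after_failed_submit(int nccbs)
{
	struct fix_q q = { nccbs, nccbs, 0, 0 };

	if (fix_get(&q))
		fix_put(&q);	/* counter returns to its old value, no leak */
	return q.nccbs_avail;
}
```

Here fix_run(32, 40) finishes with nothing stranded, and a failed submit leaves the counter at its original value.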

Jaromir