tech-kern archive

"Cross-cancelling" of thread pool jobs



I'm moving this particular part of the threadpool follow-up discussion to tech-kern.

On Dec 26, 2018, at 10:04 AM, Taylor R Campbell <campbell+netbsd-source-changes-d@mumble.net> wrote:

+bool
+threadpool_cancel_job_async(threadpool_t *pool, threadpool_job_t *ext_job)
+{
+ threadpool_job_impl_t *job = THREADPOOL_JOB_TO_IMPL(ext_job);
+
+ KASSERT(mutex_owned(job->job_lock));
+
+ /*
+  * XXXJRT This fails (albeit safely) when all of the following
+  * are true:
+  *
+  * => "pool" is something other than what the job was
+  *    scheduled on.  This can legitimately occur if,
+  *    for example, a job is percpu-scheduled on CPU0
+  *    and then CPU1 attempts to cancel it without taking
+  *    a remote pool reference.  (this might happen by
+  *    "luck of the draw").

Under what circumstances can this happen?  This sounds like a bug that
we should KASSERT into oblivion if there's any danger of it.

Consider this hypothetical...

Let's say you have MyNiftyApplication, which from time to time wants to do some long-running chore that thread pools are well-suited for.  For its own reasons, it uses a percpu thread pool: the application designer determined that, at the time the decision is made to run the job, it should run on the CPU that did the deciding.  For the purposes of deciding to schedule the job again while the job is already in-progress, however, it is not critical that this be the case.
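
To make that concrete, here's roughly what the scheduling path might look like.  (The mynifty_* names and the softc layout are invented for illustration; the threadpool calls are the ones from the API as it appears in the patch under discussion.)

	struct mynifty_softc {
		kmutex_t		sc_lock;	/* also the job lock */
		threadpool_percpu_t	*sc_pool_percpu;
		threadpool_job_t	sc_job;
	};

	static void
	mynifty_schedule_chore(struct mynifty_softc *sc)
	{
		threadpool_t *pool;

		mutex_enter(&sc->sc_lock);

		/*
		 * Reference the current CPU's pool in the percpu cadre;
		 * the CPU that decided to run the job is the one that
		 * should run it.  (If the thread has migrated since the
		 * decision was made, which pool we get here is luck of
		 * the draw.)
		 */
		pool = threadpool_percpu_ref(sc->sc_pool_percpu);

		/*
		 * If the job is already scheduled (possibly on another
		 * CPU's pool), this is a no-op -- and the caller gets
		 * no indication either way.
		 */
		threadpool_schedule_job(pool, &sc->sc_job);

		mutex_exit(&sc->sc_lock);
	}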

Now let's say MyNiftyApplication is exiting, and so it wants to ensure its long-running chore is stopped.  So it calls threadpool_cancel_job().  Now, which individual pool in the percpu cadre should be passed to threadpool_cancel_job()?  The job itself doesn't directly track which pool it was last successfully scheduled on, and because threadpool_schedule_job() doesn't return any indication that the job was already scheduled, MyNiftyApplication can't reliably track it either.
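
The best it can do at this point is pass in whichever pool reference it can easily get -- say, the current CPU's.  Continuing the hypothetical sketch from above:

	static void
	mynifty_stop_chore(struct mynifty_softc *sc)
	{
		threadpool_t *pool;

		mutex_enter(&sc->sc_lock);

		/*
		 * This is the *current* CPU's pool.  If the job was
		 * last scheduled from some other CPU, it is not the
		 * pool the job is actually assigned to.
		 */
		pool = threadpool_percpu_ref(sc->sc_pool_percpu);

		/* Cancels if it can; otherwise waits for the job to finish. */
		threadpool_cancel_job(pool, &sc->sc_job);

		mutex_exit(&sc->sc_lock);
	}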

So, threadpool_cancel_job_async() does a best effort... If it knows the job has no assigned thread, ta-da, cancelled!  If its assigned thread is the specified pool's overseer, ta-da, cancelled!  But what it can't tell is whether the assigned thread is some *other* pool in the cadre's overseer, and so it falls into the already-running-must-wait case.
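
Paraphrased as code, the logic looks something like this.  (This is a sketch of the rest of the function whose comment is quoted above, not the literal source -- field names like job_thread, tp_overseer, and tp_jobs are assumed from the implementation context; check the actual tree.)

	bool
	threadpool_cancel_job_async(threadpool_t *pool, threadpool_job_t *ext_job)
	{
		threadpool_job_impl_t *job = THREADPOOL_JOB_TO_IMPL(ext_job);

		KASSERT(mutex_owned(job->job_lock));

		if (job->job_thread == NULL) {
			/* No assigned thread: guaranteed not running. */
			return true;
		} else if (job->job_thread == &pool->tp_overseer) {
			/*
			 * Assigned to this pool's overseer but not yet
			 * running: pull it off the queue before a worker
			 * can pick it up.
			 */
			job->job_thread = NULL;
			mutex_spin_enter(&pool->tp_lock);
			TAILQ_REMOVE(&pool->tp_jobs, job, job_entry);
			mutex_spin_exit(&pool->tp_lock);
			return true;
		} else {
			/*
			 * Either already running, or assigned to some
			 * *other* pool's overseer in the cadre -- we
			 * can't tell the difference, so the caller must
			 * wait.
			 */
			return false;
		}
	}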

I don't see how this is particularly harmful, as it will simply result in callers having to wait a little longer in a case where they might not otherwise have had to.  Job cancellation is inherently racy, which is why we sometimes have to tell callers to wait.  Considering this edge case still results in safe behavior, I don't think it's worth throwing extra bookkeeping at the problem.

-- thorpej


