sysmon_envsys race

To: tech-kern%netbsd.org@localhost
Subject: sysmon_envsys race
From: Julian Coleman <jdc%coris.org.uk@localhost>
Date: Thu, 29 Oct 2020 21:01:51 +0100

Hi all,

While testing changes to an envsys driver, I saw this crash on shutdown:

  [ 1651.0108940] cpu0: data fault: pc=155ea68 rpc=101db8ca4 addr=0
  [ 1651.0108940] kernel trap 30: data access exception
  Stopped in pid 0.5 (system) at  netbsd:mutex_oncpu.part.0+0x8:  ldx     [%g1 + 0x18], %g2
  db{0}> bt
  sme_events_check(101db6718, 101d8a041, 0, 1c63348, 101db6640, 101d8a040) at netbsd:sme_events_check+0xc

This is:
  line 739	mutex_enter(&sme->sme_work_mtx);

The driver runs sysmon_envsys_destroy() in its detach routine.  Looking at
the code, it looks like that could race with sme_events_check() whilst the
sme sensors list is being removed: both start by checking that
sme != NULL, but sysmon_envsys_destroy() could remove the sme structure
whilst sme_events_check() is running.  I'm guessing that's what happened
in the above case.  Note that I only saw this once in about 50 reboots,
so it's quite rare.

It seems sensible to take the sme_mtx in sysmon_envsys_destroy(), but
that just reduces the window: sme_events_check() checks sme != NULL, and
the mutexes are part of the sme structure that we want to remove.
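
To illustrate why a lock embedded in the structure cannot protect that
structure's own teardown, here is a minimal userspace sketch using POSIX
threads.  It is an analogy only, not the sysmon_envsys code: struct dev,
registry_mtx, registered and the dev_*() functions are all hypothetical
names.  The idea shown is that the existence check and the teardown are
both done under a lock that outlives the structure.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the sme structure; not the NetBSD definition. */
struct dev {
	pthread_mutex_t work_mtx;	/* embedded lock, dies with the struct */
	int events_checked;
};

/* External lock and registry slot: both outlive any single struct dev. */
static pthread_mutex_t registry_mtx = PTHREAD_MUTEX_INITIALIZER;
static struct dev *registered;

struct dev *
dev_create(void)
{
	struct dev *d = calloc(1, sizeof(*d));

	if (d == NULL)
		return NULL;
	pthread_mutex_init(&d->work_mtx, NULL);
	pthread_mutex_lock(&registry_mtx);
	registered = d;
	pthread_mutex_unlock(&registry_mtx);
	return d;
}

/* Periodic check: the NULL test and the work happen under registry_mtx. */
bool
dev_events_check(void)
{
	bool ran = false;

	pthread_mutex_lock(&registry_mtx);
	if (registered != NULL) {
		pthread_mutex_lock(&registered->work_mtx);
		registered->events_checked++;
		pthread_mutex_unlock(&registered->work_mtx);
		ran = true;
	}
	pthread_mutex_unlock(&registry_mtx);
	return ran;
}

void
dev_destroy(struct dev *d)
{
	/* Unpublish first; any checker in progress finishes before we get the lock. */
	pthread_mutex_lock(&registry_mtx);
	if (registered == d)
		registered = NULL;
	pthread_mutex_unlock(&registry_mtx);

	/* No checker can reach d any more, so the embedded lock can go. */
	pthread_mutex_destroy(&d->work_mtx);
	free(d);
}
```

Holding the outer lock across the whole check is coarse; it is done here
only to keep the sketch short.  Whether something equivalent fits the
real sysmon_envsys locking would need checking against the code.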

There is code in sysmon_envsys_sensor_detach() which removes callouts,
so a better solution might be to call sysmon_envsys_sensor_detach() from
sysmon_envsys_destroy(), or audit every driver to check that this is done.
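
As a rough userspace sketch of that ordering (stop the periodic work and
wait for it to finish, and only then free the structure), again with
POSIX threads.  A ticker thread stands in for the callout; struct dev,
ticker_main and the dev_*() functions are hypothetical names and do not
correspond to the real sysmon_envsys interfaces.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical device; the ticker thread stands in for a periodic callout. */
struct dev {
	pthread_t ticker;
	bool ticker_running;	/* only touched by the owning thread */
	atomic_bool stop;
	atomic_int ticks;
};

static void *
ticker_main(void *arg)
{
	struct dev *d = arg;

	/* Periodic work: runs until asked to stop. */
	while (!atomic_load(&d->stop)) {
		atomic_fetch_add(&d->ticks, 1);
		usleep(1000);
	}
	return NULL;
}

struct dev *
dev_create(void)
{
	struct dev *d = calloc(1, sizeof(*d));

	if (d == NULL)
		return NULL;
	atomic_init(&d->stop, false);
	atomic_init(&d->ticks, 0);
	if (pthread_create(&d->ticker, NULL, ticker_main, d) != 0) {
		free(d);
		return NULL;
	}
	d->ticker_running = true;
	return d;
}

/* Stop the periodic work and wait until it has finished.  Idempotent. */
void
dev_sensor_detach(struct dev *d)
{
	if (!d->ticker_running)
		return;
	atomic_store(&d->stop, true);
	pthread_join(d->ticker, NULL);
	d->ticker_running = false;
}

/* Destroy always detaches first, so a driver that forgot to is still safe. */
void
dev_destroy(struct dev *d)
{
	dev_sensor_detach(d);
	free(d);
}
```

Making the detach step idempotent means drivers that already detach
their sensors explicitly are unaffected, while those that do not are
covered by the destroy path.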

Any other solution appreciated.

Regards,

Julian
