Port-vax archive


Re: Race in MSCP (ra/rx) driver




> On Aug 26, 2020, at 2:13 PM, Mark Pizzolato - Info Comm <Mark%infocomm.com@localhost> wrote:
> 
> On Wednesday, August 26, 2020 at 8:48 AM, Mouse wrote:
>> rx_putonline(struct rx_softc *rx)
>> {
>> ...
>>        /* Poll away */
>>        bus_space_read_2(mi->mi_iot, mi->mi_iph, 0);
>>        if (tsleep(&rx->ra_state, PRIBIO, "rxonline", 100*100))
>>                rx->ra_state = DK_CLOSED;
>> 
>> In 1.4T, this code runs at IPL 0.  Unless it is called at elevated IPL in -current,
>> there is a race: if the operation completes and interrupts before tsleep puts
>> the thread to sleep, it will lose - it will sleep for 10000 ticks and then fail.
>> Presumably most real hardware isn't that fast, but something like an MSCP
>> interface backed by RAM or, in my case, a simulation, can trip this race.  (simh
>> has comments saying, to rephrase from memory, that VMS works with an
>> infinitely fast MSCP disk, but the BSDs don't - this is likely part of what's
>> behind the latter.)
> 
> Simh may have a comment like that, but the MSCP device simulator does 
> not implement infinitely fast I/O completion.  This is true for most devices, 
> in most simulators, not just MSCP.  Software was written for real hardware 
> and none of that completed I/O in 0 time, so simh's common device 
> model provides a mechanism to specify a number of instructions to delay 
> before I/O completion is signaled.  The MSCP device simulation supports
> true asynchronous I/O, and for this device the simulation framework still 
> provides a concept of "at least n instruction delay" before I/O completion.

Fair enough.  But as a matter of good software engineering, drivers should not rely on operations taking some minimal non-zero amount of time.  There may be a few rare cases where such assumptions are unavoidable, but if -- as here -- they can be avoided, they should be.

For one thing, if code runs at IPL 0, things may appear to happen "in zero time" not because they actually did, but because some other interrupt that happened to hit right at that point sucked up the time needed for the I/O completion.  Errors of that kind will result in intermittent failures under load.

I think Mouse made a good change here.

	paul


