port-arm: Re: port-shark/22355 [was: Help needed to fix NetBSD/shark]

Subject: Re: port-shark/22355 [was: Help needed to fix NetBSD/shark]
To: Izumi Tsutsui <tsutsui@ceres.dti.ne.jp>
From: Chris Gilbert <chris@dokein.co.uk>
List: port-arm
Date: 08/06/2007 23:35:22
Izumi Tsutsui wrote:
> jmmv84@gmail.com wrote:
> 
>>>> So these patches are just hiding an underlying bug somewhere else by
>>>> reseting the spl_mask after interrupt routines are called, however  
>>>> the
>>>> mask should have been reset back by an splx call.
>>> The attached one also seems to fix the problem.
>>> (use spl_masks[current_spl_level] rather than spl_mask in  
>>> irq_setmask())
>> Indeed.  I've been running the machine under load (building some  
>> packages) with your patch applied and it has been flawless.
>>
>> The only thing that worries me (based on Chris' comments) is that  
>> this might be hiding some other problem instead of fixing the real  
>> root cause.  But I don't know enough to see if that's the case or not...
> 
> ---
> 
> With my first patch, spl_mask is restored properly between (5) and (6).
> 
> With my latter patch, spl_mask is no longer used in irq_setmasks()
> so interrupts are not blocked. The spl_mask variable might be
> restored further splfoo()/splx(s) pairs before switching to cpu_idle().
> (spl_mask is no longer referred anyway in that case)
> 
> I'm not sure why your first patch doesn't work though.

That all makes sense, can you commit your fix?  Certianly for the
original issue where interrupts stop entirely, the fix to always look up
the mask is the best fix for now.

The issue jmmv mentions (my comments) is that I was puzzled that an
interrupt handler was returning returns with an SPL higher it started at.

Having spoken with ad@ about this, it's an expected thing due to how
mutex locks are nested, as the cpu_info_store a counter of mutexes in
use, and the original spl at entry to the first one.  This means that
the last mutex out will reset the spl level correctly, it's something
that I didn't expect, but it's how things currently work.

Basically, it's something like:
(1) in kernel, running some code:
   -> current_spl_level=IPL_NONE
   -> ci->ci_mtx_count=0, ci->ci_mtx_oldspl=xxxx

(2) calls mutex_spin_enter at IPL_VM (eg uvm_hashlocks)
   -> current_spl_level=IPL_VM
   -> ci->ci_mtx_count=-1, ci->ci_mtx_oldspl=IPL_NONE

(3) clock interrupt and a disk interrupt occur and irq_entry() is
called, we figure out there's a clock irq, and adjust the spl:
   -> current_spl_level = IPL_CLOCK
   -> ci->ci_mtx_count=-1, ci->ci_mtx_oldspl=IPL_NONE

(4) clockintr() calls hardclock, which calls callout_hardclock, which
takes the callout_lock mutex at IPL_SCHED (IPL_HIGH)
   -> current_spl_level = IPL_HIGH
   -> ci->ci_mtx_count=-2, ci->ci_mtx_oldspl=IPL_NONE

(5) callout_hardclock releases the mutex, but it's not the first mutex
so doesn't adjust IPL, and so clockintr returns to irq_entry as:
   -> current_spl_level = IPL_HIGH
   -> ci->ci_mtx_count=-1, ci->ci_mtx_oldspl=IPL_NONE

(6) irq_entry doesn't adjust the IPL yet, and calls the disk interrupt
handler (note that it probably used to do this at IPL_CLOCK as the irq
code is very poor at ipl priority handling)
   -> current_spl_level = IPL_HIGH
   -> ci->ci_mtx_count=-1, ci->ci_mtx_oldspl=IPL_NONE

(7) the disk interrupt routine exits:
   -> current_spl_level = IPL_HIGH
   -> ci->ci_mtx_count=-1, ci->ci_mtx_oldspl=IPL_NONE

(7) on exit from irq_entry the IPL level is reset back to that which it
was on entry:
   -> current_spl_level=IPL_VM
   -> ci->ci_mtx_count=-1, ci->ci_mtx_oldspl=IPL_NONE

(8) kernel code finishes off what it was doing, and now calls
mutex_spin_exit:
   -> current_spl_level=0
   -> ci->ci_mtx_count=0, ci->ci_mtx_oldspl=IPL_NONE

The reason this came up was that I felt paranoid and added code to check
that the IPL level on exit from an irq handler was the same as it was on
entry.  In this scenario it's not the case.

Speaking with ad@ he said at one point the code did save and restore the
mutex state info, but he didn't feel the extra complexity was necessary
as it would all work itself out, which it eventually does.

My lingering concern is that the IPL level will be raised and block
interrupts for longer than it should.  Perhaps this is more a concern on
slower hardware?  Or where the MD interrupt code doesn't start with the
highest IPLs and work down through the pending interrupts, so isn't
reseting the ipl level after every interrupt routine is called.

Thanks,
Chris