Subject: Re: common irq handler code
To: Steve Woodford <scw@netbsd.org>
From: Chris Gilbert <chris@dokein.co.uk>
List: port-arm
Date: 04/11/2007 01:02:01
This is a multi-part message in MIME format.
--------------060301010207060303060203
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Steve Woodford wrote:
> On Monday 09 April 2007 08:01, Toru Nishimura wrote:
>> Chris Gilbert says;
>>
>> [ ... proposal to add "common irq code path" ... ]
>>
>>> So instead I've taken the footbridge irqhandler code, and made it a
>>> common file. (The footbridge version was based on Jason's work for
>>> xscale platforms)
>> I agree that providing "the common" is a better idea, but it
>> should be emphasised that it has limitations which might be inadequate
>> for some cases.
> 
> I'll echo that. While the common code may be useful during early bring-up 
> of a new port, there is no substitute for an implementation targeted
> directly at the flavour of interrupt controller. This code is critical 
> to good performance. In particular, the spl(9) implementation must be as 
> small as possible. I have a few comments...
> 
> 1. Consider the proposed arm_intr_splx() implementation. Right at the 
> start there are a whole bunch of 'extern foo' declarations. Each one of 
> those variables will generate, on ARM, a PC-relative load just to find 
> the address of the variable. This is all inlined code, so even small 
> functions like this contrived example:
> 
> {
>  extern volatile int x;
>  int s = splbio();
>  x = 0;
>  splx(s);
> }
> 
> when compiled will be very much larger than you'd expect. Consider moving 
> some state into cpu_info.

I guess the common parts could be moved into cpu_info, e.g.
current_spl_level, which would save the PC-relative address load for
each variable.
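
That is, something like this (just a sketch, matching the test struct in
the attached file), so a single base-address load serves all the spl state:

struct cpu_info {
	volatile unsigned int ipls_pending;	/* bit N set => IPL N pending */
	volatile int current_spl_level;
	/* ... existing members ... */
};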

> 2. As you noted, the 'mask' approach whereby pending interrupts are 
> recorded in a mask which mirrors the hardware's mask registers does not 
> scale well with more than 32 individual interrupt sources. It also 
> becomes extremely unwieldy in splx(9) where you'd have to logically AND 
> multiple 32-bit values to determine if an interrupt is pending. Instead, 
> on CPUs with an ffs(2)-like instruction ("clz" for example) you can use 
> a single 32-bit variable to record pending interrupts per IPL_*. Then 
> splx(9) can simply use "clz" to determine if it needs to invoke the 
> 'dispatch pending' heavy lifter. You also greatly reduce the number of 
> external variable references in splx(9).

Certainly that sounds like a good idea.  In fact, using the ipl levels as
a bitmask means that the test actually becomes a simple unsigned compare:
shift 1 left by the new ipl, and if the pending-ipls value is >= the
shifted value, we have something to handle:

void
splx(int spl)
{
    extern volatile uint32_t ipls_pending;
    extern volatile int current_spl_level;
    uint32_t iplvalue = (1 << spl);

    /* Don't let the compiler re-order this code with preceding code */
    __insn_barrier();
    current_spl_level = spl;

    if (ipls_pending >= iplvalue)
        splx_heavy_lifter(spl);
}

would be good enough, without doing the ffs/clz thing?  In fact ARM can
use its barrel shifter to great effect, and the compare ends up as:

	mov	r2, #1			@ r2 = 1
	cmp	r1, r2, asl r0		@ ipls_pending vs (1 << ipl)
	movcc	pc, lr			@ nothing pending at/above ipl
	b	splx_heavy_lifter	@ otherwise go do the work

where r1 is ipls_pending and r0 is ipl.  Add in the double load for
ipls_pending and that's 5 instructions, plus another 2 for storing
current_spl_level, so 7 instructions is probably the smallest I'll get it.

It also allows the spl routines to remain common, while the heavy lifter
varies depending on the target hardware.
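
Roughly, the heavy lifter might look something like this (just a sketch;
dispatch_irqs_at() is a made-up placeholder for the port-specific
dispatch, and real code would raise the spl while running handlers):

void
splx_heavy_lifter(int spl)
{
	extern volatile uint32_t ipls_pending;
	extern void dispatch_irqs_at(int level);	/* hypothetical */

	/* run the highest pending level first, working downwards */
	while (ipls_pending >= (1U << spl)) {
		int level = 31 - __builtin_clz(ipls_pending);

		ipls_pending &= ~(1U << level);
		dispatch_irqs_at(level);
	}
}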

> Using the above two suggestions in tandem, the spl(9) implementation will 
> be much smaller, the code will scale to multiple hardware masks much 
> more easily, and you may find some more opportunities to factor out 
> common/shared code.

You're right, I'll look into combining things.  I think current_spl_level
and ipls_pending would make sense in the cpu_info (well, for now; SMP on
arm11 would need something different :)

A quick experiment shows that all of the above combined makes:
void test_routine2()
{
        extern volatile int b;
        int oldstate = arm_intr_splraise(10);
        b = 7;
        arm_intr_splx2(oldstate);
}

come out as 13 instructions, the only branches being the conditional
return and the tail call to the heavy lifter:
test_routine2:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r2, .L6
        mov     r3, #10
        ldr     r0, [r2, #4]
        @ lr needed for prologue
        str     r3, [r2, #4]
        ldr     r3, .L6+4
        mov     r1, #7
        str     r1, [r3, #0]
        str     r0, [r2, #4]
        ldr     r1, [r2, #0]
        mov     r3, #1
        cmp     r1, r3, asl r0
        movcc   pc, lr
        b       splx_heavy_lifter
.L7:
        .align  2
.L6:
        .word   cpu_info
        .word   b
        .size   test_routine2, .-test_routine2

For this simple test case, a basic b = 7 takes 4 instructions, so that's
an overhead of 9 instructions for the spl()/splx() pair, which I think is
probably as tight as I'll get it.

Previously the shortest path was 25 instructions :)

Attached is my test .c file, and .s file.  Real code will probably be
very different.

Thanks,
Chris

--------------060301010207060303060203
Content-Type: text/plain;
 name="arm_intr_test.s"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="arm_intr_test.s"

	.file	"arm_intr_test.c"
	.text
	.align	2
	.global	test_routine_raw
	.type	test_routine_raw, %function
test_routine_raw:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r3, .L3
	mov	r2, #7
	@ lr needed for prologue
	str	r2, [r3, #0]
	mov	pc, lr
.L4:
	.align	2
.L3:
	.word	b
	.size	test_routine_raw, .-test_routine_raw
	.align	2
	.global	test_routine2
	.type	test_routine2, %function
test_routine2:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r2, .L10
	mov	r3, #10
	ldr	r0, [r2, #4]
	@ lr needed for prologue
	str	r3, [r2, #4]
	ldr	r3, .L10+4
	mov	r1, #7
	str	r1, [r3, #0]
	str	r0, [r2, #4]
	ldr	r1, [r2, #0]
	mov	r3, #1
	cmp	r1, r3, asl r0
	movcc	pc, lr
	b	splx_heavy_lifter
.L11:
	.align	2
.L10:
	.word	cpu_info
	.word	b
	.size	test_routine2, .-test_routine2
	.align	2
	.global	test_routine
	.type	test_routine, %function
test_routine:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 1, uses_anonymous_args = 0
	ldr	r2, .L20
	mov	ip, sp
	mov	r3, #10
	stmfd	sp!, {r4, r5, fp, ip, lr, pc}
	ldr	r5, [r2, #4]
	sub	fp, ip, #4
	str	r3, [r2, #4]
	ldr	r3, .L20+4
	mov	r2, #7
	str	r2, [r3, #0]
	ldr	r3, .L20+8
	ldr	r2, .L20+12
	ldr	r1, .L20+16
	str	r5, [r3, #0]
	ldr	r0, [r2, #0]
	ldr	r3, [r1, r5, asl #2]
	mov	ip, #128
	bics	lr, r0, r3
	bne	.L18
.L13:
	ldr	r3, .L20+20
	ldr	r2, .L20+24
	ldr	r1, [r3, #0]
	ldr	r3, [r2, r5, asl #2]
	bics	r3, r1, r3
	bne	.L19
	ldmfd	sp, {r4, r5, fp, sp, pc}
.L19:
	ldmfd	sp, {r4, r5, fp, sp, lr}
	b	arm_do_pending_soft_intr
.L18:
	mrs	r4, cpsr
	bic	r3, r4, ip
	eor	r3, r3, ip
	msr	cpsr_c, r3

	ldr	r1, .L20+28
	ldr	r2, .L20+32
	ldr	r3, [r1, #0]
	ldr	r0, [r2, #0]
	orr	r3, r3, lr
	str	r3, [r1, #0]
	ldr	r2, [r1, #0]
	mvn	r0, r0
	and	r0, r0, r2
	bl	arm_hardware_set_irq_mask
	and	r4, r4, #192
	mov	r3, #192
	mrs	r1, cpsr
	bic	r2, r1, r3
	eor	r2, r2, r4
	msr	cpsr_c, r2

	b	.L13
.L21:
	.align	2
.L20:
	.word	cpu_info
	.word	b
	.word	current_spl_level
	.word	arm_intr_pending
	.word	arm_imask
	.word	arm_soft_intr_pending
	.word	arm_simask
	.word	intr_enabled
	.word	intr_soft_disabled
	.size	test_routine, .-test_routine
	.ident	"GCC: (GNU) 4.1.2 20070110 (prerelease) (NetBSD nb1 20070110)"

--------------060301010207060303060203
Content-Type: text/plain;
 name="arm_intr_test.c"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="arm_intr_test.c"

/* 	$NetBSD: footbridge_intr.h,v 1.10 2007/03/09 06:45:20 thorpej Exp $	*/

/*
 * Copyright (c) 2001, 2002 Wasabi Systems, Inc.
 * All rights reserved.
 *
 * Written by Jason R. Thorpe for Wasabi Systems, Inc.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. All advertising materials mentioning features or use of this software
 *    must display the following acknowledgement:
 *	This product includes software developed for the NetBSD Project by
 *	Wasabi Systems, Inc.
 * 4. The name of Wasabi Systems, Inc. may not be used to endorse
 *    or promote products derived from this software without specific prior
 *    written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY WASABI SYSTEMS, INC. ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
 * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL WASABI SYSTEMS, INC
 * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */
#include <stdint.h>
#define NIPL 14

#define I32_bit (1 << 7)        /* IRQ disable */
#define F32_bit (1 << 6)        /* FIQ disable */

static __inline uint32_t __set_cpsr_c(uint32_t bic, uint32_t eor) __attribute__((__unused__));

static __inline uint32_t
__set_cpsr_c(uint32_t bic, uint32_t eor)
{
	uint32_t       tmp, ret;

	__asm volatile(
			"mrs     %0, cpsr\n"    /* Get the CPSR */
			"bic     %1, %0, %2\n"  /* Clear bits */
			"eor     %1, %1, %3\n"  /* XOR bits */
			"msr     cpsr_c, %1\n"  /* Set the control field of CPSR */
			: "=&r" (ret), "=&r" (tmp)
			: "r" (bic), "r" (eor) : "memory");
	
	return ret;
}

#define disable_interrupts(mask)                                        \
	(__set_cpsr_c((mask) & (I32_bit | F32_bit), \
		      (mask) & (I32_bit | F32_bit)))

#define enable_interrupts(mask)                                         \
	(__set_cpsr_c((mask) & (I32_bit | F32_bit), 0))

#define restore_interrupts(old_cpsr)                                    \
	(__set_cpsr_c((I32_bit | F32_bit), (old_cpsr) & (I32_bit | F32_bit)))


static inline void __attribute__((__unused__))
arm_intr_splx(int newspl)
{
	extern volatile uint32_t intr_enabled;
	extern uint32_t intr_soft_disabled;
	extern volatile int current_spl_level;
	extern volatile uint32_t arm_intr_pending;
	extern void arm_do_pending_soft_intr(void);
	extern void arm_hardware_set_irq_mask(uint32_t);
	extern uint32_t arm_imask[NIPL];
	extern uint32_t arm_simask[NIPL];
	extern volatile uint32_t arm_soft_intr_pending;
	int oldirqstate, hwpend;

	/* Don't let the compiler re-order this code with preceding code */
	__insn_barrier();

	current_spl_level = newspl;

	/* unmask any interrupts that we can now handle */
	hwpend = (arm_intr_pending & ~arm_imask[newspl]);
	if (hwpend != 0) {
		oldirqstate = disable_interrupts(I32_bit);
		intr_enabled |= hwpend;
		arm_hardware_set_irq_mask(intr_enabled & ~intr_soft_disabled);
		restore_interrupts(oldirqstate);
	}

	if (arm_soft_intr_pending & ~arm_simask[newspl])
		arm_do_pending_soft_intr();
}

struct test_cpu_info
{
	volatile unsigned int ipls_pending;
	volatile int current_spl_level;
};

extern struct test_cpu_info cpu_info;
	

static inline void __attribute__((__unused__))
arm_intr_splx2(int spl)
{
	extern void splx_heavy_lifter(void);
	unsigned int iplvalue = (1 << spl);

	/* Don't let the compiler re-order this code with preceding code */
	__insn_barrier();
	cpu_info.current_spl_level = spl;

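	/*
	 * A bit set at position >= spl makes ipls_pending compare
	 * >= (1 << spl), so one unsigned compare checks every level
	 * at or above the new spl.
	 */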
	if (cpu_info.ipls_pending >= iplvalue)
		splx_heavy_lifter();

	return;
}

static inline int __attribute__((__unused__))
arm_intr_splraise(int ipl)
{
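	/*
	 * Raising the level only masks more interrupts; nothing can
	 * become newly runnable, so no dispatch and no hardware
	 * access is needed here.
	 */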
	int	old;

	old = cpu_info.current_spl_level;
	cpu_info.current_spl_level = ipl;

	/* Don't let the compiler re-order this code with subsequent code */
	__insn_barrier();

	return (old);
}

static inline int __attribute__((__unused__))
arm_intr_spllower(int ipl)
{
	int old = cpu_info.current_spl_level;

	arm_intr_splx(ipl);
	return(old);
}

void
test_routine()
{
	extern volatile int b;
	int oldstate = arm_intr_splraise(10);
	b = 7;
	arm_intr_splx(oldstate);
}

		
void
test_routine2()
{
	extern volatile int b;
	int oldstate = arm_intr_splraise(10);
	b = 7;
	arm_intr_splx2(oldstate);
}

void test_routine_raw()
{
	extern volatile int b;
	b = 7;
}


--------------060301010207060303060203--