Subject: i386 intr rewrite (integrating the MP case)
To: None <tech-smp@netbsd.org, port-i386@netbsd.org>
From: Frank van der Linden <fvdl@wasabisystems.com>
List: tech-smp
Date: 08/13/2002 16:24:16
As you may know, I have been working on rewriting the i386 intr
code for a while. I seem to be stuck at the moment, so I'd be very
thankful to anyone who has suggestions on how to proceed.

The code is in ftp://ftp.netbsd.org/pub/NetBSD/misc/fvdl/i386intr.tgz;
it's a complete tar of the sys/arch/i386 directory. Below is a description.
Note that the code will probably not drop into -current, because
of some recent changes that I have not synced yet.

Short introduction:

The reason for this rewrite was that there were some problems with the
i386 interrupt code. It was scattered over several files in the wrong
places, the MP code (local apic) used the TPR, but accessing it turned
out to be slow, etc. I've made an attempt to unify the code and
have it all use the same 'soft' priority mechanism, similar to
what is used in the UP case now.

A quick description:

* In the most generic case, the number of interrupt sources is limited
  to 224 (256 IDT entries, 32 taken by exceptions) per CPU. In our case,
  we have one IDT, which makes for 224 total. It seems like total
  overkill to consider > 224 interrupt sources, so I'm going to skip
  the per-CPU IDT case. To make every distribution of interrupts over
  CPUs possible, there should be 224 interrupt sources (and a mask
  for 224 sources, etc) per CPU. This is still overkill. I settled
  on having a maximum of 32 interrupts (well, 32 - 3 softints - 2 IPIs)
  per CPU. On all systems that I currently know of, this will still
  allow you to route all interrupts to one CPU; I have not seen
  systems with > 29 distinct interrupt sources.

  32 interrupt sources allows for 32-bit pending and mask words.
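
  A minimal sketch of why 32 sources are convenient: each source gets one
  bit, so "pending" and "mask" checks are single-word operations. The
  helper names below are hypothetical and not taken from the tarball.

```c
#include <stdint.h>

#define MAX_INTR_SOURCES 32	/* one bit per source, as described above */

/* Hypothetical helper: return 'pending' with the bit for 'slot' set. */
static inline uint32_t
intr_mark_pending(uint32_t pending, int slot)
{
	return pending | (UINT32_C(1) << slot);
}

/*
 * Hypothetical helper: return the highest-numbered slot that is both
 * pending and allowed by 'unmask', or -1 if there is none.
 */
static inline int
intr_next_runnable(uint32_t pending, uint32_t unmask)
{
	uint32_t ready = pending & unmask;
	int slot;

	for (slot = MAX_INTR_SOURCES - 1; slot >= 0; slot--)
		if (ready & (UINT32_C(1) << slot))
			return slot;
	return -1;
}
```

  With more than 32 sources per CPU, both helpers would need multi-word
  bitmaps and the checks would no longer be a single AND.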

* Each interrupt source is described by the following structure:

struct intrsource {     
        struct intrhand *is_handlers;   /* linked list of registered handlers */
        struct pic *is_pic;		/* originating PIC */
        int is_flags;			/* see below */
        int is_pin;			/* IRQ for legacy; pin for IO APIC */
        int is_type;			/* level, edge */
        void *is_entry;			/* stub code pointer */
        void *is_recurse;		/* resume and recurse points, see */
        void *is_resume;		/* current code */
};

#define IS_LEGACY       0x0001          /* legacy ISA irq source */
#define IS_IPI          0x0002          /* IPI source */

* The cpu_info structure is extended with the following fields:

        struct intrsource *ci_isources[MAX_INTR_SOURCES];

        u_int32_t       ci_ipending;            /* pending for this CPU */
        int             ci_ilevel;              /* soft tpr */
        u_int32_t       ci_imask[NIPL];         /* as before */
        u_int32_t       ci_iunmask[NIPL];       /* as before */
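
  To illustrate how these per-CPU fields can implement a 'soft' priority
  level, here is a rough sketch. It assumes that ci_iunmask[level] holds
  the bits of the sources allowed to run at that level; the struct and
  function names are hypothetical, the NIPL value is made up, and the
  real code re-enters the stubs instead of returning a bit set.

```c
#include <stdint.h>

#define NIPL 8	/* assumption: the real value is defined by the port */

/* Only some of the fields quoted above; the real cpu_info has many more. */
struct cpu_info_sketch {
	uint32_t ci_ipending;		/* pending for this CPU */
	int      ci_ilevel;		/* soft tpr */
	uint32_t ci_iunmask[NIPL];	/* sources runnable at each level */
};

/* Raise the soft priority level (never lowers it); returns the old level. */
static int
sketch_splraise(struct cpu_info_sketch *ci, int nlevel)
{
	int olevel = ci->ci_ilevel;

	if (nlevel > olevel)
		ci->ci_ilevel = nlevel;
	return olevel;
}

/*
 * Set the level to 'nlevel' and return the pending sources that are
 * runnable at that level. No hardware register (TPR) is touched.
 */
static uint32_t
sketch_spllower(struct cpu_info_sketch *ci, int nlevel)
{
	ci->ci_ilevel = nlevel;
	return ci->ci_ipending & ci->ci_iunmask[nlevel];
}
```

  The point of the sketch is that raising and lowering the level are
  plain memory operations on CPU-local data.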

* cpu_info_primary (the boot cpu, and only CPU on UP systems) has its
  first 16 entries in ci_isources initially reserved for legacy IRQ
  handlers.  For an MP system, establishing an interrupt with MP mapping
  attached to the ISA bus will also use these [should probably
  extend that to "anything that maps a legacy IRQ", that can be done
  with minimal change].

* IDT entries are allocated dynamically in the range 0x20-0xf0.
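
  A first-fit allocator is one simple way to hand out IDT vectors
  dynamically. The sketch below is only an illustration: the function
  names and the allocation map are hypothetical, and 0xf0 is treated as
  an exclusive upper bound, which is an assumption.

```c
#include <stdint.h>

#define IDT_INTR_LOW	0x20	/* first vector after the 32 exceptions */
#define IDT_INTR_HIGH	0xf0	/* assumption: exclusive upper bound */

static uint8_t idt_used[256];	/* hypothetical map, 1 = vector in use */

/* Return a free vector in [IDT_INTR_LOW, IDT_INTR_HIGH), or -1 if none. */
static int
sketch_idt_vec_alloc(void)
{
	int vec;

	for (vec = IDT_INTR_LOW; vec < IDT_INTR_HIGH; vec++) {
		if (!idt_used[vec]) {
			idt_used[vec] = 1;
			return vec;
		}
	}
	return -1;
}

/* Release a vector obtained from sketch_idt_vec_alloc(). */
static void
sketch_idt_vec_free(int vec)
{
	if (vec >= IDT_INTR_LOW && vec < IDT_INTR_HIGH)
		idt_used[vec] = 0;
}
```

  First-fit ignores software priority when picking a vector; an
  allocator that has to keep vectors ordered by priority would need a
  different policy.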

* All PIC devices and faked devices (ioapic, local apic, i8259)
  have a softc structure that starts with a 'struct pic'. It looks like
  this:

struct pic {
        struct device pic_dev;
        int pic_type;
        struct simplelock pic_lock;
        void (*pic_hwmask)(struct pic *, int);
        void (*pic_hwunmask)(struct pic *, int);
        void (*pic_addroute)(struct pic *, struct cpu_info *, int, int, int);
        void (*pic_delroute)(struct pic *, struct cpu_info *, int, int, int);
        struct intrstub *pic_level_stubs;
        struct intrstub *pic_edge_stubs;
};

* An interrupt distribution mechanism is possible, but currently
  the code just fills up cpu0 first, then takes cpu1, etc.
  This is localized in one function.
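
  The "fill up cpu0 first" policy could look roughly like the sketch
  below. The struct, its fields and the function name are hypothetical;
  the per-CPU capacity is passed in rather than hard-coded.

```c
#include <stddef.h>

/* Hypothetical per-CPU bookkeeping: how many source slots are taken. */
struct cpu_sketch {
	int nsources;
};

/*
 * Pick the first CPU that still has a free slot and claim it.
 * Returns the CPU index, or -1 if every CPU is full.
 */
static int
sketch_pick_cpu(struct cpu_sketch *cpus, size_t ncpu, int capacity)
{
	size_t i;

	for (i = 0; i < ncpu; i++) {
		if (cpus[i].nsources < capacity) {
			cpus[i].nsources++;
			return (int)i;
		}
	}
	return -1;
}
```

  Because the choice lives in one function, a smarter policy (for
  example round-robin) would only have to replace the loop body.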

* The stub code is generated by CPP magic. There are level and edge
  stubs for each pic. If MULTIPROCESSOR is defined, the stubs grab
  the kernel lock (possibly recursively) before calling the handler(s).

* Handlers are in a list, sorted by priority. The stub walks the list
  of handlers, calling them as long as their priority is higher than
  the currently set one.
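
  In C rather than stub assembly, the handler walk might be sketched as
  follows. The struct layout and all names are hypothetical stand-ins
  for 'struct intrhand'; the list is kept sorted from highest to lowest
  priority so the walk can stop at the first handler that is blocked.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct intrhand. */
struct intrhand_sketch {
	int (*ih_fun)(void *);		/* handler function */
	void *ih_arg;			/* handler argument */
	int ih_level;			/* priority of this handler */
	struct intrhand_sketch *ih_next;
};

/* Trivial example handler: counts how often it was called. */
static int
sketch_count_handler(void *arg)
{
	++*(int *)arg;
	return 1;
}

/* Insert 'ih' so the list stays sorted by descending priority. */
static void
sketch_insert(struct intrhand_sketch **head, struct intrhand_sketch *ih)
{
	struct intrhand_sketch **p;

	for (p = head; *p != NULL && (*p)->ih_level >= ih->ih_level;
	    p = &(*p)->ih_next)
		continue;
	ih->ih_next = *p;
	*p = ih;
}

/*
 * Call handlers while their priority is above 'curlevel'.
 * Returns the number of handlers that were called.
 */
static int
sketch_run(struct intrhand_sketch *head, int curlevel)
{
	struct intrhand_sketch *ih;
	int n = 0;

	for (ih = head; ih != NULL && ih->ih_level > curlevel;
	    ih = ih->ih_next) {
		(void)(*ih->ih_fun)(ih->ih_arg);
		n++;
	}
	return n;
}
```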

* Stub number N (as generated, currently named XintrN) uses N as the
  index in the intrsource array. For legacy IRQs (i8259 stubs),
  N == IRQ, so the hardware masking can be done using the old
  macro-generated asm code. For others, it's done via
  ci_isources[N]->is_pic->pic_hwmask / pic_hwunmask.
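
  For the non-legacy case, masking through the PIC's function pointers
  could be sketched like this. The '_sketch' types are cut-down,
  hypothetical versions of 'struct pic' and 'struct intrsource', and the
  'masked' word only stands in for real hardware state.

```c
#include <stddef.h>

struct pic_sketch {
	void (*pic_hwmask)(struct pic_sketch *, int);
	void (*pic_hwunmask)(struct pic_sketch *, int);
	unsigned int masked;	/* hypothetical stand-in for hardware state */
};

struct intrsource_sketch {
	struct pic_sketch *is_pic;	/* originating PIC */
	int is_pin;			/* pin on that PIC */
};

/* Example PIC methods that only record the mask bit. */
static void
sketch_hwmask(struct pic_sketch *pic, int pin)
{
	pic->masked |= 1u << pin;
}

static void
sketch_hwunmask(struct pic_sketch *pic, int pin)
{
	pic->masked &= ~(1u << pin);
}

/* Mask the source in slot 'n' via its PIC; empty slots are ignored. */
static void
sketch_mask_slot(struct intrsource_sketch **isources, int n)
{
	struct intrsource_sketch *is = isources[n];

	if (is != NULL)
		(*is->is_pic->pic_hwmask)(is->is_pic, is->is_pin);
}

/* Unmask the source in slot 'n' via its PIC; empty slots are ignored. */
static void
sketch_unmask_slot(struct intrsource_sketch **isources, int n)
{
	struct intrsource_sketch *is = isources[n];

	if (is != NULL)
		(*is->is_pic->pic_hwunmask)(is->is_pic, is->is_pin);
}
```

  Note that the slot index (N) and the hardware pin are independent
  here, which is what lets non-legacy sources land in any slot.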

* The spl/splx/Xdoreti code remains much the same, except that the
  global variables it goes through have now become CPU-local,
  and the Xresume/Xrecurse entry points are taken from the intrsource
  structure.

Some issues:

* To avoid race conditions for an interrupt, it should remain blocked
  in hardware until the handler has run. The IO APIC can mask interrupts
  by setting a bit in the redirection table entry for the interrupt. However,
  this is widely reported to cause edge-triggered interrupts to get lost, so
  that can't be used. Another option is to ack the interrupt after all
  handlers have been run. The problem with this is that an ack only
  acks the highest priority interrupt (highest interrupt ==> highest
  position in the IDT for the local APIC), so that means that ideally,
  IDT allocation should follow software priority, like before.

* I'm using %ebp in the intrstubs, which means that currently DDB
  traces are broken through interrupts. This can be fixed by making
  DDB a little smarter about finding the right frame, but I've not
  done that yet.

* Some of the defines for asm code doing acks use numeric labels, which
  should be fixed.

* Because I was in the middle of debugging, the interrupt stubs always
  use a 'late ack' now, but eventually this won't be done for the
  level-triggered case.


Current state:

It boots fine on a uniprocessor system. In the multiprocessor case,
it gets halfway through rc.d and then hangs, apparently because
an IPI happens when it should be blocked, and both CPUs end up in
a spinlock deadlock.


Why am I posting this? Well, I'd like people to read the code and
send comments, and if you spot something that would point to the
problem I'm seeing, that'd be great, as I will not be able to
spend much time on this for a few weeks.

-- 
Frank van der Linden                                    fvdl@wasabisystems.com
==============================================================================
Quality NetBSD Development, Support & Service.   http://www.wasabisystems.com/