Subject: Re: i386 Operating Systems
To: None <bsdealwi@uwaterloo.ca>
From: Charles M. Hannum <mycroft@ai.mit.edu>
List: port-i386
Date: 05/24/1995 12:54:58
   1. NetBSD doesn't use the TSS at all, unlike Linux. Why was this
      decision made?

That depends on how you use the TSS.  The JMP TSS instruction does
more work than we actually want in the context switcher, mostly
related to saving and restoring of registers.

Currently, we don't switch TSSes at all, because we don't need to.  We
map all the kernel stacks at the same virtual address, so we don't
have to worry about the stack base address in the TSS, and we don't
use I/O bitmaps at all.

There is some work to change this in progress, but it's not in the
main source tree.  We're eliminating the copying of the kernel stack
on fork(2), which speeds it up a bit, and also allows us to map the
kernel stacks at unique addresses.  This, and allowing I/O bitmaps,
also requires at least changing the TSS on each context switch.  In
turn, we can now avoid switching the page table at all in cases where
we're switching to a `system' process, which could be a boon for NFS
servers (and clients, to a lesser extent).

Right now our experimental code is creating a TSS per process, and
using the LTR instruction to *just* switch the TSS (and thus the I/O
map and stack base), and not do any of the implicit register
saving/restoring that JMP TSS (or CALL TSS) would do.  It's possible
that we're approaching the trade-off point where it's easier and about
the same speed to just use JMP TSS and not do all of the work
manually.  I'll have to do statistics on this at some point.
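For concreteness, the hardware TSS that LTR selects looks roughly like
this.  This is a compilable sketch with illustrative field names, not
our actual identifiers; the point is that with the LTR-only approach,
only the ring-0 stack fields and the I/O map base matter, and the big
register save area in the middle -- the part JMP TSS would fill in --
goes unused:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Sketch of the 386 hardware TSS layout.  Field names are illustrative.
 * With LTR alone, just esp0/ss0 (the kernel stack base) and iomap_base
 * matter; the general-register save area is only written by the
 * implicit JMP/CALL TSS task switch, which we avoid. */
struct i386_tss {
	uint32_t link;			/* previous task link; unused without JMP TSS */
	uint32_t esp0, ss0;		/* ring-0 stack base -- what we switch for */
	uint32_t esp1, ss1;		/* ring-1/2 stacks; unused */
	uint32_t esp2, ss2;
	uint32_t cr3;			/* page table base */
	uint32_t eip, eflags;
	uint32_t eax, ecx, edx, ebx;	/* register save area, written only */
	uint32_t esp, ebp, esi, edi;	/* by the implicit hardware task switch */
	uint32_t es, cs, ss, ds, fs, gs;
	uint32_t ldt;			/* LDT selector */
	uint16_t trap;			/* debug trap on task switch */
	uint16_t iomap_base;		/* offset of the I/O permission bitmap */
};
```

The whole structure is 104 bytes, of which the LTR-only scheme touches
only a handful.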

   2. How are processes' memory space represented? It seems that each process
      has one segment, with the code, data, and stack.

There are two parts to this:

1) The pmap.  This is an MMU-dependent structure, usually a page table
that the hardware can consult automatically on a TLB miss.  When this
lookup fails (because a page is not in the page table), the second
set of data structures is consulted.
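As a rough sketch of the lookup the hardware performs on a TLB miss
(all names here are illustrative, and physical memory is modeled as a
parallel array of table pointers rather than real physical addresses):

```c
#include <stdint.h>
#include <assert.h>

/* Toy model of the two-level i386 page-table walk.  Names are
 * illustrative, not NetBSD's. */
#define PG_V		0x001u		/* present (valid) bit in a PDE/PTE */
#define PG_FRAME	0xfffff000u	/* frame address mask */
#define PDX(va)		(((va) >> 22) & 0x3ffu)	/* page directory index */
#define PTX(va)		(((va) >> 12) & 0x3ffu)	/* page table index */

/* Returns the physical address for `va', or ~0u to signal a fault --
 * the point at which the VM maps would be consulted.  A real walk
 * would follow the frame address in the PDE; here `ptables' stands in
 * for physical memory. */
static uint32_t
pmap_translate(const uint32_t *pdir, const uint32_t *const *ptables,
    uint32_t va)
{
	uint32_t pde = pdir[PDX(va)];
	if (!(pde & PG_V))
		return ~0u;		/* page table absent */
	uint32_t pte = ptables[PDX(va)][PTX(va)];
	if (!(pte & PG_V))
		return ~0u;		/* page not present */
	return (pte & PG_FRAME) | (va & 0xfffu);
}
```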

2) The VM maps.  Each process has a `map' associated with it, which is
essentially a list of mapped memory regions.  Each region corresponds
to a particular `object' and an offset and size within that object.
An object is something like a file or an anonymous memory region, and
can have a `pager' and a `shadow object' associated with it.  The
pager is used to fetch pages that are mapped but not present in the
page table, and to flush pages if memory is needed.  The shadow object
is a sharing mechanism; multiple objects can have the same shadow
object and use its pages.  Pages in a shadow object can be either
completely shared or shared copy-on-write.
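A toy model of the map lookup that happens when the pmap fails (names
are illustrative; the real Mach-derived VM structures are considerably
richer than this):

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Toy model of a VM map: a list of mapped regions, each naming a
 * backing object and an offset within it. */
struct vm_object { int id; };		/* stand-in for a file or anon object */

struct vm_map_entry {
	uintptr_t start, end;		/* virtual range [start, end) */
	struct vm_object *object;	/* backing object */
	uintptr_t offset;		/* offset of `start' within the object */
	struct vm_map_entry *next;
};

/* On a pmap miss, find the entry covering the faulting address, so the
 * object's pager can be asked to fetch the page. */
static struct vm_map_entry *
vm_map_lookup(struct vm_map_entry *map, uintptr_t addr)
{
	for (struct vm_map_entry *e = map; e != NULL; e = e->next)
		if (addr >= e->start && addr < e->end)
			return e;
	return NULL;			/* unmapped: deliver SIGSEGV */
}
```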

   3. How are the descriptor tables used in these OS's?

The GDT currently has a fixed number of elements, corresponding mainly
to kernel code and data segments.  The LDT is used for user segments,
and also contains the iBCS2 and 386BSD system call gate.  The IDT is
used in the obvious fashion, and also contains the Linux system call
gate (which we also use now by default).

There are a couple of changes occurring here, though:

1) User code and data segments have been added to the GDT.  Currently,
these are only used by some Linux executables, but may also be useful
for other quasi-iBCS2 or SVR4 programs that assume the user segments
point to particular GDT slots.  (They're also useful for fixing a
`cute' signal handling problem that I won't go into here.)

2) The TSS and (possibly, depending on the process) LDT will soon have
their own per-process segments in the GDT, so they can be loaded more
quickly at context switch time.  Right now we keep a cached copy of
the per-process LDT descriptor, and copy it into the GDT and reload
the descriptor with LLDT each time we switch into a process that has
changed its LDT, but this is obviously more expensive than we would
like.
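The `descriptor' being cached and copied is just an 8-byte packed
structure; a sketch of the encoding (the function name is made up):

```c
#include <stdint.h>
#include <assert.h>

/* Pack base/limit/type into the 8-byte i386 segment descriptor format
 * -- the thing we cache per-process and copy into the GDT slot before
 * reloading with LLDT. */
static uint64_t
make_descriptor(uint32_t base, uint32_t limit, uint8_t access, uint8_t flags)
{
	uint64_t d = 0;
	d |= (uint64_t)(limit & 0xffff);		/* limit 15:0 */
	d |= (uint64_t)(base & 0xffffff) << 16;		/* base 23:0 */
	d |= (uint64_t)access << 40;			/* type/DPL/present */
	d |= (uint64_t)((limit >> 16) & 0xf) << 48;	/* limit 19:16 */
	d |= (uint64_t)(flags & 0xf) << 52;		/* granularity, size */
	d |= (uint64_t)((base >> 24) & 0xff) << 56;	/* base 31:24 */
	return d;
}
```

For example, make_descriptor(0, 0xfffff, 0x9a, 0xc) yields
0x00cf9a000000ffff, the familiar flat 4GB ring-0 code segment; an LDT
descriptor would instead use access byte 0x82.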

   4. Are there any programming issues we should be aware of between the
      386 and the 486? Are there special twiddles needed for the 486's
      on-chip cache?

The 386 and 486 are very similar from a programmer's perspective.  The
only truly notable difference is that the 386 doesn't honor the write
protect bit in the page tables while in supervisor mode, which
complicates copyin/copyout a bit.  Most of the other differences are
irrelevant.
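To illustrate the copyout complication (a toy model, not our actual
code): on a 486 the kernel can just write and let the hardware fault on
a read-only page, but on a 386 it has to make the check itself first.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Toy illustration of the 386 quirk: in supervisor mode the 386
 * ignores the write-protect bit in PTEs, so copyout() cannot rely on
 * faulting when it writes a read-only user page.  All names and the
 * page model here are illustrative. */
#define PGSIZE	4096u
#define EFAULT	14

struct toy_page { uint8_t data[PGSIZE]; int writable; };

/* Copy `len' bytes out to user pages, performing in software the
 * write-protect check that 386 hardware won't do for the kernel. */
static int
toy_copyout(struct toy_page *pages, uint32_t uva, const void *src,
    uint32_t len)
{
	while (len > 0) {
		struct toy_page *pg = &pages[uva / PGSIZE];
		uint32_t off = uva % PGSIZE;
		uint32_t n = PGSIZE - off < len ? PGSIZE - off : len;
		if (!pg->writable)
			return EFAULT;	/* the check a 486 gets for free */
		memcpy(pg->data + off, src, n);
		src = (const uint8_t *)src + n;
		uva += n;
		len -= n;
	}
	return 0;
}
```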

The cache is *usually* coherent enough in PCs that you don't have to
think about it.  There are specific exceptions to this, though;
notably, Cyrix 486SLC and 486DLC chips in 386 sockets may not have
proper cache coherency.

   5. What 386 features are `useless'? What features are too slow for
      practical use?

There's a feature to automatically copy arguments when entering a call
gate.  For a 386-specific operating system, this could be useful.  You
could arrange to have one system call gate per number of arguments,
and would avoid some extra overhead that we currently incur to fetch
the arguments from user space.  Of course, you could get essentially
the same speedup by passing the arguments in registers.

It turns out that, if you're not using the aforementioned feature,
it's actually slightly better to use a trap gate in the IDT for
entering system calls than the traditional call gate.  The call gate
interface requires some extra code (and thus overhead) to deal with
the case of the trap flag being set on entry; the trap gate turns it
off automatically, and thus avoids the overhead.  From the OS's
perspective, call gates are essentially useless, except for backward
compatibility.

Using task gates for interrupts and/or traps would provide some added
protection, and could allow the per-process kernel stack to be
smaller.  In practice, though, this would probably be more expensive
either because the page table would be switched frequently, or because
there would be a bunch of hackery to avoid switching the page table.

Those are the only ones that come to mind offhand.