Re: what to do on memory or cache errors?
On Mon, 22 Aug 2011, Matt Thomas wrote:
> besides panicing, of course.
> This is going to involve a lot of help from UVM.
> It seems that uvm_fault is not the right place to handle this. Maybe we need
> void uvm_page_error(paddr_t pa, int etype);
> where etype would indicate if this was a memory or cache fault, was the cache
> line dirty, etc. If uvm_page_error can't "correct" the error, it would panic.
> Interactions with copyin/copyout will also need to be addressed.
> Preemptively, we could have a thread force dirty cache lines to memory if
> they've been in L2 "too long" (thereby reducing the problem to an ECC error
> on a clean cache line which means you just toss the cache-line contents.) We
> can also have a thread that reads all of memory (slowly) thereby causing any
> single bit errors to be corrected before they become double-bit errors.
> I'm not familiar enough with UVM internals to actually know what to do but I
> hope someone else reading this is.
> Comments anyone?
(I can't believe I'm actually getting involved in this discussion.)
I would recommend against trying to add memory error recovery.
1) It doesn't happen very often.
2) It's HARD to implement. (More on this later.)
3) It's difficult to verify correct operation because of 1.
4) It's highly machine dependent.
5) If you claim to support this and it doesn't work, it may open up legal liability.
If you did want to do this, most of it would be in MD code. This means
both pmap/page fault handling code for CPU faults and code on the I/O side for DMA errors.
Things get really interesting (complicated) when you get a fault and try
to determine the faulting address. The design of the processor, cache,
memory, and I/O subsystems is important here. Where is ECC generated and
checked? In the memory controller? The cache? The main bus? The CPU
core? How many cache levels are there? Are they write-back or
write-through? Is there an I/O cache? All these variables affect the MD
portions of the design, which you need to get right to be able to properly
survive an error without creating the possibility of data corruption. We
could discuss what steps are needed to recover from a specific type of
memory error in a particular cache level on one model of CPU, but I don't
think that's something that can be generalized.
Let's assume for the sake of argument that you can implement the MD parts
of memory error handling correctly across a non-trivial set of machines.
This means you can identify the fault and clean up the state as much as
possible. What do you do now?
There are two different types of faults, correctable and uncorrectable.
Correctable faults are annoying but dealing with them is relatively
simple, assuming the system is set up to report correctable faults. You
first need to determine if the fault is a hard fault or a soft fault by
retrying the faulting operation to see if it recurs. If a memory or cache
location was modified by random radiation you have a soft fault that is
unlikely to recur. In that case you need to keep track of the fault rate
of that device to decide if it's beginning to wear out and needs to be
replaced. If it doesn't need to be replaced then just go about your business.
If the fault rate is too high, or the memory location has a hard fault,
say a trace has shorted out, then you have lost some of your redundancy and
should stop using that device. If the memory location is in a cache, you
need to figure out how to disable it. If the problem is in RAM, you can
retire the page and hope it's the only one affected or you can disable the
entire device. To disable a device you need to migrate all the pages off
the device to some other location. This brings us to an interesting
problem of identifying the specific piece of hardware that corresponds to
a certain memory range. If you don't know where the memory associated
with that device starts or ends you don't know how many pages need to be
migrated or which destinations are safe. And there's the problem of
generating the error message "Please replace DIMM number 53."
If you have an uncorrectable error things get more interesting. After
retiring the memory you need to try to recover the system. Obviously if
you have a clean page you should be able to recover it either from backing
store or ZFOD. If it's a dirty userland page you can usually send the
process a SIGBUS, unless the error was caused by DMA, in which case things
get really interesting. Do you retry the operation and attempt to correct
it or generate an error? Do you send a signal or return an error from an
I/O system call? It depends on the device and the type of I/O operation,
synchronous, asynchronous, or memory mapped.
Finally kernel pages may not be recoverable or relocatable, depending on
how the kernel address space is managed for a particular machine.
Anyway, I'd think the first step you'd want to take if you really want to
go in this direction is to add memory migration to UVM. This would be
useful in other situations such as dynamic reconfig. Once UVM is capable
of forcibly reclaiming and detaching specific physical address ranges then
you can start working on error reporting and recovery.