Re: what to do on memory or cache errors?

To: Matt Thomas <matt%3am-software.com@localhost>
Subject: Re: what to do on memory or cache errors?
From: Eduardo Horvath <eeh%NetBSD.org@localhost>
Date: Thu, 25 Aug 2011 16:28:48 +0000 (UTC)

On Mon, 22 Aug 2011, Matt Thomas wrote:

> besides panicking, of course.
> 
> This is going to involve a lot of help from UVM.  
> 
> It seems that uvm_fault is not the right place to handle this.  Maybe we 
> need a
> 
> void uvm_page_error(paddr_t pa, int etype);
> 
> where etype would indicate if this was a memory or cache fault, was the cache 
> line dirty, etc.  If uvm_page_error can't "correct" the error, it would panic.
> 
> Interactions with copyin/copyout will also need to be addressed.
> 
> Preemptively, we could have a thread force dirty cache lines to memory if 
> they've been in L2 "too long" (thereby reducing the problem to an ECC error 
> on a clean cache line which means you just toss the cache-line contents.)  We 
> can also have a thread that reads all of memory (slowly) thereby causing any 
> single bit errors to be corrected before they become double-bit errors.
> 
> I'm not familiar enough with UVM internals to actually know what to do but I 
> hope someone else reading this is.
> 
> Comments anyone?

(I can't believe I'm actually getting involved in this discussion.)

I would recommend against trying to add memory error recovery.

1) It doesn't happen very often.

2) It's HARD to implement.  (More on this later.)

3) It's difficult to verify correct operation because of 1.

4) It's highly machine dependent.

5) If you claim to support this and it doesn't work it may open up legal 
issues.

If you did want to do this, most of it would be in MD code.  This means 
both pmap/page fault handling code for CPU faults and on the I/O side for 
DMA issues.  

Things get really interesting (complicated) when you get a fault and try 
to determine the faulting address.  The design of the processor, cache, 
memory, and I/O subsystems is important here.  Where is ECC generated and 
checked?  In the memory controller?  The cache?  The main bus?  The CPU 
core?  How many cache levels are there?  Are they write-back or 
write-through?  Is there an I/O cache?  All these variables affect the MD 
portions of the design, which you need to get right to be able to properly 
survive an error without creating the possibility of data corruption.  We 
could discuss what steps are needed to recover from a specific type of 
memory error in a particular cache level on one model of CPU, but I don't 
think that's something that can be generalized.

Let's assume for the sake of argument that you can implement the MD parts 
of memory error handling correctly across a non-trivial set of machines.  
This means you can identify the fault and clean up the state as much as 
possible.  What do you do now?

There are two different types of faults, correctable and uncorrectable.  
Correctable faults are annoying but dealing with them is relatively 
simple, assuming the system is set up to report correctable faults.  You 
first need to determine if the fault is a hard fault or a soft fault by 
retrying the faulting operation to see if it recurs.  If a memory or cache 
location was modified by random radiation you have a soft fault that is 
unlikely to recur.  In that case you need to keep track of the fault rate 
of that device to decide if it's beginning to wear out and needs to be 
replaced.  If it doesn't need to be replaced then just go about your 
business.  
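
As a rough illustration of the bookkeeping described above, here is a 
minimal sketch in C.  Every name in it (ce_stats, ce_account, the window 
length and the fault limit) is hypothetical and chosen for illustration; 
none of it is an existing UVM or MD interface, and the thresholds are 
assumptions rather than recommended values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; not an existing kernel API. */

enum ce_action { CE_IGNORE, CE_RETIRE };

struct ce_stats {
	uint64_t window_start;	/* start of the counting window, in seconds */
	unsigned soft_count;	/* soft faults seen in the current window */
};

#define CE_WINDOW_SECS	86400u	/* assumed: count faults per 24 hours */
#define CE_SOFT_LIMIT	8u	/* assumed: tolerated soft faults per window */

/*
 * Account for one correctable fault on a device.  The caller has already
 * retried the faulting operation and reports whether the fault recurred.
 */
enum ce_action
ce_account(struct ce_stats *st, uint64_t now, bool recurred_on_retry)
{
	/* A fault that recurs on retry is treated as a hard fault. */
	if (recurred_on_retry)
		return CE_RETIRE;

	/* Start a new counting window if the old one has expired. */
	if (now - st->window_start >= CE_WINDOW_SECS) {
		st->window_start = now;
		st->soft_count = 0;
	}

	/* Soft fault: retire only if the rate within the window is too high. */
	st->soft_count++;
	return (st->soft_count > CE_SOFT_LIMIT) ? CE_RETIRE : CE_IGNORE;
}
```

The retry itself and the meaning of "retire" are left to MD code, which 
is where most of the difficulty lies.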

If the fault rate is too high, or the memory location has a hard fault 
(say a trace has shorted out), you have lost some of your redundancy and 
should stop using that device.  If the memory location is in a cache, you 
need to figure out how to disable it.  If the problem is in RAM, you can 
retire the page and hope it's the only one affected or you can disable the 
entire device.  To disable a device you need to migrate all the pages off 
the device to some other location.  This brings us to an interesting 
problem of identifying the specific piece of hardware that corresponds to 
a certain memory range.  If you don't know where the memory associated 
with that device starts or ends you don't know how many pages need to be 
migrated or where the safe destinations are.  And there's the problem of 
generating the error message "Please replace DIMM number 53."
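
To make the address-to-device problem concrete, here is a minimal sketch 
in C.  It assumes each device covers one contiguous, non-interleaved 
physical range and that some MD or firmware source can fill in the table; 
both are assumptions that many real machines violate.  The struct and 
function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one memory device (e.g. a DIMM). */
struct mem_device {
	uint64_t start;		/* first physical address on the device */
	uint64_t size;		/* size of the device in bytes */
	const char *label;	/* name to print, e.g. "DIMM 53" */
};

/* Return the device that contains physical address pa, or NULL. */
const struct mem_device *
mem_device_lookup(const struct mem_device *tbl, size_t n, uint64_t pa)
{
	for (size_t i = 0; i < n; i++) {
		if (pa >= tbl[i].start && pa - tbl[i].start < tbl[i].size)
			return &tbl[i];
	}
	return NULL;
}

/* Number of pages that would have to be migrated off the device. */
uint64_t
mem_device_pages(const struct mem_device *dev, uint64_t page_size)
{
	return (dev->size + page_size - 1) / page_size;
}
```

With interleaved memory a single range per device is not enough, and the 
lookup has to decode the interleave as well.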

If you have an uncorrectable error things get more interesting.  After 
retiring the memory you need to try to recover the system.  Obviously if 
you have a clean page you should be able to recover it either from backing 
store or ZFOD.  If it's a dirty userland page you can usually send the 
process a SIGBUS, unless the error was caused by DMA, in which case things 
get really interesting.  Do you retry the operation and attempt to correct 
it or generate an error?  Do you send a signal or return an error from an 
I/O system call?  It depends on the device and the type of I/O operation, 
synchronous, asynchronous, or memory mapped.

Finally, kernel pages may not be recoverable or relocatable, depending on 
how the kernel address space is managed for a particular machine.
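
The uncorrectable cases above can be summarized as a decision function.  
This is only a sketch in C under stated assumptions: the names are 
hypothetical, the flags are assumed to be supplied by MD code, and 
treating a dirty kernel page as fatal is my assumption rather than 
something every machine requires.

```c
#include <stdbool.h>

/* Hypothetical names; not an existing UVM interface. */

enum ue_action {
	UE_REFETCH,	/* clean page: reload from backing store or ZFOD */
	UE_SIGBUS,	/* dirty userland page hit by a CPU access */
	UE_DEVICE,	/* DMA error: defer to a device-specific policy */
	UE_PANIC	/* kernel page that cannot be recovered */
};

struct ue_info {
	bool kernel_page;	/* page belongs to the kernel */
	bool kernel_relocatable;/* MD: kernel page can be rebuilt or moved */
	bool dirty;		/* page contents exist nowhere else */
	bool from_dma;		/* error was detected during DMA */
};

/* Choose what to do after the bad memory has been retired. */
enum ue_action
ue_decide(const struct ue_info *ue)
{
	/* Assumption: dirty or non-relocatable kernel pages are fatal. */
	if (ue->kernel_page && (ue->dirty || !ue->kernel_relocatable))
		return UE_PANIC;

	/* A clean page can be recovered from its original source. */
	if (!ue->dirty)
		return UE_REFETCH;

	/* Retry, error return, or signal depends on the device and I/O type. */
	if (ue->from_dma)
		return UE_DEVICE;

	return UE_SIGBUS;
}
```

The UE_DEVICE case hides the hard questions: whether to retry, return an 
error from the system call, or send a signal still depends on the device 
and on whether the I/O was synchronous, asynchronous, or memory mapped.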

Anyway, I'd think the first step you'd want to take if you really want to 
go in this direction is to add memory migration to UVM.  This would be 
useful in other situations such as dynamic reconfig.  Once UVM is capable 
of forcibly reclaiming and detaching specific physical address ranges then 
you can start working on error reporting and recovery.

Eduardo
