tech-kern archive


Re: what to do on memory or cache errors?

ECC is enabled only if
- the memory controller supports ECC,
- the memory module supports ECC, and
- both are configured to use ECC,
right?  Then the interrupt handler for the error should invoke the handler
registered by the relevant memory module's device driver.  Ideally, memory
controllers and modules are auto-configured.  (Think memory hotplug.)

Similarly CPU drivers should register cache error handlers too, if needed.
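
To make the registration idea concrete, here is a minimal sketch of how
drivers might register per-range error handlers and how the error
interrupt might dispatch to them.  Everything in it (memerr_register,
memerr_dispatch, the fixed-size table, the example handler) is a
hypothetical name for illustration and not an existing NetBSD interface;
paddr_t is typedef'ed locally so the sketch compiles on its own.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t paddr_t;	/* stand-in for the kernel's paddr_t */

/* Hypothetical callback type: returns 0 if the error was handled. */
typedef int (*memerr_handler_t)(void *cookie, paddr_t pa);

struct memerr_range {
	paddr_t start;		/* first byte covered */
	paddr_t end;		/* one past the last byte covered */
	memerr_handler_t handler;
	void *cookie;		/* driver-private state */
};

#define MEMERR_MAXRANGES 16
static struct memerr_range memerr_ranges[MEMERR_MAXRANGES];
static size_t memerr_nranges;

/* Called by a memory-module (or CPU cache) driver when it attaches. */
int
memerr_register(paddr_t start, paddr_t end, memerr_handler_t h, void *cookie)
{
	if (h == NULL || start >= end || memerr_nranges == MEMERR_MAXRANGES)
		return -1;
	memerr_ranges[memerr_nranges++] =
	    (struct memerr_range){ start, end, h, cookie };
	return 0;
}

/* Called from the error interrupt; -1 means no driver claims the address. */
int
memerr_dispatch(paddr_t pa)
{
	for (size_t i = 0; i < memerr_nranges; i++) {
		const struct memerr_range *r = &memerr_ranges[i];
		if (pa >= r->start && pa < r->end)
			return r->handler(r->cookie, pa);
	}
	return -1;
}

/* Example handler a hypothetical DIMM driver might register: counts errors. */
int
dimm_count_error(void *cookie, paddr_t pa)
{
	(void)pa;
	(*(unsigned *)cookie)++;
	return 0;
}
```

A real version would need locking and an unregister path for hotplug;
both are left out here to keep the sketch short.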

I'm not sure what should be done in these handlers.  Maybe we can learn
from high-end/mission-critical industrial/commercial systems?

I think notification by physical memory address is also needed in
other cases, like migration.  Good to see more interest in this area.

On Tue, Aug 23, 2011 at 5:58 AM, Matt Thomas wrote:
> besides panicking, of course.
> This is going to involve a lot of help from UVM.
> It seems that uvm_fault is not the right place to handle this.  Maybe we
> need a
> void uvm_page_error(paddr_t pa, int etype);
> where etype would indicate if this was a memory or cache fault, was the cache 
> line dirty, etc.  If uvm_page_error can't "correct" the error, it would panic.
> Interactions with copyin/copyout will also need to be addressed.
> Preemptively, we could have a thread force dirty cache lines to memory if 
> they've been in L2 "too long" (thereby reducing the problem to an ECC error 
> on a clean cache line which means you just toss the cache-line contents.)  We 
> can also have a thread that reads all of memory (slowly) thereby causing any 
> single bit errors to be corrected before they become double-bit errors.
> I'm not familiar enough with UVM internals to actually know what to do but I 
> hope someone else reading this is.
> Comments anyone?
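
As a starting point for discussion, the etype argument of the
uvm_page_error() proposed in the quoted message could be sketched as a
set of flag bits, with a small function that decides what to do.  The
flag names, the action enum and page_error_classify() are all
hypothetical; the quoted message only says that etype should tell memory
from cache faults and say whether the cache line was dirty.  The policy
below is an assumption: log corrected errors, toss a clean cache line,
panic otherwise.  It deliberately ignores any further recovery UVM might
be able to do, which is the part the thread leaves open.

```c
#include <stdint.h>

typedef uint64_t paddr_t;	/* stand-in for the kernel's paddr_t */

/* Hypothetical etype bits. */
#define PGERR_MEMORY	0x01	/* reported by the memory controller */
#define PGERR_CACHE	0x02	/* reported by a CPU cache */
#define PGERR_DIRTY	0x04	/* cache line held modified data */
#define PGERR_CORRECTED	0x08	/* hardware (ECC) already corrected it */

enum pgerr_action {
	PGERR_ACT_LOG,		/* corrected: just record it */
	PGERR_ACT_INVALIDATE,	/* clean cache line: discard and refetch */
	PGERR_ACT_PANIC		/* data cannot be recovered here */
};

enum pgerr_action
page_error_classify(paddr_t pa, int etype)
{
	(void)pa;	/* a real version would look up the page owning pa */

	if (etype & PGERR_CORRECTED)
		return PGERR_ACT_LOG;
	if ((etype & PGERR_CACHE) && !(etype & PGERR_DIRTY))
		return PGERR_ACT_INVALIDATE;
	return PGERR_ACT_PANIC;
}
```

Keeping the decision separate from the action makes the policy easy to
test outside the kernel; uvm_page_error() itself would then act on the
returned value.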
