tech-kern archive


Re: what to do on memory or cache errors?



ECC is enabled only if
- the memory controller supports ECC,
- the memory module supports ECC, and
- both are configured to use ECC,
right?  Then the ECC error interrupt handler should invoke the error handler
registered by the relevant memory module's device driver.  Ideally, memory
controllers and modules are autoconfigured.  (Think memory hotplug
support.)
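
To make the registration idea concrete, here is a minimal sketch in plain C,
assuming each memory module driver registers a callback for the physical
address range it covers.  All of the names (memerr_register, memerr_dispatch,
memerr_count) are hypothetical and do not exist in the tree; paddr_t is
redefined locally as a stand-in for the kernel type.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: none of these names exist in NetBSD. */
typedef uint64_t paddr_t;	/* local stand-in for the kernel's paddr_t */

typedef int (*memerr_handler_t)(void *cookie, paddr_t pa);

struct memerr_range {
	paddr_t start;			/* first physical address covered */
	paddr_t end;			/* one past the last address covered */
	memerr_handler_t handler;	/* driver callback */
	void *cookie;			/* driver-private state */
};

#define MEMERR_MAX_RANGES 16
static struct memerr_range memerr_ranges[MEMERR_MAX_RANGES];
static size_t memerr_nranges;

/* Called by a memory module driver at attach time.  Returns 0 on success. */
int
memerr_register(paddr_t start, paddr_t end, memerr_handler_t h, void *cookie)
{
	if (h == NULL || start >= end || memerr_nranges == MEMERR_MAX_RANGES)
		return -1;
	memerr_ranges[memerr_nranges++] =
	    (struct memerr_range){ start, end, h, cookie };
	return 0;
}

/*
 * Called from the ECC error interrupt handler with the faulting address.
 * Returns the driver handler's result, or -1 if no driver claimed it.
 */
int
memerr_dispatch(paddr_t pa)
{
	for (size_t i = 0; i < memerr_nranges; i++) {
		const struct memerr_range *r = &memerr_ranges[i];
		if (pa >= r->start && pa < r->end)
			return r->handler(r->cookie, pa);
	}
	return -1;
}

/* Example driver callback: counts errors in the int that cookie points to. */
int
memerr_count(void *cookie, paddr_t pa)
{
	(void)pa;
	(*(int *)cookie)++;
	return 0;
}
```

A real implementation would need locking and a way to unregister ranges on
detach (for hotplug); both are left out here to keep the sketch short.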

Similarly, CPU drivers should register cache error handlers, if needed.

I'm not sure what should be done in these handlers.  Maybe we can learn
from high-end/mission-critical industrial/commercial systems?

I think notification keyed by physical memory address is also needed in
other cases, such as migration.  Good to see more interest in this area.

On Tue, Aug 23, 2011 at 5:58 AM, Matt Thomas <matt%3am-software.com@localhost> wrote:
>
> besides panicking, of course.
>
> This is going to involve a lot of help from UVM.
>
> It seems that uvm_fault is not the right place to handle this.  Maybe we need a
>
> void uvm_page_error(paddr_t pa, int etype);
>
> where etype would indicate if this was a memory or cache fault, was the cache 
> line dirty, etc.  If uvm_page_error can't "correct" the error, it would panic.
>
> Interactions with copyin/copyout will also need to be addressed.
>
> Preemptively, we could have a thread force dirty cache lines to memory if 
> they've been in L2 "too long" (thereby reducing the problem to an ECC error 
> on a clean cache line which means you just toss the cache-line contents.)  We 
> can also have a thread that reads all of memory (slowly) thereby causing any 
> single bit errors to be corrected before they become double-bit errors.
>
> I'm not familiar enough with UVM internals to actually know what to do but I 
> hope someone else reading this is.
>
> Comments anyone?
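
As a starting point for discussion, the uvm_page_error() proposal above could
be sketched roughly as follows.  This is a userland sketch under assumptions,
not a UVM implementation: the PGERR_* flag bits, the pgerr_action enum and
pgerr_classify() are all made up for illustration, and the "corrected by
hardware" bit is my own addition to the etype idea.  The policy only encodes
what the quoted message says: a bad clean cache line can be tossed, and
anything that cannot be corrected ends in a panic (abort() here).

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint64_t paddr_t;	/* local stand-in for the kernel's paddr_t */

/* Hypothetical etype bits; the encoding is an assumption. */
#define PGERR_CACHE	0x01	/* error was in a cache (else in memory) */
#define PGERR_DIRTY	0x02	/* the affected cache line was dirty */
#define PGERR_UNCORR	0x04	/* hardware could not correct the error */

enum pgerr_action {
	PGERR_ACT_LOG,		/* already corrected: just record it */
	PGERR_ACT_DISCARD,	/* toss the clean cache line */
	PGERR_ACT_PANIC		/* data lost: cannot recover */
};

/* Decide what to do for a given etype.  Pure function, no side effects. */
enum pgerr_action
pgerr_classify(int etype)
{
	if ((etype & PGERR_UNCORR) == 0)
		return PGERR_ACT_LOG;
	if ((etype & PGERR_CACHE) != 0 && (etype & PGERR_DIRTY) == 0)
		return PGERR_ACT_DISCARD;
	return PGERR_ACT_PANIC;
}

void
uvm_page_error(paddr_t pa, int etype)
{
	unsigned long long a = (unsigned long long)pa;

	switch (pgerr_classify(etype)) {
	case PGERR_ACT_LOG:
		fprintf(stderr, "corrected error at pa 0x%llx\n", a);
		break;
	case PGERR_ACT_DISCARD:
		/* A real kernel would invalidate the cache line here. */
		fprintf(stderr, "discarding clean line at pa 0x%llx\n", a);
		break;
	case PGERR_ACT_PANIC:
		fprintf(stderr, "uncorrectable error at pa 0x%llx\n", a);
		abort();	/* stand-in for panic() */
	}
}
```

Keeping the decision in a separate function from the action makes it easy to
extend the policy later without touching the callers.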

