RE: what to do on memory or cache errors?
I would think that memory errors are far more likely than cache errors. If a
CPU gets cache errors, it is very badly broken.
I'm not sure it's worth doing anything other than panic for cache errors.
For memory errors, if you can get the failing address (which some CPUs can do
and some cannot) and you can associate that address with some process, then you
might kill that process instead of panicking. Again, I'm not sure how valuable
that would be. For highly fault tolerant control systems, perhaps. For
anything else, not clear. Also, a highly fault tolerant system may well use
replicated CPUs, in which case having one CPU panic simply means the other one
In short, is there a reason to change anything?
[mailto:tech-kern-owner%NetBSD.org@localhost] On Behalf Of Matt Thomas
Sent: Monday, August 22, 2011 4:58 PM
To: tech-kern Discussion List
Subject: what to do on memory or cache errors?
besides panicking, of course.
This is going to involve a lot of help from UVM.
It seems that uvm_fault is not the right place to handle this. Maybe we need a
void uvm_page_error(paddr_t pa, int etype);
where etype would indicate if this was a memory or cache fault, was the cache
line dirty, etc. If uvm_page_error can't "correct" the error, it would panic.
Interactions with copyin/copyout will also need to be addressed.
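As a rough illustration of what the etype argument might carry, here is a minimal sketch of the decision such a hook could make. Everything in it apart from the idea of an etype is an assumption: the PGERR_* bits, the pgerr_action enum, and page_error_policy() are hypothetical names, not existing UVM interfaces, and the has_user_owner flag stands in for whatever lookup would map the failing address to a process. The policy shown (toss a clean cache line, kill the owning process if there is one, otherwise panic) is only one reading of the options discussed in this thread.

```c
#include <stdbool.h>

/* Hypothetical etype bits; the proposal does not define any. */
#define PGERR_MEMORY 0x01   /* error reported by main memory (ECC) */
#define PGERR_CACHE  0x02   /* error reported by a cache */
#define PGERR_DIRTY  0x04   /* the affected cache line was dirty */

/* Hypothetical outcomes of handling a page error. */
enum pgerr_action {
    PGERR_ACT_DISCARD,      /* drop the line; memory still has good data */
    PGERR_ACT_KILL_OWNER,   /* data lost, but only one process is affected */
    PGERR_ACT_PANIC         /* cannot "correct" the error */
};

/*
 * Decide what to do about an error of type etype.  has_user_owner says
 * whether the failing address could be associated with a user process.
 */
static enum pgerr_action
page_error_policy(int etype, bool has_user_owner)
{
    /* A clean cache line can simply be thrown away and refetched. */
    if ((etype & PGERR_CACHE) && !(etype & PGERR_DIRTY))
        return PGERR_ACT_DISCARD;

    /* Otherwise the data is gone; confine the damage if we can. */
    if (has_user_owner)
        return PGERR_ACT_KILL_OWNER;

    return PGERR_ACT_PANIC;
}
```

A real uvm_page_error(pa, etype) would still have to work out the owner of pa and cope with errors hit in the middle of copyin/copyout, neither of which this sketch attempts.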
Preemptively, we could have a thread force dirty cache lines to memory if
they've been in L2 "too long" (thereby reducing the problem to an ECC error on
a clean cache line which means you just toss the cache-line contents.) We can
also have a thread that reads all of memory (slowly) thereby causing any single
bit errors to be corrected before they become double-bit errors.
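The slow read-all-of-memory thread could look roughly like the following. This is a userspace sketch under stated assumptions, not kernel code: struct scrubber and scrub_step() are hypothetical names, the "memory" is just an array, and it assumes the hardware corrects a single-bit error when the word is read (some controllers also need the corrected value written back, which is not shown). The caller is expected to sleep between steps so the scrub stays slow.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scrubber state: a region of memory and a cursor into it. */
struct scrubber {
    const volatile uint64_t *base;  /* volatile so each read really happens */
    size_t nwords;                  /* size of the region in 64-bit words */
    size_t next;                    /* index of the next word to read */
};

/*
 * Read up to budget words, wrapping at the end of the region, and
 * return the XOR of the values read.  The return value only exists so
 * the reads have a visible use; the point is the read itself.
 */
static uint64_t
scrub_step(struct scrubber *s, size_t budget)
{
    uint64_t sink = 0;

    if (s->nwords == 0)
        return 0;

    for (size_t i = 0; i < budget; i++) {
        sink ^= s->base[s->next];
        s->next = (s->next + 1) % s->nwords;
    }
    return sink;
}
```

A thread would call scrub_step() with a small budget, sleep, and repeat, so that every word is visited once per pass without competing much with real work.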
I'm not familiar enough with UVM internals to actually know what to do but I
hope someone else reading this is.