tech-kern archive


Re: what to do on memory or cache errors?



On Aug 22, 2011, at 2:04 PM, <Paul_Koning%Dell.com@localhost> wrote:

> I would think that memory errors are far more likely than cache errors.  If a 
> CPU gets cache errors, it is very badly broken. 

Probably true, but...

> I'm not sure it's worth doing anything other than panic for cache errors.  

That holds specifically for uncorrected cache errors on a dirty line.  If the
cache line was clean, you could just clear it and keep going.  You might also
want to keep a bitmap of cache lines to see whether cache errors keep
happening for the same cache line.
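
A minimal sketch of the clean/dirty decision and the per-line bitmap
described above.  The cache geometry (CACHE_NLINES) and all names here are
hypothetical, not taken from any existing kernel; a real handler would get
the line index and dirty state from the CPU's error registers.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical: number of cache lines being tracked. */
#define CACHE_NLINES	1024u
#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)

/* One bit per cache line; set once the line has reported an error. */
static unsigned long cache_err_bitmap[(CACHE_NLINES + WORD_BITS - 1) / WORD_BITS];

enum cache_err_action {
	CACHE_ERR_INVALIDATE,	/* clean line: drop it and keep going */
	CACHE_ERR_PANIC		/* dirty line: the only copy of the data is lost */
};

/* Decide what to do with an uncorrected error on a cache line. */
static enum cache_err_action
cache_error_action(bool line_dirty)
{
	return line_dirty ? CACHE_ERR_PANIC : CACHE_ERR_INVALIDATE;
}

/*
 * Record an error on a cache line.  Returns true if this line has
 * already reported an error before, i.e. the fault is recurring.
 */
static bool
cache_err_record(unsigned int line)
{
	assert(line < CACHE_NLINES);

	unsigned long *word = &cache_err_bitmap[line / WORD_BITS];
	unsigned long mask = 1UL << (line % WORD_BITS);
	bool repeat = (*word & mask) != 0;

	*word |= mask;
	return repeat;
}
```

What to do when cache_err_record() reports a repeat (log it, disable the
line if the hardware allows it, or panic after all) is left open here.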

> For memory errors, if you can get the failing address (which some CPUs can do 
> and some cannot) and you can associate that address with some process, then 
> you might kill that process instead of panicking.  Again, I'm not sure how 
> valuable that would be.  For highly fault tolerant control systems, perhaps.  
> For anything else, not clear.  Also, a highly fault tolerant system may well 
> use  replicated CPUs, in which case having one CPU panic simply means the 
> other one takes over.

If the ECC error was in a page backed by the vnode pager, you could just unmap
the errant page, refill it with zeros (fixing the ECC), return it to a free
list, and let whoever wanted the page fault the contents back in.
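
The options discussed so far for an uncorrected memory error could be
combined into one policy function, roughly as sketched below.  The struct
and its fields are hypothetical stand-ins for whatever the VM system
actually knows about the page.  The sketch also assumes the page must be
clean before it can be discarded, since a dirty page's contents could not
be faulted back in from the file.

```c
#include <stdbool.h>

/* Hypothetical description of the page hit by an uncorrected ECC error. */
struct bad_page {
	bool addr_known;	/* the CPU reported the failing address */
	bool vnode_backed;	/* contents can be re-read via the vnode pager */
	bool dirty;		/* modified since it was read in (assumption) */
	bool owned_by_proc;	/* address can be tied to one process */
};

enum mem_err_action {
	MEM_ERR_DISCARD_PAGE,	/* unmap, zero-fill, free; refault on demand */
	MEM_ERR_KILL_PROCESS,	/* kill the owner instead of panicking */
	MEM_ERR_PANIC
};

static enum mem_err_action
mem_error_action(const struct bad_page *pg)
{
	/* Without a failing address there is nothing to recover. */
	if (!pg->addr_known)
		return MEM_ERR_PANIC;

	/* Clean file-backed page: the backing store still has the data. */
	if (pg->vnode_backed && !pg->dirty)
		return MEM_ERR_DISCARD_PAGE;

	/* Data is lost, but only one process is affected. */
	if (pg->owned_by_proc)
		return MEM_ERR_KILL_PROCESS;

	return MEM_ERR_PANIC;
}
```

Zero-filling before the page goes back on the free list matters because
the write regenerates the ECC bits; otherwise the next read of the page
could trap again.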

> In short, is there a reason to change anything?

I don't know.  Which is why I'm asking.

