tech-kern archive
Re: what to do on memory or cache errors?
On Aug 22, 2011, at 2:04 PM, <Paul_Koning%Dell.com@localhost> wrote:
> I would think that memory errors are far more likely than cache errors. If a
> CPU gets cache errors, it is very badly broken.
Probably true, but...
> I'm not sure it's worth doing anything other than panic for cache errors.
That applies specifically to uncorrected cache errors on a dirty line. If the
cache line was clean, you could just invalidate it and keep going. You might
also want to keep a bitmap of cache lines to see whether errors keep happening
on the same line.
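
A rough sketch of that policy in C, to make the idea concrete. Everything
named here (NCACHELINES, cache_err_decide, the action enum) is hypothetical
and not taken from any kernel; in particular, what to do with a line that
errors a second time is an assumption, so the sketch only reports it as
suspect.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache geometry; a real port would ask the CPU. */
#define NCACHELINES	1024u
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define NWORDS		((NCACHELINES + BITS_PER_WORD - 1) / BITS_PER_WORD)

enum cache_err_action {
	CACHE_ERR_INVALIDATE,		/* clean line, first error seen */
	CACHE_ERR_INVALIDATE_SUSPECT,	/* clean line, has failed before */
	CACHE_ERR_PANIC			/* dirty line: only copy is lost */
};

/* One bit per cache line, set once that line has reported an error. */
static unsigned long cache_err_seen[NWORDS];

/* Mark the line as having failed; return whether it had failed before. */
static bool
cache_err_test_and_set(size_t line)
{
	unsigned long mask = 1UL << (line % BITS_PER_WORD);
	unsigned long *word = &cache_err_seen[line / BITS_PER_WORD];
	bool was_set = (*word & mask) != 0;

	*word |= mask;
	return was_set;
}

/* Decide what to do about an uncorrected error on the given line. */
enum cache_err_action
cache_err_decide(size_t line, bool dirty)
{
	assert(line < NCACHELINES);

	bool repeat = cache_err_test_and_set(line);

	if (dirty)
		return CACHE_ERR_PANIC;
	return repeat ? CACHE_ERR_INVALIDATE_SUSPECT : CACHE_ERR_INVALIDATE;
}
```

The bitmap costs one bit per line, so even a large cache is cheap to track;
a counter per line would say more but is probably overkill for spotting a
line that keeps failing.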
> For memory errors, if you can get the failing address (which some CPUs can do
> and some cannot) and you can associate that address with some process, then
> you might kill that process instead of panicking. Again, I'm not sure how
> valuable that would be. For highly fault tolerant control systems, perhaps.
> For anything else, not clear. Also, a highly fault tolerant system may well
> use replicated CPUs, in which case having one CPU panic simply means the
> other one takes over.
If the ECC error was in a page backed by the vnode-pager, you could just unmap
the errant page, rewrite it with zeros (which fixes the ECC), return it to a
free list, and let whoever wanted the page fault the contents back in.
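
Sketched in C against a toy page structure, the sequence might look like the
following. None of these names are real UVM or pmap interfaces; struct
sketch_page, free_list and ecc_recover_page are made up for illustration. The
check that the page is not dirty is an added assumption: a modified page has
no good copy on disk to fault back in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* hypothetical page size */

/* Toy stand-in for a kernel page descriptor plus its contents. */
struct sketch_page {
	unsigned char data[SKETCH_PAGE_SIZE];
	bool vnode_backed;	/* contents can be re-read from a file */
	bool dirty;		/* modified since it was read in */
	bool mapped;		/* currently mapped by some process */
	struct sketch_page *next_free;
};

static struct sketch_page *free_list;

/*
 * Try to recover from an ECC error in pg.  Returns true if the page was
 * scrubbed and freed, false if the caller must fall back to something
 * harsher (killing the owning process, or panicking).
 */
bool
ecc_recover_page(struct sketch_page *pg)
{
	assert(pg != NULL);

	/* Assumption: only clean, file-backed pages can be refetched. */
	if (!pg->vnode_backed || pg->dirty)
		return false;

	pg->mapped = false;			/* unmap the errant page */
	memset(pg->data, 0, sizeof(pg->data));	/* rewrite to regenerate ECC */
	pg->next_free = free_list;		/* return it to the free list */
	free_list = pg;
	return true;		/* next access faults the contents back in */
}
```

A caller would invoke this from the memory error handler once the failing
physical address has been translated to a page, and treat a false return as
the unrecoverable case.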
> In short, is there a reason to change anything?
I don't know. Which is why I'm asking.