RE: what to do on memory or cache errors?

To: <matt%3am-software.com@localhost>, <tech-kern%NetBSD.org@localhost>
Subject: RE: what to do on memory or cache errors?
From: <Paul_Koning%Dell.com@localhost>
Date: Mon, 22 Aug 2011 16:04:57 -0500

I would think that memory errors are far more likely than cache errors.  If a 
CPU gets cache errors, it is very badly broken. 

I'm not sure it's worth doing anything other than panic for cache errors.  

For memory errors, if you can get the failing address (which some CPUs can do 
and some cannot) and you can associate that address with some process, then you 
might kill that process instead of panicking.  Again, I'm not sure how valuable 
that would be.  For highly fault tolerant control systems, perhaps.  For 
anything else, not clear.  Also, a highly fault tolerant system may well use  
replicated CPUs, in which case having one CPU panic simply means the other one 
takes over.

In short, is there a reason to change anything?

        paul

-----Original Message-----
From: tech-kern-owner%NetBSD.org@localhost 
[mailto:tech-kern-owner%NetBSD.org@localhost] On Behalf Of Matt Thomas
Sent: Monday, August 22, 2011 4:58 PM
To: tech-kern Discussion List
Subject: what to do on memory or cache errors?


besides panicking, of course.

This is going to involve a lot of help from UVM.  

It seems that uvm_fault is not the right place to handle this.  Maybe we need a

void uvm_page_error(paddr_t pa, int etype);

where etype would indicate whether this was a memory or cache fault, whether 
the cache line was dirty, etc.  If uvm_page_error can't "correct" the error, it 
would panic.
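
To make the idea concrete, here is a rough user-space sketch of the kind of decision uvm_page_error might make from etype.  Everything in it is hypothetical: the PGERR_* bits, the pgerr_action enum, and pgerr_decide are invented for illustration and are not part of UVM, and the policy shown (discard a clean cache line, otherwise kill the owning process if there is one, otherwise panic) is only one possible reading of this thread.

```c
#include <stdbool.h>

/* Hypothetical etype bits; nothing like this exists in UVM today. */
#define PGERR_MEMORY	0x1	/* error reported by main memory */
#define PGERR_CACHE	0x2	/* error reported by a cache */
#define PGERR_DIRTY	0x4	/* affected cache line held modified data */

/* Hypothetical outcomes a handler could choose between. */
enum pgerr_action {
	PGERR_DISCARD_LINE,	/* toss the line; memory still has good data */
	PGERR_KILL_PROCESS,	/* data lost, but only one process is affected */
	PGERR_PANIC		/* no safe way to continue */
};

/*
 * Decide what to do about an error.  user_owned stands in for the
 * result of mapping the failing physical address to a user process,
 * which is assumed to have been done by the caller.
 */
enum pgerr_action
pgerr_decide(int etype, bool user_owned)
{
	/* A clean cache line can simply be invalidated and refetched. */
	if ((etype & PGERR_CACHE) != 0 && (etype & PGERR_DIRTY) == 0)
		return PGERR_DISCARD_LINE;

	/* Otherwise data is lost: contain it if possible, else give up. */
	return user_owned ? PGERR_KILL_PROCESS : PGERR_PANIC;
}
```

For example, pgerr_decide(PGERR_CACHE, false) picks PGERR_DISCARD_LINE, while pgerr_decide(PGERR_MEMORY, false) picks PGERR_PANIC.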

Interactions with copyin/copyout will also need to be addressed.

Preemptively, we could have a thread force dirty cache lines to memory if 
they've been in L2 "too long" (thereby reducing the problem to an ECC error on 
a clean cache line, which means you just toss the cache-line contents).  We can 
also have a thread that reads all of memory (slowly), thereby causing any 
single-bit errors to be corrected before they become double-bit errors.
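
The slow memory reader could look roughly like the sketch below.  It is a user-space illustration only: struct scrubber and scrub_step are made-up names, and it assumes that simply reading a word is enough for the hardware to detect and correct a single-bit error, as described above.  A real kernel thread would sleep between calls and walk physical memory rather than a plain buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scrubber state for one region of memory. */
struct scrubber {
	const volatile uint64_t *base;	/* start of the region to scrub */
	size_t nwords;			/* region length in 64-bit words */
	size_t next;			/* index of the next word to read */
};

/*
 * Read `budget` words, resuming where the previous call stopped and
 * wrapping to the start of the region at the end.  Returns the number
 * of words read.  Keeping the budget small is what makes the scrub
 * "slow".
 */
size_t
scrub_step(struct scrubber *s, size_t budget)
{
	size_t done = 0;

	if (s->nwords == 0)
		return 0;

	while (done < budget) {
		/* The volatile access forces a real read; the value is unused. */
		(void)s->base[s->next];
		s->next = (s->next + 1) % s->nwords;
		done++;
	}
	return done;
}
```

Calling scrub_step with a small budget from a periodic low-priority thread would eventually touch every word in the region and then start over.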

I'm not familiar enough with UVM internals to actually know what to do, but I 
hope someone else reading this is.

Comments anyone?

