tech-kern archive


Re: what to do on memory or cache errors?



> besides panicking, of course.

Ideally, I think...

Corrected error: Usually, log and ignore.  Maybe watch for elevated
levels of corrected errors and disable either the containing page or
the containing memory stick, depending on how much the hardware lets
the kernel determine and perhaps on policy sysctls.  Maybe even allow
paranoid sysadmins to configure "elevated levels of" to mean "any".

Uncorrectable error: Log.  Disable the containing page and/or stick, as
mentioned above.  If it's for the contents of a dirty page, about all
we can do is deliver a memory-error signal.  If it's for a clean page
(including (most) instruction-stream fetches), re-fetch the virtual
page into a new physical page and carry on.
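
The policy described above can be sketched as a small decision function
in C.  This is only an illustration under stated assumptions: every
name here (memerr_decide, struct memerr_policy, the action enum) is
hypothetical and is not an existing NetBSD or UVM interface, and the
threshold field merely stands in for whatever a policy sysctl might
provide.

```c
#include <stdbool.h>

/* All names below are hypothetical; none exist in NetBSD or UVM. */
enum memerr_kind { MEMERR_CORRECTED, MEMERR_UNCORRECTABLE };

enum memerr_action {
    MEMERR_LOG_ONLY,           /* log and ignore */
    MEMERR_LOG_RETIRE,         /* log, stop using the page or stick */
    MEMERR_LOG_RETIRE_REFETCH, /* log, retire, re-fetch into a new page */
    MEMERR_LOG_RETIRE_SIGNAL   /* log, retire, deliver a memory-error signal */
};

struct memerr_policy {
    /*
     * Assumed tunable: retire once this many corrected errors have
     * been seen.  1 means "any"; 0 means never retire on corrected
     * errors.
     */
    unsigned corrected_threshold;
};

/*
 * corrected_count is the number of corrected errors seen so far for
 * the affected page or stick, including the one being handled.
 */
enum memerr_action
memerr_decide(enum memerr_kind kind, bool page_dirty,
              unsigned corrected_count, const struct memerr_policy *pol)
{
    if (kind == MEMERR_CORRECTED) {
        if (pol->corrected_threshold != 0 &&
            corrected_count >= pol->corrected_threshold)
            return MEMERR_LOG_RETIRE;
        return MEMERR_LOG_ONLY;
    }
    /* Uncorrectable: a clean page can be re-fetched, a dirty one cannot. */
    return page_dirty ? MEMERR_LOG_RETIRE_SIGNAL
                      : MEMERR_LOG_RETIRE_REFETCH;
}
```

Keeping the decision separate from the mechanism means the threshold
can change without touching the code that actually retires pages.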

> This is going to involve a lot of help from UVM.

Probably.  Maybe the pmap, too, for things such as figuring out what
regions of RAM would have to be disabled to stop using the affected
memory stick, or the like.

> If uvm_page_error can't "correct" the error, it would panic.

I'd recommend doing that only for kernel accesses; for userland, I'd
much prefer to blow up at most the process incurring the fault.

> Preemptively, we could have a thread force dirty cache lines to
> memory if they've been in L2 "too long" (thereby reducing the problem
> to an ECC error on a clean cache line which means you just toss the
> cache-line contents.)

Depends.  Are we talking ECC on L2 cache, or on main memory?  I'd say
the results should be different.

> We can also have a thread that reads all of memory (slowly) thereby
> causing any single bit errors to be corrected before they become
> double-bit errors.

Well, to be detected.  Whether the correct action upon detecting them
is to silently correct them is a policy matter I'd prefer to avoid
wiring into the kernel.
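
To make the detect-versus-correct distinction concrete, here is a toy
userland simulation in C of a slow scrubbing pass.  It is a sketch
under assumptions, not kernel code: memory is modelled as an array of
per-word states rather than real ECC hardware, and all names
(scrub_step, struct scrubber, the policy enum) are hypothetical.  The
pass always detects errors; whether it also corrects single-bit errors
is a policy value handed in from outside.

```c
#include <stddef.h>

/* Simulated per-word ECC state; a stand-in for real hardware. */
enum word_state { WORD_OK, WORD_SINGLE_BIT, WORD_DOUBLE_BIT };

/* Policy chosen outside the scrubber (hypothetical). */
enum scrub_policy { SCRUB_REPORT_ONLY, SCRUB_CORRECT };

struct scrub_stats {
    size_t detected;      /* errors seen, single-bit or double-bit */
    size_t corrected;     /* single-bit errors rewritten */
    size_t uncorrectable; /* double-bit errors seen */
};

struct scrubber {
    enum word_state *mem; /* simulated memory */
    size_t nwords;
    size_t cursor;        /* next word to read */
    enum scrub_policy policy;
};

/*
 * Read at most `budget` words, resuming where the previous call left
 * off, so that a full pass over memory is spread across many calls.
 */
void
scrub_step(struct scrubber *s, size_t budget, struct scrub_stats *st)
{
    if (s->nwords == 0)
        return;
    if (budget > s->nwords)
        budget = s->nwords; /* never visit a word twice in one step */

    for (size_t i = 0; i < budget; i++) {
        enum word_state *w = &s->mem[s->cursor];

        if (*w == WORD_SINGLE_BIT) {
            st->detected++;
            if (s->policy == SCRUB_CORRECT) {
                *w = WORD_OK; /* models writing back the corrected data */
                st->corrected++;
            }
        } else if (*w == WORD_DOUBLE_BIT) {
            st->detected++;
            st->uncorrectable++;
        }
        s->cursor = (s->cursor + 1) % s->nwords;
    }
}
```

With SCRUB_REPORT_ONLY the same pass leaves the simulated memory
untouched and only updates the counters.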

> I'm not familiar enough with UVM internals to actually know what to
> do but I hope someone else reading this is.

Me neither.  I have just about zero idea how implementable any of the
above is; I've been speaking in ideal generalities.  (My idea of ideal
generalities, that is, of course.)

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

