Current-Users archive


10.1 panic under load, maybe memory corruption



Hello,
I have a host (Dell PowerEdge) serving Linux distributions and other
open source software. For some days it has been under very heavy load
(I guess because of new releases of several independent projects), mostly
from http traffic. The load average is frequently above 800 during the day.
Most of the requested data is in cache, so disk activity isn't that
high (maybe even lower than usual), but vnode activity is high.
It's currently running 10.1_STABLE from March 25.

I ran into several issues:
- several hard hangs, with network and serial port console unresponsive.
  I could enter ddb, but ps and the stack trace didn't show anything obvious.
  For lack of time I didn't investigate further.
- on 2 or 3 occasions, a disk timeout (it's a mfii RAID controller)
  followed by a bad pointer dereference in bus_dma. It looks like corrupted
  data in the bus_dma structures, which might explain why the
  controller didn't handle the request (or why the driver failed to track it
  properly). Unfortunately it seems that I have lost the file where
  I took notes.
- The last one is:
  [ 445171.1853765] uvm_fault(0xffffffff80e82000, 0xffffa95199830000, 2) -> e
  [ 445171.1971013] fatal page fault in supervisor mode
  [ 445171.2079205] trap type 6 code 0x2 rip 0xffffffff8066dadd cs 0x8 rflags 0x10282 cr2 0xffffa95199830ff0 ilevel 0x3 rsp 0xffff859b20d8ae50
  [ 445171.2271570] curlwp 0xffffa9510b95d8c0 pid 0.4 lowest kstack 0xffff859b20d862c0   
  db{0}> tr
  uvmpdpol_pagerealize() at netbsd:uvmpdpol_pagerealize+0x3d
  uvm_aio_aiodone_pages() at netbsd:uvm_aio_aiodone_pages+0x1a6
  uvm_aio_aiodone() at netbsd:uvm_aio_aiodone+0xb9
  dkiodone() at netbsd:dkiodone+0xb9
  biointr() at netbsd:biointr+0x61
  softint_dispatch() at netbsd:softint_dispatch+0x11c
  DDB lost frame for netbsd:Xsoftintr+0x4c, trying 0xffff859b20d8b0f0
  Xsoftintr() at netbsd:Xsoftintr+0x4c
  --- interrupt ---
  321b8204dab4e58e:
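
As a sanity check on the numbers in that last trace, here is a minimal
Python sketch that splits cr2 into page base and offset and decodes the
low bits of the fault code. It assumes 4 KiB pages and the standard x86
page-fault error code bits (bit 0 = page present, bit 1 = write,
bit 2 = user mode); decode_fault is a hypothetical helper written for
this illustration, not something from the kernel.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages on amd64

def decode_fault(cr2, code):
    """Split a faulting address and decode x86 page-fault error bits."""
    return {
        "page": cr2 & ~(PAGE_SIZE - 1),    # page-aligned base address
        "offset": cr2 & (PAGE_SIZE - 1),   # byte offset within the page
        "present": bool(code & 0x1),       # bit 0: page was present
        "write": bool(code & 0x2),         # bit 1: access was a write
        "user": bool(code & 0x4),          # bit 2: fault came from user mode
    }

# Values copied from the trap line above.
info = decode_fault(0xffffa95199830ff0, 0x2)
print(hex(info["page"]), hex(info["offset"]), info)
```

Under those assumptions the page base matches the address passed to
uvm_fault(), the offset is 0xff0, and the access is a supervisor-mode
write to a page that was not mapped.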

A reboot after a panic takes about 3 hours, so experimenting isn't easy.
I don't know where to start looking.
Does this ring a bell for anyone?

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 years of experience will always make the difference
--

