Subject: Re: sparc64 / 2.0.1 and thread crashes (was: Re: Ultra 5 / 2.0 / panic: lockmgr: no context)
To: Michael <macallan18@earthlink.net>
From: Gert Doering <gert@greenie.muc.de>
List: port-sparc64
Date: 02/08/2005 09:37:36
Hi,

to summarize and close down this thread...

  - my U5 with 2.0.1 kept crashing every night in "low-duty" periods,
    with varying kernel error messages - sometimes even with RED STATE
    and SIR Reset messages that pointed to hardware issues

  - I've swapped nearly all relevant parts, and it didn't change the
    problem

  - reducing the amount of RAM from 512 MB -> 256 MB made the problem
    appear much faster (crash after about 2-3 hours), which made me assume
    "it has something to do with threads and swapping"  (the "low-duty"
    period where it usually crashed is related to Amanda backup - which
    needs LOTS of RAM, so it's likely that other processes got swapped
    out, and the next time they're needed -> *boom*)

  - tried running without swap for one night: no crash, but lots of
    processes died due to an out-of-memory situation (amanda's fault) - so
    this seemed to confirm that it's not a hardware problem, but
    "threads+swapping" indeed.

  - only two things on the system use threads: perl, and clamav-milter

  - rebuilt perl (5.8.x) and all perl modules to use non-threaded perl
    (clamav-milter still using native threads)
    -> didn't help, machine still crashed -> so it wasn't perl/spamd

  - rebuilt kernel with a backport of the -current thread changes
    (the L_SA_SWITCHING stuff), rebuilt libpthread.so with
    PTHREAD_MLOCK_KLUDGE.  Rebooted this kernel, waited.

    Machine did not crash "in the usual way", but it ended up being
    unusable in other ways: 

      * "top" displayed "[ioflush]" taking 100% CPU usage (indefinitely)
      * typing "sync" made "sync" appear in top, sharing 100% CPU usage
        with "[ioflush]" (both using 50%, obviously)
      * trying to umount a not-in-use filesystem (to see whether it would
        trigger anything) led to "umount" hanging, consuming CPU
      * assuming that "clamav-milter" might be the culprit, I tried to
        kill it.  Various signals were ignored, "kill -9" led to a kernel
        fault:

data fault: pc=11a0434 addr=0
kernel trap 30: data access exception
Stopped in pid 27070.1 (kill) at        netbsd:lwp_continue+0x20:       ld              [%l0 + 0x44], %g1
db> 

    -> so I need to assume that the current thread fixes *do* fix "sparc" 
    (as has been reported by others) but not yet "sparc64".


  - as a last measure, I've rebuilt libmilter.a and clamav-milter to
    use GNU pth (from pkgsrc) and am now running that combo, and *no*
    processes that use native pthreads anymore.  

    Since then, the machine has NOT crashed a single time.
   
    kirk$ uptime
      9:33AM  up 2 days, 13:58, 10 users, load averages: 0.98, 0.63, 0.56

    (which is not something one would usually be proud of, but since the
    machine has crashed every single night for the last 4 weeks, it seems
    to be the breakthrough)


I hope this summary is useful for someone out there :-)

If there are specific additional sparc64/-current patches that I should
test, just tell me.  I have a different machine available (U10) that is
used as a workstation, and it crashes fairly reliably when running Mozilla
(native pthreads) while building a NetBSD world, or doing a "CVS update"
on the NetBSD src tree.
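
For anyone who wants to try reproducing this without Amanda and
clamav-milter, here is a minimal sketch of a stress test.  It assumes the
trigger really is "idle native threads get paged out, then get woken up
again", which is only my working theory from the observations above.  It
also assumes the Python interpreter on the box is built on top of native
pthreads.  The function name "stress" and all parameter values are made
up for illustration; scale the memory figures to exceed physical RAM.

```python
import threading
import time


def stress(n_threads=4, rounds=3, idle_seconds=0.01, chunk_mb=1, chunks=8):
    """Park threads, create memory pressure, then wake the threads again.

    Returns the number of thread wake-ups that completed.
    """
    wake = threading.Event()
    lock = threading.Lock()
    done = []  # one entry per completed wake-up

    def sleeper():
        # Block until the main thread has finished hogging memory.
        wake.wait()
        with lock:
            done.append(1)

    for _ in range(rounds):
        wake.clear()
        workers = [threading.Thread(target=sleeper) for _ in range(n_threads)]
        for w in workers:
            w.start()

        # Give the sleeping threads time to become candidates for page-out.
        time.sleep(idle_seconds)

        # Allocate memory and write to it in 4 KB steps so that every
        # page is really backed by RAM or swap (not just reserved).
        hog = [bytearray(chunk_mb * 1024 * 1024) for _ in range(chunks)]
        for block in hog:
            for i in range(0, len(block), 4096):
                block[i] = 1

        # Wake the parked threads and wait for all of them to finish.
        wake.set()
        for w in workers:
            w.join()
        del hog

    return len(done)


if __name__ == "__main__":
    # Hypothetical values for a 256 MB machine: 32 x 16 MB = 512 MB touched.
    n = stress(n_threads=8, rounds=10, idle_seconds=5, chunk_mb=16, chunks=32)
    print(n, "wake-ups completed")
```

If the theory holds, an affected kernel would be expected to fall over
somewhere in the wake-up phase; a healthy one just prints the count.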

gert

-- 
USENET is *not* the non-clickable part of WWW!
                                                           //www.muc.de/~gert/
Gert Doering - Munich, Germany                             gert@greenie.muc.de
fax: +49-89-35655025                        gert@net.informatik.tu-muenchen.de