Port-macppc archive


Smoking gun: NetBSD 6.0.1 userland instability



I finished an experiment today.  6.0 kernel fails.  5.2 kernel works.
It looks highly likely that the 6.0.1 (and 6.1_RC1) kernel is broken.

I have been running a NetBSD 6.0.1
system on my production web server, which uses analog to process the
web server statistics.  Analog has been giving me odd results, but worse,
I can run analog on the same input 10 times and get 10 different outputs.

The symptom is that the statistics get enormous numbers in the
totals.  For instance, rather than 25 GBytes total, it might report 5628
exabytes, and certain days (different days on different runs) would show
these huge bogus totals.

I tried running the same software on two different instances of MacPPC G4
hardware, and they behaved the same.  I also tried swapping in a
NetBSD 6.1_RC1 kernel with the same test case.  The system ran fine with
the RC1 kernel, but the test behaved the same, yielding incorrect results.

I tried swapping in a 5.2 kernel under my 6.0.1 system, but init died,
which was not unexpected.  (I can hope!)

Today, I installed a clean NetBSD 5.2 on one of my test machines, and installed
just the analog package (from the 5.1 package build).  Then I ran my
test case with plenty of iterations to exhibit the problem.

It came through clean.

I think I have established that there is something broken in the 6.0.1
kernel that screws up analog's results.  Whatever it is, it is not broken
in 5.2.

If I may opine......

This is unlikely to be an I/O or VM problem.  It is hard to imagine a machine
remaining stable with that core machinery not working properly.  My
test case also does not stress the memory on the test machine, so this does
not appear to be a swap/paging-related problem.  The test seems to
depend only on the length of time it runs, exhibiting an error for every
N seconds of calculation time.

I also doubt it is a cache problem, because things like gzip would be flaky
if the cache were screwed up.  It is hard to imagine anything being halfway
stable with a significant cache bug.

I **believe** that analog is single-threaded.  It's old code; with a single
thread it should be downright deterministic, and it should not use any
fancy threading calls.
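For anyone wanting to check that determinism claim on their own machine, the
"run it 10 times and compare" test can be sketched as a small sh script.  CMD
below is a placeholder, not my actual command line - substitute your real
analog invocation and config:

```shell
#!/bin/sh
# Repeat-and-compare harness: run the same deterministic command N times
# and flag any run whose output differs byte-for-byte from the first.
# CMD is a hypothetical placeholder; override it in the environment or
# edit it to your actual analog invocation.
: "${CMD:=analog}"
: "${N:=10}"

$CMD > run.1 2>/dev/null
i=1
while [ "$i" -lt "$N" ]; do
    i=$((i + 1))
    $CMD > "run.$i" 2>/dev/null
    if ! cmp -s run.1 "run.$i"; then
        echo "run $i differs from run 1"
    fi
done
```

On a kernel where the bug is absent, every run should be byte-identical and
the script prints nothing; under the misbehavior described above, some runs
would differ.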

My bet - FWIW - is that the problem is in the AltiVec handling.  AltiVec is
unique to PPC processors, differs between PPC models, and would
only affect code that uses it.  "Normal" code - compilers, gzip,
I/O, and lots of other things - could go a long way with AltiVec broken,
and no one would notice.

Take my advice/speculation for what you paid for it. ;->

I hope that someone can figure this out and fix it.  I posted my test case
already, and can provide more information to anyone who wants to re-run it.
See: ftp mercy.icompute.com pub drop analog.bug.tgz

The test case takes only about 20 minutes to reliably show the problem.

-dgl-


