Port-macppc archive


Re: Userland instability in NetBSD 6.0.1 MacPPC



>> HOWEVER, I noticed that if I run the exact program with the same
>> input twice, I get different crazy numbers. (!!)
>
>My first inclination would be to suspect flaky hardware.
>
>> This may well be due to some bug in analog where it is referencing
>> some uninitialized data that just happens to be different on every
>> run.
>
>> It occurs to me, though, that if single threaded (and analog is old,
>> so I would expect that), even bugs should be deterministic.
>
>True as far as it goes.  But...
>
>> I wonder if the "different answers on different runs" might be caused
>> by some OS behavior where it is not properly zeroing new vm pages, or
>> some other anti-social, but not fatally incorrect behavior.
>
>...this, while perhaps possible, is rather unlikely.  But there is
>something I've seen called address space layout randomization, which
>tries to put the various pieces of the address space at different
>addresses each run.  It's intended, AIUI, to mostly-defeat
>code-injection malware that has fixed addresses and/or offsets wired
>into it.  If NetBSD has anything of the sort (you said 6.0.1, so it's
>not a version I know), this could mean that the trash left on the stack
>from one routine call to the next can differ from run to run.
>
>> I have seen some strange behavior that seems non-reproducible, though
>> it's hard to tell when bringing up a new box and debugging 12 things
>> at once.
>
>So true.
>

I ran my test case on my x86_64 VM.  I did 60 runs, and none failed.
Rock solid.

I have another non-Quicksilver PPC machine, but I don't have time to pursue
this more right now.

I have packed up my test case into a 55 MByte tgz file at:

ftp mercy.icompute.com pub drop analog.bug.tgz (add slashes for spaces)

(ftp command is cryptic to avoid crawlers finding this file.  It's not
high security, but I don't want it smeared all over the net)

Any machine with analog can run the test.  The scripts are a little cryptic,
but you change the input and output directories in stattmp, and then
use runN 1 5 to do 5 runs.  (The "nums" script is below.)
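
For anyone without the tarball, the repeat-and-compare idea can be sketched
as below.  This is NOT the runN script from the tgz; check_repeat is a
hypothetical helper, and it assumes the command under test writes its
report to standard output.  It runs the command N times, checksums each
run's output with cksum, and flags any run that differs from the first.

```shell
#!/bin/sh
# Hypothetical helper (not the runN script from the tarball).
# Usage: check_repeat N command [args...]
# Assumes the command's result goes to stdout; stderr is discarded.
check_repeat() {
    runs=$1
    shift
    ref=""      # checksum of the first run, used as the reference
    bad=0       # number of runs whose output differed from run 1
    i=1
    while [ "$i" -le "$runs" ]; do
        sum=$("$@" 2>/dev/null | cksum)
        if [ -z "$ref" ]; then
            ref=$sum
        elif [ "$sum" != "$ref" ]; then
            echo "run $i differs: $sum (expected $ref)"
            bad=$((bad + 1))
        fi
        i=$((i + 1))
    done
    echo "$bad of $runs runs differed from run 1"
    # Exit status 0 only if every run matched the first one.
    [ "$bad" -eq 0 ]
}
```

Something like "check_repeat 5 some_command args" would then do 5 runs.  It
only shows that runs disagree with each other; if the first run is itself
wrong, the later ones will merely be reported as different from it.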

My nickel is on this being an OS problem of some sort - cache flush, I/O
timing, VM page locking.  Something easy to find and fix. <snicker>

I can't see how it could be in the VM or I/O subsystems without showing up
elsewhere, though.  It's a mystery......

By the way.... It **appears** to happen less frequently when the CPU is
otherwise fairly idle.  It seems to trigger more failures if I am pulling
one of the big files into vi while I run the test case.  It still only fails
about 1 in 10 runs, though.  My big production runs that take 2 hours to run
_all_ fail.

-dgl-


$ cat ~/bin/nums
#!/bin/ksh

if [ $# -ne 2 ] ; then
        echo "usage: $0 start end"
        exit 1
fi

start=$1
end=$2

i=$start
while [ "$i" -le "$end" ] ; do
        echo "$i"
        i=`expr "$i" + 1`
done

