Port-macppc archive


Re: Userland instability in NetBSD 6.0.1 MacPPC

>> HOWEVER, I noticed that if I run the exact program with the same
>> input twice, I get different crazy numbers. (!!)
>My first inclination would be to suspect flaky hardware.
>> This may well be due to some bug in analog where it is referencing
>> some uninitialized data that just happens to be different on every
>> run.
>> It occurs to me, though that if single threaded (and analog is old,
>> so I would expect that), even bugs should be deterministic.
>True as far as it goes.  But...
>> I wonder if the "different answers on different runs" might be caused
>> by some OS behavior where it is not properly zeroing new vm pages, or
>> some other anti-social, but not fatally incorrect behavior.
>...this, while perhaps possible, is rather unlikely.  But there is
>something I've seen called address space layout randomization, which
>tries to put the various pieces of the address space at different
>addresses each run.  It's intended, AIUI, to mostly-defeat
>code-injection malware that has fixed addresses and/or offsets wired
>into it.  If NetBSD has anything of the sort (you said 6.0.1, so it's
>not a version I know), this could mean that the trash left on the stack
>from one routine call to the next can differ from run to run.
>> I have seen some strange behavior that seems non-reproducible, though
>> it's hard to tell when bringing up a new box and debugging 12 things
>> at once.
>So true.

I have taken my "test case", which consists of a few 10s of megabytes of
log data and some scripts, and copied it to another machine.

The two machines are both PowerMac G4 towers, similar "Quicksilver"
models with roughly 700 MHz CPUs.  One has more memory than the other.

They are named "charm" and "mercy".  Mercy is my production server, so
I am not free to futz with it.  Charm is a "test" machine.

Mercy has 3 disks, 2 of them on the motherboard's internal ATA connector.
Charm has only two disks, both on the internal ATA connector.

There are two test cases, each to be run 20 times per machine.  I'll call
them case 1 and case 2.  I have done all 40 runs on charm, and about 20
runs so far on mercy.

Both machines are behaving similarly.  Of the 40 runs on charm, I had 3
failures on each of case 1 and case 2.  So far on mercy, I have seen only
2 failures of case 2, but the runs are not done.
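
A repeat-run check like the one described above can be sketched as follows.
This is only an illustration, not the scripts actually used here: the
function names are made up, the command to run is taken from the command
line, and it assumes the program writes its report to stdout.

```python
import hashlib
import subprocess
import sys
from collections import Counter


def run_and_hash(cmd):
    """Run cmd once and return the SHA-256 hex digest of its stdout."""
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, check=False
    )
    return hashlib.sha256(result.stdout).hexdigest()


def tally_runs(cmd, runs=20):
    """Run cmd `runs` times and count how often each distinct output occurs."""
    return Counter(run_and_hash(cmd) for _ in range(runs))


def summarize(tally):
    """Return (most common digest, number of runs that differ from it)."""
    majority, count = tally.most_common(1)[0]
    return majority, sum(tally.values()) - count


if __name__ == "__main__":
    # Usage (hypothetical): python repeat.py <command> [args...]
    command = sys.argv[1:]
    if command:
        majority, differing = summarize(tally_runs(command))
        print("most common output:", majority)
        print("runs that differ:  ", differing)
```

If the report embeds a generation time or similar per-run text, every digest
will differ, so that text would have to be filtered out before hashing.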

The symptom of the failure is that the statistics are wrong.  Analog
generates HTML output with lots of graphs and stuff.  What happens in
most of the failures is that the total bytes moved, which should be
29.54 GBytes, is either something much larger or a huge negative number.
I have modified the output slightly so it tells me where these big
"anomalies" are coming from, and it is clear that some small number of
"inputs" are coming up as garbage numbers.  For instance, the bar graphs
might all show "0%", except for a particular reference with "100%" of the
bytes transferred, and the grand total might be "450 exabytes".
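
To show why a single garbage input produces graphs like that, here is a
small sketch.  It is not analog's code and not the modification described
above; the function names and the per-reference numbers are invented, and
29.54 GBytes is assumed to mean 29.54e9 bytes.

```python
def find_anomalies(byte_counts, expected_total):
    """Return (name, value) pairs that are negative or exceed expected_total."""
    return [
        (name, value)
        for name, value in byte_counts.items()
        if value < 0 or value > expected_total
    ]


def shares(byte_counts):
    """Return each entry's percentage of the grand total (0.0 if the total is 0)."""
    total = sum(byte_counts.values())
    if total == 0:
        return {name: 0.0 for name in byte_counts}
    return {name: 100.0 * value / total for name, value in byte_counts.items()}


if __name__ == "__main__":
    # Invented data: two plausible entries plus one garbage value of
    # 450 exabytes (4.5e20 bytes).
    counts = {"/a.html": 10e9, "/b.html": 19.54e9, "/c.html": 4.5e20}
    for name, value in find_anomalies(counts, expected_total=29.54e9):
        print("anomaly:", name, value)
    for name, pct in shares(counts).items():
        print(name, round(pct), "%")
```

One bad value that is many orders of magnitude larger than the rest pushes
every other share down to a rounded 0% and takes essentially 100% itself.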

The interesting thing about this is that every failed run clearly hits the error
in a different place.  No two alike.

In sum, since two separate machines fail in the same intermittent way, I
think I have ruled out hardware per se.

I am now trying the same test on NetBSD 6.0.1 in an x86_64 VM, running
under VMware on a Mac OS X 10.6 system.

I'll report back.

