Subject: Observations on the SVR4/ELF problem
To: None <port-sparc@netbsd.org>
From: Charles M. Hannum <root@ihack.net>
List: port-sparc
Date: 11/29/1999 17:14:30

As Christos pointed out, a -current ELF kernel has intermittent
lossage running Solaris executables.  Typically, I'll either see
(paraphrased, since the machine is running Solaris right now):

# ./foo
./foo: ld.so.1: libelf.so.1: corrupt or truncated file
Killed
# ./foo
[works fine]
[...]
# cp -p foo .foo && mv .foo foo
# ./foo
./foo: ld.so.1: libelf.so.1: corrupt or truncated file
Killed
# ./foo
[works fine]
[...]

or:

# ./foo
./foo: ld.so.1: libc.so.1: unknown file type or format
Killed
# ./foo
./foo: ld.so.1: libelf.so.1: unknown file type or format
Killed
# [...]

No Solaris programs I tried worked the first time.
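
For anyone who wants to repeat the sequence above without typing it by
hand, here is a minimal sketch.  The helper names `run_n' and `recopy'
are hypothetical, made up for illustration, and `./foo' stands in for
whatever Solaris executable is being tested.  The script only records
exit statuses; it does not diagnose anything.

```shell
#!/bin/sh
# Hypothetical helper: run a program COUNT times and print one line per
# run with its exit status, so that a failure on the first run after a
# copy stands out from the later runs that work.
run_n() {
    prog=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        "$prog" >/dev/null 2>&1
        echo "run $i: exit $?"
        i=$((i + 1))
    done
}

# Hypothetical helper: replace a file with a fresh copy of itself, using
# the same steps as the transcript (cp -p foo .foo && mv .foo foo).
recopy() {
    dir=$(dirname "$1")
    base=$(basename "$1")
    cp -p "$1" "$dir/.$base" && mv "$dir/.$base" "$1"
}

# Example use, mirroring the first transcript:
#   run_n ./foo 2; recopy ./foo; run_n ./foo 2
```

A process that is killed by a signal shows up here as an exit status
above 128, which distinguishes it from a program that merely returned
an error.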

I observe a few things about this lossage:

1) A -current kernel built in a.out format, with tools from around
   05/19, does not experience this problem.

2) A kernel built from 05/20 sources with ELF tools does experience
   the problem (but see below).

3) Adding `-mno-fpu' to the kernel compile changes the nature of the
   lossage somewhat.  In particular, after copying the file, running
   the copy does *not* cause it to lose the first time.  And after one
   executable has died with `corrupt or truncated file', all the ones
   that would have lost with that failure appear to work.

4) I tried aggressively flushing the TLB, and also aggressively saving
   FPU state and disabling the FPU inside the kernel.  Neither of
   these things appeared to solve the problem or even elicit any
   useful information.  Since running with the FPU disabled inside
   the kernel provoked nothing new, this suggests, among other things,
   that the kernel is not actually using floating point.  (Thus one
   wonders why `-mno-fpu' seems to change the behavior!)