Subject: Re: poor error reporting when ld.so is missing
To: Greg A. Woods <woods@weird.com>
From: Emmanuel Dreyfus <p99dreyf@criens.u-psud.fr>
List: tech-kern
Date: 04/16/2001 22:45:02
Greg A. Woods <woods@weird.com> wrote: 

> The correct message would hopefully look something like:
> 
>    <basename of argv[0]>: /usr/libexec/ld.elf_so: No such file or directory

Yes, I think this is what we want. But I don't see how we can do this in
a clean way.

As far as I understand (I learnt this with Linux compatibility), things
happen this way:

1) The shell calls libc's exec(), system(), execve(), or whatever, with
the binary to execute as the argument.

2) libc's exec() fires the exec() system call, with the name of the
binary to execute as the argument.

3) exec calls various functions to discover how to run the program: is it
NetBSD ELF, NetBSD a.out, Linux ELF, FreeBSD a.out?

4) For dynamically linked executables, these test functions check that
the interpreter (ld.so) exists. For ELF programs, the interpreter name is
taken from the .interp ELF section. I don't know how it works for a.out.

5) If the interpreter exists, then the program is matched for the given
architecture. Otherwise, the test fails, and the kernel proceeds to the
next test (e.g. it's not a Linux binary, let's try FreeBSD).

6) The stack is set up for program launch, and ld.so is loaded into
memory. The kernel then passes control to ld.so, which links and
launches the program. For statically linked programs, the kernel just
launches the program.

We can improve things at step 5: if the interpreter cannot be found,
there is no need to perform another test. We just fail to run the
program, and we return ENOENT to userland.
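
To make steps 3 to 5 and the proposed change concrete, here is a toy
model of the probing loop. It is only a sketch, not kernel code: the
names make_probe() and probe_formats(), the dictionary describing the
binary, and the ENOEXEC returned when no probe matches are all
assumptions made for illustration.

```python
import errno

def make_probe(fmt, existing_files):
    """Build a toy test function for one binary format (step 3)."""
    def probe(image):
        if image["format"] != fmt:
            return errno.ENOEXEC      # not this format at all
        interp = image.get("interp")  # None for statically linked programs
        if interp is not None and interp not in existing_files:
            return errno.ENOENT       # step 4: the interpreter is missing
        return 0                      # step 5: the format is matched
    return probe

def probe_formats(probes, image, stop_on_missing_interp=False):
    """Run the probes in order; return (format, 0) or (None, errno)."""
    for name, probe in probes:
        err = probe(image)
        if err == 0:
            return name, 0
        if err == errno.ENOENT and stop_on_missing_interp:
            # Proposed change: the format was recognized, only its
            # interpreter is missing, so trying other formats is pointless.
            return None, errno.ENOENT
        # Behaviour as described in step 5: fall through to the next test.
    return None, errno.ENOEXEC        # assumed generic "no format matched"

files = set()                         # pretend ld.elf_so has been removed
formats = ("netbsd-elf", "linux-elf", "freebsd-aout")
probes = [(f, make_probe(f, files)) for f in formats]
image = {"format": "netbsd-elf", "interp": "/usr/libexec/ld.elf_so"}

print(probe_formats(probes, image))
print(probe_formats(probes, image, stop_on_missing_interp=True))
```

With stop_on_missing_interp set, the loop returns ENOENT as soon as the
matching format reports a missing interpreter. Without it, that ENOENT
is lost once the remaining probes have been tried.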

But this will not tell the user that ld.so was missing. And as far as I
know, userland (the calling program or libc) does not know the
name of the interpreter, so it cannot report
"/usr/libexec/ld.elf_so: No such file or directory".

Here are the different options I see:

- A new error code, and the calling program says that the dynamic linker
was not found. We don't say the name of the missing file, and this needs
a new error code. This is bad.

- We modify the exec() system call so that it returns a string with the
interpreter name on failure. The libc exec() call that userland sees
remains unchanged, of course. On an ENOENT error, the libc exec() call
knows the interpreter name, and it can display an error telling what's
happening. This is bad too, since libc should not output an error: that
is the job of the calling program. And the exec() system call
modification seems a bad hack.

- On an ENOENT error, libc digs out the interpreter name and complains
that it does not exist. This is bad because we have to reproduce some
code that already exists in the kernel, and once again, is it libc's job
to output the message? (But do we have any choice here for reporting an
accurate message?)
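
For what it's worth, the third option could look roughly like the sketch
below. It is written in Python for brevity (a real libc would do this in
C), and the function names elf_interpreter() and explain_exec_enoent()
are hypothetical. It assumes an ELF binary, where the interpreter path
is located through the PT_INTERP program header, which points at the
contents of the .interp section. a.out is not handled.

```python
import errno
import os
import struct

PT_INTERP = 3  # program header type that holds the interpreter path

def elf_interpreter(path):
    """Return the interpreter path of an ELF file, or None if it has none."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:4] != b"\x7fELF" or len(data) < 16:
        return None
    is64 = data[4] == 2                 # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    end = "<" if data[5] == 1 else ">"  # EI_DATA: 1 = little-endian
    # Field widths and offsets differ between ELF32 and ELF64.
    word, phoff_at, phsize_at, off_at, size_at = (
        ("Q", 32, 54, 8, 32) if is64 else ("I", 28, 42, 4, 16))
    try:
        phoff, = struct.unpack_from(end + word, data, phoff_at)
        phentsize, phnum = struct.unpack_from(end + "HH", data, phsize_at)
        for i in range(phnum):
            base = phoff + i * phentsize
            p_type, = struct.unpack_from(end + "I", data, base)
            if p_type != PT_INTERP:
                continue
            off, = struct.unpack_from(end + word, data, base + off_at)
            size, = struct.unpack_from(end + word, data, base + size_at)
            # The stored path is NUL-terminated.
            return os.fsdecode(data[off:off + size].split(b"\0", 1)[0])
    except struct.error:
        return None                     # truncated or corrupt headers
    return None

def explain_exec_enoent(prog):
    """Guess which file an ENOENT from exec of prog refers to."""
    reason = os.strerror(errno.ENOENT)
    if os.path.exists(prog):
        interp = elf_interpreter(prog)
        if interp is not None and not os.path.exists(interp):
            return f"{os.path.basename(prog)}: {interp}: {reason}"
    return f"{prog}: {reason}"
```

A caller would use explain_exec_enoent(path) only after exec has failed
with ENOENT. Note that it re-reads the file after the failure, so it
shows both drawbacks mentioned above: it repeats work the kernel has
already done, and it can only guess at what the kernel actually saw.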

I don't see any clean way of getting the accurate error message. Any
idea about this?
-- 
Emmanuel Dreyfus
p99dreyf@criens.u-psud.fr