Subject: stuck with compat_svr4
To: None <port-sparc@netbsd.org>
From: Manuel Bouyer <bouyer@antioche.lip6.fr>
List: port-sparc
Date: 11/14/2001 16:00:15
Hi,
I'm trying to get a Solaris vendor daemon to run on a NetBSD/sparc machine.
I'm now stuck on a strange problem.

The daemon forks a second process, and the two processes communicate through a
TCP socket, I think (it's the fd returned by an open("/dev/tcp")).
Basically the child exits because it can't receive data from the parent,
but it doesn't even try to read any data.


A truss on Solaris shows:
25811:  write(11, 0xEFFFF614, 147)                      = 147
25811:  sigaction(SIGALRM, 0xEFFFF300, 0xEFFFF380)      = 0
25811:  setitimer(ITIMER_REAL, 0x000F8E90, 0x000F8E90)  = 0
25811:  poll(0xEFFFD348, 1, 10000)                      = 1
25811:  sigaction(SIGALRM, 0xEFFFF300, 0xEFFFF380)      = 0
25811:  setitimer(ITIMER_REAL, 0x000F8E90, 0x00000000)  = 0
25811:  poll(0xEFFFD2B8, 1, 10000)                      = 1
25811:  read(11, 0xEFFFF580, 147)                       = 147
25811:  sigaction(SIGALRM, 0xEFFFF300, 0xEFFFF380)      = 0
25811:  setitimer(ITIMER_REAL, 0x000F8E80, 0x00000000)  = 0
25811:  sigprocmask(SIG_BLOCK, 0xEFFFF3C0, 0xEFFFF450)  = 0

The write() sends data to the parent; the parent reads it and sends the answer
on the socket. The read() is the child reading the answer.

A ktrace on NetBSD shows (I'd rather not expose the process names here :):
  2801 food  CALL  write(0xa,0xeffff464,0x93)
  2801 food  GIO   fd 10 wrote 147 bytes
  2801 food  RET   write 147/0x93
(The child sent data to the parent; ktrace of the parent shows that it received
it properly and wrote the answer to the socket.)
  2801 food  CALL  sigaction(0xe,0xeffff150,0xeffff1d0)
  2801 food  RET   sigaction 0
  2801 food  CALL  setitimer(0,0xf8ea0,0xf8e90)
  2801 food  RET   setitimer 0
  2801 food  CALL  sigaction(0xe,0xeffff150,0xeffff1d0)
  2801 food  RET   sigaction 0
  2801 food  CALL  setitimer(0,0xf8e90,0)
  2801 food  RET   setitimer 0
  2801 food  CALL  sigprocmask(0x1,0xeffff188,0xeffff218)
  2801 food  RET   sigprocmask 0

It doesn't even try to poll()/read()! But the surrounding system calls look the
same. A few syscalls later the process writes the error message to stderr
("can't communicate with parent") and exits.

Any idea how to track this down? I suspect the program is calling
a library function, and that the 'skip poll/read' isn't happening in the
program itself but in a library. This program doesn't use any custom shared
lib.
I tried the Solaris 2.5.1 and 2.7 libraries, as well as hacking svr4_stat.c so
that uname() returns exactly the same values as on the Solaris host.

--
Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
--