netbsd-bugs: Re: port-dreamcast/34243

Subject: Re: port-dreamcast/34243
To: None <port-dreamcast-maintainer@netbsd.org, gnats-admin@netbsd.org,>
From: Valeriy E. Ushakov <uwe@ptc.spbu.ru>
List: netbsd-bugs
Date: 08/22/2006 06:05:04
The following reply was made to PR port-dreamcast/34243; it has been noted by GNATS.

From: "Valeriy E. Ushakov" <uwe@ptc.spbu.ru>
To: gnats-bugs@netbsd.org
Cc: Yasushi Oshima <oshima-ya@yagoto-urayama.jp>
Subject: Re: port-dreamcast/34243
Date: Tue, 22 Aug 2006 06:36:09 +0400

 I can reproduce this problem on my usl-5p (running the uncommitted
 landisk port, with Nonaka-san's patches integrated into current -current).
 
 The machine boots all the way to the login prompt, but a login
 attempt never succeeds.  If I boot into single user mode and run
 passwd(1), it just "loops", consuming all the CPU.
 
 Running passwd under gdb (with a slightly tweaked ld.elf_so that does
 the debugger handshake before calling .init sections):
 
 # gdb -q passwd
 (no debugging symbols found)...(gdb) run
 Starting program: /usr/bin/passwd
 (no debugging symbols found)...(no debugging symbols found)...
 (no debugging symbols found)...(no debugging symbols found)...
 (no debugging symbols found)...(no debugging symbols found)...
 (no debugging symbols found)...(no debugging symbols found)...
 ^C(no debugging symbols found)...(no debugging symbols found)...
 Program received signal SIGINT, Interrupt.
 0x20690a9c in _init () from /usr/lib/libcom_err.so.4
 (gdb) bt
 #0  0x20690a9c in _init () from /usr/lib/libcom_err.so.4
 #1  0x206907e0 in _init () from /usr/lib/libcom_err.so.4
 #2  0x204234de in _rtld_call_init_functions (first=0x7fffdd50)
     at /usr/src/libexec/ld.elf_so/rtld.c:147
 #3  0x20423466 in _rtld_call_init_functions (first=0x20416e00)
     at /usr/src/libexec/ld.elf_so/rtld.c:141
 #4  0x20423466 in _rtld_call_init_functions (first=0x20416c00)
     at /usr/src/libexec/ld.elf_so/rtld.c:141
 #5  0x20423466 in _rtld_call_init_functions (first=0x20416a00)
     at /usr/src/libexec/ld.elf_so/rtld.c:141
 #6  0x20423466 in _rtld_call_init_functions (first=0x20416800)
     at /usr/src/libexec/ld.elf_so/rtld.c:141
 #7  0x20423466 in _rtld_call_init_functions (first=0x20416600)
     at /usr/src/libexec/ld.elf_so/rtld.c:141
 #8  0x20423466 in _rtld_call_init_functions (first=0x20416400)
     at /usr/src/libexec/ld.elf_so/rtld.c:141
 #9  0x2042427a in _rtld (sp=0x7fffddac, relocbase=541209686)
     at /usr/src/libexec/ld.elf_so/rtld.c:477
 (gdb) x/i $pc
 0x20690a9c <_init+720>: mov.l   @r4,r1
 
 This is inside frame_dummy(), called from libcom_err.so's .init.
 
 Continuing passwd and interrupting it again later (it's still stuck)
 ends up in exactly the same location.
 
 Doing a stepi over this instruction and then continuing unsticks
 passwd, and it prompts for a new password.
 
 Setting a gdb breakpoint *anywhere* on that page makes passwd work,
 even if the breakpoint is never hit.
 
 DDB confirms the picture:
 
 # passwd
 Stopped in pid 15.1 (passwd) at netbsd:cpu_Debugger+0x6: mov r14, r15
 
 db> bt
 cpu_Debugger() at netbsd:scifintr+0x64
 scifintr() at netbsd:intc_intr+0x4a
 intc_intr() at 0x8c000680
 <EXPEVT 000; SSR=00000001> at 0x20690a9c
 db> c
 ^Z[1] + Stopped              passwd
 # jobs -l
 [1] +    15 Stopped              passwd
 # pmap 15
 ...
 20690000      4K read/exec         /usr/lib/libcom_err.so.4.1
 20691000     60K                   /usr/lib/libcom_err.so.4.1
 206A0000      8K read/write        /usr/lib/libcom_err.so.4.1
 ...
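
 As a cross-check, here is a small sketch (Python, not part of the
 original session) of the arithmetic that ties the stuck PC to the
 pmap output above.  The mapping table is transcribed by hand from
 that output, and the helper name find_mapping is made up for
 illustration only.

```python
# Mappings of libcom_err.so.4.1 as printed by "pmap 15" above:
# (start address, size in bytes, protection string).
MAPPINGS = [
    (0x20690000, 4 * 1024, "read/exec"),
    (0x20691000, 60 * 1024, ""),            # pmap shows no permissions here
    (0x206A0000, 8 * 1024, "read/write"),
]

def find_mapping(addr, mappings=MAPPINGS):
    """Return (start, size, prot, offset) for the mapping holding addr, or None."""
    for start, size, prot in mappings:
        if start <= addr < start + size:
            return start, size, prot, addr - start
    return None

if __name__ == "__main__":
    pc = 0x20690a9c  # address reported by both gdb and DDB
    start, size, prot, off = find_mapping(pc)
    print(f"pc {pc:#x}: offset {off:#x} in the "
          f"{size // 1024}K {prot} mapping at {start:#x}")
```

 The stuck instruction sits at offset 0xa9c of the single 4K
 read/exec page, i.e. the same page on which any gdb breakpoint
 makes the problem go away.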
 
 Running a kernel with caches disabled doesn't change this failure
 scenario (besides, there are no relocs in the vicinity of that
 instruction).
 
 I've tried replacing the kernel and ld.elf_so with the old ones from
 the usl-5p CF image provided by Nonaka-san in /misc on ftp.n.o, but
 that doesn't change anything either.
 
 But changing libcom_err.so to the old one (all the rest being
 -current) makes passwd work.
 
 I've tracked the important difference between the two instances of
 libcom_err.so down to the following:
 
 * OLD (works):
 
 Program Headers:
   Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
   LOAD           0x000000 0x00000000 0x00000000 0x010f8 0x010f8 R E 0x10000
   LOAD           0x0010f8 0x000110f8 0x000110f8 0x00130 0x001d8 RW  0x10000
 
 
 * NEW (doesn't work):
 
 Program Headers:
   Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
   LOAD           0x000000 0x00000000 0x00000000 0x00fbc 0x00fbc R E 0x10000
   LOAD           0x000fbc 0x00010fbc 0x00010fbc 0x0012c 0x001d4 RW  0x10000
 
 Note that in the new one the second segment starts on the first page
 of the file (0x0fbc < 0x1000).
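
 The layout difference can be worked out from the program headers
 alone.  The following is a rough sketch (Python, not from the
 original investigation); it assumes 4K pages and that each LOAD
 segment is mapped over the page-rounded range covering
 VirtAddr .. VirtAddr+MemSiz.  The names LIBS, page_span and layout
 are hypothetical helpers, not anything in ld.elf_so.

```python
PAGE = 0x1000  # assumed 4K page size, matching the pmap output

# (file offset, vaddr, filesz, memsz) of the two LOAD segments,
# copied from the program headers listed above.
LIBS = {
    "OLD": ((0x000000, 0x00000, 0x010f8, 0x010f8),
            (0x0010f8, 0x110f8, 0x00130, 0x001d8)),
    "NEW": ((0x000000, 0x00000, 0x00fbc, 0x00fbc),
            (0x000fbc, 0x10fbc, 0x0012c, 0x001d4)),
}

def page_span(vaddr, memsz, page=PAGE):
    """Page-rounded [start, end) range covering vaddr .. vaddr+memsz."""
    start = vaddr & ~(page - 1)
    end = (vaddr + memsz + page - 1) & ~(page - 1)
    return start, end

def layout(text, data, page=PAGE):
    """Summarize the mapping sizes and the file page where data starts."""
    t_start, t_end = page_span(text[1], text[3], page)
    d_start, d_end = page_span(data[1], data[3], page)
    return {
        "text_kb": (t_end - t_start) // 1024,
        "gap_kb": (d_start - t_end) // 1024,
        "data_kb": (d_end - d_start) // 1024,
        "data_file_page": data[0] // page,  # 0 means the first file page
    }

if __name__ == "__main__":
    for name, (text, data) in LIBS.items():
        print(name, layout(text, data))
```

 For NEW this gives 4K text, a 60K gap and 8K data, which agrees with
 the pmap output above, and the data segment's file offset lands on
 file page 0.  For OLD the same rule gives 8K text and a data segment
 whose file offset is on file page 1.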
 
 If I tweak the current libcom_err.so to include some dummy read-only
 data that artificially inflates the first segment to more than one
 page, the resulting libcom_err.so does work.
 
 I guess that the new binutils trigger this bug because they produce a
 shorter .dynsym section (omitting some SECTION entries), which makes
 the first loadable segment shorter than one page.
 
 An interesting test would be to take a "working" system from before
 the binutils upgrade, replace its libcom_err.so.4.1 with the one from
 -current, and see if the bug is triggered.  If it is, that would
 confirm that this is an old kernel bug only made apparent by the
 layout change triggered by the new binutils.
 
 -uwe