Subject: libkvm problems on alpha, but not sparc or i386 (netbsd-1-6)
To: NetBSD/alpha Discussion List <port-alpha@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: port-alpha
Date: 09/29/2004 19:37:40
I've been having some problems with various users of libkvm
w.r.t. getting proper proc, vnode, and mount data, etc. from the live
kernel, and I'm unsure if I've done something weird to bugger up my
source tree, or if this is a known bug on alpha, or what.
For example, fstat has long been giving me the classic "proc size
mismatch" error, even with a fresh new build, with a fresh new libkvm,
running a fresh new kernel, all built at the same time from the same
source tree (using a safe, normal, build.sh style build).
	fstat: proc size mismatch (1496 total, 1256 chunks)
The same source tree has given me no problems on i386 or sparc, so I'm
assuming it has something to do with LP64 issues.
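To make those two numbers concrete: assuming the message prints the
byte count the kernel handed back followed by the userland
sizeof(struct kinfo_proc) (that is my reading of the message format,
not something I have verified against the libkvm source), the check
boils down to a divisibility test.  A minimal sketch in Python, with
hypothetical function names:

```python
# Sketch of the divisibility check that (by assumption) produces
# "proc size mismatch (N total, M chunks)".  The function names are
# hypothetical and are not taken from libkvm.

def proc_size_check(total_bytes, chunk_bytes):
    """Return (whole_records, leftover_bytes) for a kernel buffer."""
    return divmod(total_bytes, chunk_bytes)


def is_mismatch(total_bytes, chunk_bytes):
    """True when the buffer is not a whole number of records."""
    return total_bytes % chunk_bytes != 0


if __name__ == "__main__":
    # Numbers taken from the fstat error message above.
    records, leftover = proc_size_check(1496, 1256)
    print(records, leftover, is_mismatch(1496, 1256))
```

With the numbers from the error, 1496 bytes holds one 1256-byte record
with 240 bytes left over, so under that assumption the kernel and
userland disagree about the record size by 240 bytes per process.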
I did some searches of open PRs but didn't find anything that seemed
related or relevant.
So after scratching my head for way too long I decided to try patching
up my local tree with revs 1.58 and 1.59 of fstat.c (to use
kvm_getprocs2()).
As expected, the proc list was now readable, but why wasn't it before?
However, the only file descriptors fstat can now give any reliable
information about are internet and unix sockets (and perhaps, partly,
pipes).
# ./fstat
USER     CMD          PID   FD MOUNT       INUM MODE         SZ|DV R/W
root     ksh         3053   wd -         -   ?(fffffc0    -
root     ksh         3053    0* pipe 0xfffffc000d7d7068 <- 0xffffffffffffffff r
root     ksh         3053    1 -         -        none    -
root     ksh         3053    2 -         -        none    -
root     ksh         3053    4 -         -        none    -
root     fstat       3052   wd -         -   ?(fffffc0    -
root     fstat       3052    0 -         -        none    -
root     fstat       3052    1* pipe 0xfffffc000d7d73b0 -> 0xffffffffffffffff w
root     fstat       3052    2 -         -        none    -
root     fstat       3052    3 -         -        none    -
root     fstat       3052    4 -         -        none    -
root     fstat       3052    5 -         -        none    -
root     fstat       3052    6 -         -        none    -
root     fstat       3052    7 -         -   ?(fffffc0    -
root     fstat       3052    8 -         -        none    -
root     xterm       1045   wd -         -   ?(fffffc0    -
root     xterm       1045    0* internet stream tcp fffffc0014ae3ca0 204.92.254.24:514 <-> 204.92.254.3:929
root     xterm       1045    1* internet stream tcp fffffc0014ae3ca0 204.92.254.24:514 <-> 204.92.254.3:929
root     xterm       1045    2* pipe 0xfffffc000d7d68c0 -> 0xffffffffffffffff w
root     xterm       1045    3 -         -        none    -
root     xterm       1045    4 -         -        none    -
root     xterm       1045    5* internet stream tcp fffffc0014ae3ab8 204.92.254.24:65454 <-> 204.92.254.3:6000
root     rshd        1043   wd -         -   ?(fffffc0    -
root     rshd        1043    3* internet stream tcp fffffc0014ae38d0 204.92.254.24:1016 <-> 204.92.254.3:928
root     rshd        1043    4* pipe 0xfffffc000d7d6f50 <- 0xffffffffffffffff rn
[[ .... ]]
root     inetd        262   wd -         -   ?(fffffc0    -
root     inetd        262    0 -         -        none    -
root     inetd        262    1 -         -        none    -
root     inetd        262    2 -         -        none    -
root     inetd        262    3* unix dgram fffffe000032a480 <-> fffffe0000288580
root     inetd        262    4* internet stream tcp fffffc0011f82f40 *:21
root     inetd        262    5* internet stream tcp fffffc0011f83128 *:23
root     inetd        262    6* internet stream tcp fffffc0011f83310 *:514
root     inetd        262    7* internet stream tcp fffffc0011f834f8 *:513
root     inetd        262    8* internet stream tcp fffffc0011f836e0 *:79
root     inetd        262    9* internet stream tcp fffffc0011f838c8 *:113
root     inetd        262   10* internet stream tcp fffffc0011f83ab0 *:17
root     inetd        262   11* internet dgram udp fffffc000fb010e0 *:518
root     inetd        262   12* internet stream tcp fffffc0011f83c98 *:7
root     inetd        262   13* internet stream tcp fffffc0014ae2008 *:9
root     inetd        262   14* internet stream tcp fffffc0014ae21f0 *:13
root     inetd        262   15* internet stream tcp fffffc0014ae23d8 *:37
root     inetd        262   16* internet dgram udp fffffc000fb013b0 *:7
root     inetd        262   17* internet dgram udp fffffc000fb01440 *:9
root     inetd        262   18* internet dgram udp fffffc000fb014d0 *:13
root     inetd        262   19* internet dgram udp fffffc000fb01560 *:37
Similarly, systat can't read the mount table, for reasons I can't quite
figure out.  After adding some better error checking to the code, the
best I could get was some new error messages and a whole bunch of
garbage in the second half of the bufcache display.
vmstat has similar breakage as well:
# vmstat -H 
                    total     used     util      num  average  maximum
hash table        buckets  buckets        %    items    chain    chain
bufhash             16384      499     3.05      535     1.07        3
vmstat: kptr 37: hash chain corrupted: kvm_read: Bad address
pstat seems to be able to print open files, though not vnodes:
# pstat -T
146/13196 files
pstat: vnode size mismatch
# pstat -v 
pstat: vnode size mismatch
# pstat -f 
146/13196 open files
       LOC       TYPE    FLG     CNT  MSG        DATA        OFFSET
fffffc000d247638 inode       WA    1    0  fffffc001492e948  0             
fffffc000d2476c8 inode       WA    1    0  fffffc001492eac0  6290          
fffffc000d246048 inode       RW    3    0  fffffc001503cf38  1152          
fffffc000d2473b0 inode       RW    3    0  fffffc000f50b638  0             
fffffc000d247830 socket      RW    1    0  fffffc0013a19850  0             
[[ .... ]]
I'm guessing pretty much everything that still uses kvm_read() is
busted.
It's almost as if some commonly used data type is a different width
inside the kernel and out, or maybe /dev/kmem is busted, or maybe
something's wrong with the nlist reader, or....
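To illustrate the first of those guesses, here is a minimal sketch of
what a width disagreement would look like.  The record layout and its
field names are hypothetical (this is NOT the real struct proc or
struct vnode); the point is only that a record written with 64-bit
longs and pointers, then read back through a 32-bit layout, has the
wrong size and yields fields assembled from the wrong bytes.  Python's
ctypes is used just to pin the field widths explicitly:

```python
import ctypes


class RecLP64(ctypes.LittleEndianStructure):
    """Hypothetical record as an LP64 kernel would lay it out."""
    _fields_ = [
        ("flags", ctypes.c_int64),   # C "long" is 64 bits under LP64
        ("addr", ctypes.c_uint64),   # pointers are 64 bits under LP64
        ("pid", ctypes.c_int32),
        ("fd", ctypes.c_int32),
    ]


class RecILP32(ctypes.LittleEndianStructure):
    """The same record as seen with 32-bit longs and pointers."""
    _fields_ = [
        ("flags", ctypes.c_int32),
        ("addr", ctypes.c_uint32),
        ("pid", ctypes.c_int32),
        ("fd", ctypes.c_int32),
    ]


def misread(rec):
    """Reinterpret the leading bytes of an LP64 record as ILP32."""
    raw = bytes(rec)
    return RecILP32.from_buffer_copy(raw[:ctypes.sizeof(RecILP32)])


if __name__ == "__main__":
    good = RecLP64(flags=1, addr=0xfffffc000d7d7068, pid=3053, fd=0)
    bad = misread(good)
    print(ctypes.sizeof(RecLP64), ctypes.sizeof(RecILP32))
    print(hex(bad.addr), hex(bad.pid), bad.fd)
```

Here the two layouts are 24 and 16 bytes, and the misread record
reports a pid of 0x0d7d7068 (the low half of the pointer) and an fd of
-1024 (the high half).  That is the general flavour of garbage I'd
expect from this kind of problem, though it proves nothing about what
is actually going wrong on the alpha.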
I haven't made any local changes to any of the kernel data structures in
question, nor to any of the type definitions, nor, as far as I can tell,
to anything else that could be related.  Since, as I say, all works well
on i386 and sparc from the same source tree, I'm at a bit of a loss.
Any hints or clues or suggestions about further tests I could do would
be much appreciated.  Debugging some system-level stuff is a bit of a
nightmare as-is....
-- 
						Greg A. Woods
+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>