Subject: update to "panic: bad dir"
To: None <port-alpha@NetBSD.ORG>
From: Martin Grossman <grossman@BBN.COM>
List: port-alpha
Date: 06/05/1998 09:25:58
Here's an update to the problem I sent out last Wed (5/27):

PS This is NetBSD for alpha V1.2 as of Aug '97

1) gcc is doing an
   access("/usr/local/lib/gcc-lib/alpha-unknown-netbsd1.2G/2.7.2.2/specs

2) kernel is in ufs_lookup() and prints panic msg "bad dir"..."mangled entry"

3) in kernel dump......
	a) kernel is in namei and is about to search dir 2.7.2.2 (inode #7772)
	   data block for directory entry containing "specs"
	b) vnode (vdp) for directory 2.7.2.2 is 100% correct
	c) vnode points to inode (dp), and that is 100% correct
	   even the 1 and only data block 34967(10) agrees with inode on disk
	   (ie I read up on the ffs, and used dd to read in inode 7772, and
	       modes/links/times/First direct block # all agree with dump)
	d) inode points to a buf header (bp) which is 100% correct!
	   b_flags = 0100230 = B_READ | B_DONE | B_CACHE | B_BUSY
	   B_CACHE is probably set because a user was running a make in a large
	   directory, and just about everything gcc last touched is still in the
	   cache. (ie this system has 512MB of mem, and disk buffer cache is
	   50MB (ie nbuf=6523), and at time of crash only 2 users were on system.)

	***) The problem is....the data buffer that the buf header points to
	     (ie b_data) doesn't contain the contents of the directory!!!!!!
	     It has the first 512 bytes (length of this directory) of an
	     executable program (ie starts with an ELF header), thus we get
	     mangled directory entries!
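
For anyone wanting to double-check the b_flags reading in (d), here is a
minimal sketch of the decode. The bit values are an assumption taken from
the 4.4BSD-derived <sys/buf.h> and should be checked against the header of
the kernel being debugged; the function name decode_b_flags is made up for
this example. Note that the value only reproduces the four names listed in
(d) when it is read as hexadecimal (0x100230).

```python
# Flag values ASSUMED from the 4.4BSD-derived <sys/buf.h>; verify them
# against the sys/buf.h of the kernel actually being debugged.
B_FLAGS = {
    0x00000010: "B_BUSY",
    0x00000020: "B_CACHE",
    0x00000080: "B_DELWRI",
    0x00000200: "B_DONE",
    0x00000800: "B_ERROR",
    0x00002000: "B_INVAL",
    0x00100000: "B_READ",
}


def decode_b_flags(value):
    """Return (names, leftover): the known flag names set in value,
    highest bit first, and any bits this table has no name for."""
    names = []
    leftover = value
    for bit in sorted(B_FLAGS, reverse=True):
        if value & bit:
            names.append(B_FLAGS[bit])
            leftover &= ~bit
    return names, leftover


names, leftover = decode_b_flags(0x100230)
print(" | ".join(names), "leftover:", hex(leftover))
```

A leftover of zero means every set bit was accounted for by the table, so
nothing such as B_INVAL or B_ERROR is hiding in the value.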

*) Since everything else is correct, and disk buffer cache is pageable, is it
   possible that between the last run of gcc (when it did the same access()
   call) and this access() call, the buffer was written out to page area, and
   the wrong page read back in?  I thought the text segments of executable
   files were not written out to the page area!  I thought they were just
   freed, and when needed are read back in directly from the ELF executable
   file (ie demand paged)!

PS The only major VM parameter changed in the system is kmem_map, which has
   been increased from its GENERIC setting to 32MB (ie heap).  All the other
   maps (ie segmap, pagemap, pager, buffers, exec, phys, mb, sysv) are still
   their GENERIC sizes.

>
>We are getting a lot of these panics on 1 system, and a few on other systems.
>
>All systems are exactly the same except the user load!
>All are PC164 DEC motherboards with 512MB mem and an NCR SCSI to a
>10GB (WIDE) disk.
>
>It seems to happen more often under high user load and high NFS (client)
>traffic.
>
>It has happened on both local and NFS directories.
>
>OUTPUT on console (and in /var/log/messages) (and in kernel dumps)
>
>1) First bad
>2) /usr: bad dir ino 7772 at offset 0: mangled entry
>3) panic: bad dir
>
>#1 is coming from ufs_lookup.c  ufs_dirbadentry() because ep->d_reclen
>   is not a multiple of 4
>#2 is coming from ufs_lookup.c ufs_dirbad().
>   a) I've seen "/", "/var", "/usr", and "/nfs/XXX/u1"   (first 3 are UFS)
>   b) various inodes (7772 is 4 levels deep below /usr)
>   c) it's always at offset 0
>
>From running gdb -k /netbsd.1 /netbsd.1.core
>
>I've figured out this much so far.....
>
>1) we are in ufs_lookup() from an access() call
>(ie backtrace is syscall,sys_access,namei,lookup,ufs_lookup,ufs_dirbad,panic)
>
>2) 8 lines after label searchloop: in call to
>   VOP_BLKATOFF(vdp,dp->i_offset,NULL,&bp)
>   I do a print *vdp (vnode) and everything looks right
>   dp->i_offset is zero (which is fine)
>   I do a print *dp (inode) and everything looks right
>   I do a print *bp (buf) and everything looks right
>   I do a print *ep (dirent) (ie bp->b_data) and it's nothing like a
>   directory entry!
>
>	It should contain an inode #, reclen, type, namelen, and a name
>
>			BUT
>
>	it contains	0x464c457f
>			0x00010102
>			0x00000000
>			0x00000000
>			0x90260002
>			0x00000001
>			0x00230000
>			0xfffffc00
>			0x00000040
>			0x00000000
>
>This is the beginning of some ELF executable file!!!!!
>
>Is there any known bug (fixed or not) in or around the disk buffer cache?
>
>PS We are running NetBSD 1.2G (November 1997).
>
>
>
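
To show why the ELF bytes trip the d_reclen check, here is a rough sketch
that reads the ten words from the dump above both ways. It assumes the
usual FFS on-disk entry layout (32-bit d_ino, 16-bit d_reclen, 8-bit
d_type, 8-bit d_namlen) and little-endian byte order on the alpha; the
helper names as_dirent and as_elf_ident are made up for this example.

```python
import struct

# The ten 32-bit words from the gdb dump, in the order printed.
WORDS = [0x464c457f, 0x00010102, 0x00000000, 0x00000000, 0x90260002,
         0x00000001, 0x00230000, 0xfffffc00, 0x00000040, 0x00000000]

# ASSUMPTION: little-endian, so each word is packed low byte first to
# recover the bytes as they sit in the buffer.
raw = struct.pack("<10I", *WORDS)


def as_dirent(buf):
    """Read the fixed part of a directory entry (ASSUMED layout:
    32-bit d_ino, 16-bit d_reclen, 8-bit d_type, 8-bit d_namlen)."""
    d_ino, d_reclen, d_type, d_namlen = struct.unpack_from("<IHBB", buf, 0)
    return {"d_ino": d_ino, "d_reclen": d_reclen,
            "d_type": d_type, "d_namlen": d_namlen}


def as_elf_ident(buf):
    """Read the start of an ELF header: the magic, the class and data
    encoding bytes, then e_type and e_machine at offset 16."""
    e_type, e_machine = struct.unpack_from("<HH", buf, 16)
    return {"magic": buf[:4], "ei_class": buf[4], "ei_data": buf[5],
            "e_type": e_type, "e_machine": e_machine}


ent = as_dirent(raw)
print(ent, "d_reclen % 4 =", ent["d_reclen"] % 4)
print(as_elf_ident(raw))
```

Read as a directory entry, the ELF class/data bytes land in d_reclen and
give 258, which is not a multiple of 4, and that is consistent with the
"First bad" message from ufs_dirbadentry(). Read as an ELF header, the same
bytes give the "\x7fELF" magic, class 2 (64-bit), data 1 (little-endian)
and e_type 2 (executable).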