Re: fsck seg fault failure on vmware -i386?

To: "David Holland" <dholland-current%netbsd.org@localhost>
Subject: Re: fsck seg fault failure on vmware -i386?
From: yancm%sdf.lonestar.org@localhost
Date: Fri, 29 Jan 2010 11:52:03 -0000

I was able to rescue the VM.

Investigation summary to date below.

>> To do this, do 'make cleandir && make dependall DBG=-g' in
>> src/sbin/fsck_ffs on a different machine (so as not to touch
>> the broken volume) then run it.

My only build experience is through build.sh. I tried the above but
realized it was going to take time to figure out, so I ended up
adding -g to CFLAGS in the Makefile and (re)building the distribution...
overkill to be sure, but it only required 2 minutes of my time.

I tried several things to investigate the crash. VMs are nice because
I can go back and forth between the failure state snapshot and
"rescued" snapshot AND transmit data back and forth...
let me know what to check next...

Anyway here's what I think I saw...
[let's define VM_Crash and VM_Good as the crash state and rescued state]

1/21/2010
on VM_Crash:
- fsck_ffs built on 1/14/2010 - so -current as of ~1/13
  [#def fsck_ffs_VMC]
- /usr partition /dev/wd0e corrupted and unmounted
- fsck_ffs /dev/wd0e crashes w/coredump

1/28/2010
on VM_Good
- updated src to 1/28/2010 - made distribution and installed
  [this was to make sure any munge from the rescue was removed]
- Created fsck_ffs [with symbols - #def fsck_ffs_VMG]
- fsck_ffs /dev/wd0e - completed cleanly and exited properly!
- Saved fsck_ffs_VMG to "floppy" and restored crash state
- ran gdb+bt with fsck_ffs_VMG against fsck_ffs_VMC.core [See output below]

1/29/2010
on VM_Crash
- mounted floppy to retrieve fsck_ffs_VMG to /tmp
  [note that the floppy mounted, but was getting a string of console
  error messages about write issues - seemed to get fsck_ffs_VMG
  out OK, but was unable to write - may be a VMware issue?]
- verified that fsck_ffs_VMC /dev/wd0e still crashed - it did
- ran fsck_ffs_VMG /dev/wd0e - it crashed too!?!?
- no gdb since no /usr...
- mounted /usr from VMG to get gdb
- ran gdb fsck_ffs_VMG then run /dev/wd0e - completed SUCCESSFULLY?!?!
- ran gdb+bt with fsck_ffs_VMG against fsck_ffs_VMC.core [Same output as before]

SO...it seems fsck_ffs may need something in /usr when it finds issues,
and pukes when it can't find it?

Since I seem to be able to repair /usr from VMC, next step is
to copy it back and mount it so /usr is in sync...but doubt that
changes anything...

Comments?

*** APPENDIX ***
gdb output for fsck_ffs_VMG run against fsck_ffs_VMC.core:

xperiment 28 # gdb ./fsck_ffs fsck_ffs.core
GNU gdb 6.5
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386--netbsdelf"...

warning: exec file is newer than core file.

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libutil.so.7...done.
Loaded symbols for /lib/libutil.so.7
Reading symbols from /lib/libprop.so.1...done.
Loaded symbols for /lib/libprop.so.1
Reading symbols from /lib/libc.so.12...done.
Loaded symbols for /lib/libc.so.12

warning: Can't read pathname for load map: Input/output error.
Core was generated by `fsck_ffs'.
Program terminated with signal 11, Segmentation fault.
#0  0xbbb3c55e in asctime_r () from /lib/libc.so.12
(gdb) bt
#0  0xbbb3c55e in asctime_r () from /lib/libc.so.12
#1  0xbbb3c6b3 in asctime () from /lib/libc.so.12
#2  0xbbb39284 in __ctime50 () from /lib/libc.so.12
#3  0x0804d5e6 in pinode (ino=224938)
    at /usr/src/sbin/fsck_ffs/inode.c:659
#4  0x0804ef96 in clri (idesc=0xbfbfdb44, type=0x805ec95 "UNREF", flag=1)
    at /usr/src/sbin/fsck_ffs/inode.c:572
#5  0x08054675 in pass4 () at /usr/src/sbin/fsck_ffs/pass4.c:127
#6  0x0804f18e in checkfilesys (filesys=<value optimized out>,
    origfs=0xbb901040 "/dev/rwd0e", child=<value optimized out>)
    at /usr/src/sbin/fsck_ffs/main.c:383
#7  0x0804fd57 in main (argc=Cannot access memory at address 0x0
) at /usr/src/sbin/fsck_ffs/main.c:245
(gdb)

> I'm following this tack as Plan A to rescue the VM:

> 1) fortunately, /usr/src is on a separate partition and is likely OK.
> 2) I seem to be able to mount /usr itself read-only.
>    (I believe the damage is isolated in the /usr/obj subdirectory...
>     one of those theory things...)
> 3) Create a new disk with a new (empty) /usr partition
> 4) mount the old /usr on a temporary mount point read-only and
>    copy out all I can onto the new /usr, avoiding /usr/obj...
> 5) Fix up the fstab to use new partition scheme and...
> 6) hopefully boot back into multi-user.
> 7) Then try to debug the fsck dumps per directions provided...
