Current-Users archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: fsck seg fault failure on vmware -i386?



    Date:        Sat, 13 Feb 2010 03:40:18 -0000
    From:        yancm%sdf.lonestar.org@localhost
    Message-ID:  
<59c7f447f2f24284621a6439851baa32.squirrel%webmail.freeshell.org@localhost>

  | > Can I just put the keywords (such as DBG=-g -O2) in the Makefile?
  | > Either in src/lib/libc/Makefile or
  | > in src/lib/libc/time/Makefile for instance?
  | 
  | Apparently the answer is YES or perhaps sometimes?

Yes, you can modify the Makefiles to get the vars set - most people
prefer not to (which is why you keep getting told ways to do it differently)
as modifying the makefile is kind of crude (and you have to go and
clean it all up again later when you're finished, etc...)

But crude or not, it is easy and works ...

  | I got libc_g.a statically linked into fsck_ffs.

Great.

  | I was unsuccesful at statically linking gdb which was annoying...

Yes, but we can manage with a dynamic gdb, we need nothing special,
just for it to work.

  | I brought up the statically linked fsck_ffs in SU mode
  | and generated a coredump then brought it back into multiuser
  | and ran gdb on the core...

Thanks - at least that confirms that the core dump is what I expected
it to be, and not some other strange error.   We know (100% certain now)
that we're chasing the problem that we thought, and now you have all the
tools necessary to find out why this is happening.

  | Here's what I found - makes no sense to me...looks like
  | timeptr got munged while being passed from inode.c to asctime.c?

Ah - no, the timeptr in asctime() isn't passed from pinode() - pinode()
calls ctime() - ctime is a really simple function ...

        return asctime(localtime(timep));

Aside from the declaration and comments, that is it.   The &t from
pinode() is passed through ctime() to localtime() - the result from localtime()
is what becomes the timeptr() in asctime().

asctime() has a NULL timeptr (this is the first time we have had that
confirmed for certain, until now that was the assumption, but without
confirmation).

That means localtime() returned NULL.    It can do that in some error
situations.

The bug fixes that have been made, are so that if localtime() returns
NULL, asctime() never gets called, ctime() simply returns NULL instead.
And then in fsck, to handle the possibility of ctime() returning NULL.

The change that the original Olson timezone code (that NetBSD's libc
is based upon) will get when the fix for this is released on Tuesday,
is for asctime() (really asctime_r()) to return a constant string
(something that looks like a standard 26 byte asctime() result, but with
all the letters and digits replaced with '?' instead) if the parameter
to asctime() is NULL.   That will mean that ctime() will never return NULL,
and we don't have to go fix all the programs (like fsck_ffs) that
changed from correct to buggy when ctime()'s defiition was changed to
allow a NULL result (before 64 bit times that was never possible.)

Either way the core dump will be fixed.

The issue now, and the problem I've been hoping you can help us find,
is why localtime() is returning NULL.   What I initially assumed was
that the time being handed to localtime() (via ctime()) was one which
was out of range - now that times are 64 bits, there are values that
cannot be converted into a struct tm which is what localtime returns.
Struct tm has a 32 bit integer for the year, and if the time is really
big (positive or negative) then 32 bits just isn't enough to hold the year.

Of course, that will never happen for any time we ever care about
(unless perhaps we're concerned with some kind of astronomic research),
so normally we don't worry much about that.

But a corrupted inode could have contained nonsense data which might
make that happen - and initially that is what I was expecting to see
in your case.

But that isn't it, so ...

  | maybe I really need to run gdb in SU mode...ran out of energy tonight...

Yes, sorry - to finally lay this one to rest, we need to know why
locatime() returned NULL - by the time you get the core file, localtime()
has finished, and there's no record left as to why it did what it did.

That's why a breakpoint on the entry point to ctime() (with gdb running in
single user mode, actually running fsck_ffs - no core file) is needed,
so we can watch what happens in localtime().   Most probably a breakpoint
on localtime() would work too, but localtime() can be called from other
places, and we don't want to waste time if by some chance it happens to
be called for some other reason before the call from ctime() that matters.

If your fsck_ffs run needs to print other times, before the one that
causes the core dump, then ctime() will be called more times as well,
but by now you should have a good idea what fsck_ffs is going to do when you
run it (how many messages you will get before the core) - you can just
'c' when ctime() is called if it isn't the right call yet (that won't be
needed more than once or twice I expect, and perhaps not at all.)

Do this like
        gdb fsck_ffs
        b ctime
        run /dev/whatever

where "/dev/whatever" is whatever is the device name that you are
passing to fsck_ffs (or fsck) that causes the core dump (the bad filesystem).
If you're giving fsck_ffs other args, you can give those on the run line
as well (the only thing different is that you use "run" instead of "fsck_ffs")

Then fsck_ffs will stop immediately it calls ctime() - if you're expecting
it to print a time value before it dumps core, just 'c' and it will
continue to the next call of ctime().

No real harm if you do one too many c's, fsck_ffs will just stop with
a trap, just "run" again, and this time one less 'c' in ctime()...

When it stops in the call to ctime() that matters, a single 's' should get you
inside localtime(), gdb will print each line before executing it.

At the entry to localtime() you can
        p timep
(you should get 0xbfbfe360 (that was &t) from pinode()), and
        p *timep
(you should get 1264115119 - the value of t from pinode()).

Then you can use
        n
twice I expect, until the next line to execute is
        result = localsub(timep, 0L, &tm);

Then you need
        s
so you go down into the call of localsub().

Then you just keep on doing
        s
perhaps quite a lot of times.  You can just use "enter" (return)
after the first of them, gdb just does the last command again if
you give it an empty command.

Try not be tempted to go too fast, we need to stop doing this when
the next line to execute is "return NULL;"    That will certainly
happen sometime, perhaps directly from localsub() or perhaps from
gmtsub() or (my guess) from inside timesub()

Whatever - it doesn't matter, any "return NULL" is what we're looking
for.   Then we want to know what test was the last line executed, the
test that failed to cause the function to return NULL (the error return
from localtime()).   That probably was the last line you did 's' on before
the "return NULL" was reached (but maybe a line or two earlier is what
really matters.)

If you go one 's' past the "return NULL" I'm afraid all the work
(all those 's' or <cr>'s) is wasted, and you need to start it all
again - that's why you need to be patient.   If you see a function
call about to be made that is clearly not time related (like say
malloc() or one of the mutex functions) just use 'n' on that line
instead of 's' so you don't have to single step, line by line, through
malloc() - that would be painful (if you find yourself inside a function
like that, just do 'up' and set a breakpoint on the next line after
the call you don't want, then 'c' - there's a better way in gdb to
do that, and perhaps someone else will know what it is, but I don't...)

When it stops, since that's just after a test (every return NULL is)
and since you will have all the values available, the values of all the
variables that are tested in that test, which will have just been
executed, so we can see what happened, and work out why the perfectly normal
looking time value isn't able to be converted to a struct tm (when there's
no timezone data files available, but is when they are).   That is not
supposed to happen.   We've already found one (or two, but they're related)
bugs thanks to your experience and work - it would be nice to find the
remaining one as well.

Anything you can tell us about what's going on just before that return
NULL will help - including what the call stack is like there (just function
names should do) what line number in localtime.c the test is at, the
values of the variables used in the test that fails, and perhaps, if
you see anything "unusual" if you do
        p *tmp
when you are stopped at that "return NULL" statement (all these functions
have a tmp variable, which is the struct tm that they would return, if
nothing goes wrong).   By "unusual" I mean there's need to record and
tell us that tm_secs (for example) is 23 or something, that's just a
normal valu (the struct tm has all the fields of a date/time broken down,
one field for each, year, month, day, hour, minute, second) but if it
were -12392809283 that would be interesting!  Be aware that months there
run from 0..11 not 1..12 so 0 is an OK value.   And the year value has
1900 subtracted ("110" as the year means 2010).

You're very close now, if you can set up enough of a fake /usr/lib and
/usr/share/misc/termcap* in your single user mode environment to allow
the dynamic gdb to run there, you won't even need any more compiles at
all, just this (hopefully one) run of fsck_ffs under gdb's control
should be enough to provide all of the answers.

kre



Home | Main Index | Thread Index | Old Index