Current-Users archive


Re: fsck seg fault failure on vmware -i386?



    Date:        Wed, 17 Feb 2010 01:51:53 +0100
    From:        Rhialto <rhialto%falu.nl@localhost>
    Message-ID:  <20100217005153.GE12520%falu.nl@localhost>

  | Well... don't leave us all in suspense... what was it???

If I tell you, what would get you to tune in again next week?

Short answer: someone put a comment, "cannot happen", in the
code, and Senator Murphy saw that, and reacted.

Longer answer (or perhaps just more rational answer...):

It is a bug in the 64 bit tzcode: it uses uninitialised memory
from malloc() in the case where no timezone data files are
available at all.   In any test code anyone is likely to
write, and in most simple programs, or anything that converts
(or attempts to convert) a date before doing much work, the
space returned from malloc is likely to be all nicely cleared,
and everything "works".   But in fsck_ffs, localtime() is only
called after discovering an inode that needs fixing, which may
be after lots of other inodes (and directories, indirect blocks,
etc.) have been fetched, processed, and discarded, so by then malloc()
is quite likely to return random trash.   It is also possible that
running under vmware makes that random trash different from what it
is in other environments, which might be why no-one else has seen
this problem.   This can't have been the first time someone running
current had a filesystem they needed to repair in single user mode
(without /usr/* visible), can it?
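
For anyone who wants to see the effect without reading uninitialised
memory for real, here is a small sketch in C.  It is a toy, not the
tzcode source: toy_state, toy_fake_malloc() and toy_load_failed() are
made-up names, the struct layout is an assumption, and the 0x5A fill
simply stands in for whatever leftover bytes a busy heap might hold.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the tzcode state struct; only a few fields are
 * modelled and the layout is an assumption. */
struct toy_state {
    int     timecnt;   /* number of transition times loaded */
    int     goback;    /* nonzero: times before ats[0] get special handling */
    int     goahead;   /* nonzero: times after the last ats[] entry do */
    int64_t ats[4];    /* transition times */
};

/* Simulate what malloc() hands back: all zero on a fresh heap,
 * leftover bytes (0x5A here) once memory has been recycled. */
static void toy_fake_malloc(struct toy_state *sp, int recycled)
{
    assert(sp != NULL);
    memset(sp, recycled ? 0x5A : 0x00, sizeof *sp);
}

/* Hypothetical "no zone files found" path: it sets the one field it
 * knows about and never touches goback, goahead or ats[]. */
static void toy_load_failed(struct toy_state *sp)
{
    assert(sp != NULL);
    sp->timecnt = 0;
}

/* Returns the goback flag that later code would see. */
static int toy_goback_after_failed_load(int recycled)
{
    struct toy_state st;

    toy_fake_malloc(&st, recycled);
    toy_load_failed(&st);
    return st.goback;
}
```

Calling toy_goback_after_failed_load(0) gives 0 (the "works" case in a
simple test program), while toy_goback_after_failed_load(1) gives a
nonzero value, which is the situation fsck_ffs can end up in.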

Since this only happens (only affects anything) in the 64 bit code
(other important data was being explicitly initialised), it will
only affect current (NetBSD 5 and earlier still have a 32 bit time_t,
I believe).

The updated tzcode released yesterday does not have a fix for this in
it.   It does have a fix to prevent core dumps from asctime() when
localtime() returns NULL (hopefully eventually only for legitimate
reasons, which means enormously big (positive or negative) year values
which will never bother any ordinary use of localtime()).  I suggested
a simple fix, which just meant moving the place where the data that
ended up uninitialised was being initialised to earlier in the code,
so it couldn't be skipped.   I also sent a cc (bcc actually) of that
to Christos, who has been dealing with 64 bit time_t issues in NetBSD.

The actual fix added to the tzcode (most likely to appear in tzcode2010b,
probably next Monday, but that is not yet confirmed) is likely to be
more conservative, and simply clear everything, always, to make sure
there isn't some other internal state that fails to get initialised in
the same way and that no-one has discovered yet.
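
The two approaches described above could look roughly like this.  It
is only a sketch under assumptions: toy_state, toy_alloc_cleared() and
toy_init_early() are invented for illustration and are not what the
tzcode (or NetBSD) actually contains.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the state struct (assumed layout). */
struct toy_state {
    int     timecnt;
    int     goback;
    int     goahead;
    int64_t ats[4];
};

/* Conservative approach: clear the whole struct straight after
 * allocating it, so any field a later code path forgets is zero. */
static struct toy_state *toy_alloc_cleared(void)
{
    struct toy_state *sp = malloc(sizeof *sp);

    if (sp != NULL)
        memset(sp, 0, sizeof *sp);
    return sp;
}

/* Targeted approach: set the specific flags up front, before any
 * early exit can skip them.  Other fields are left alone. */
static void toy_init_early(struct toy_state *sp)
{
    assert(sp != NULL);
    sp->goback = 0;
    sp->goahead = 0;
}
```

The targeted version fixes the fields known to be a problem; the
cleared version also covers any field nobody has noticed yet, at the
cost of one memset() per allocation.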

If you want to investigate the code yourself, look for references to
sp->goback (and sp->goahead) and where those get set, and the sp->ats array,
and look at localsub() and see how they're used.   If after the malloc()
that creates the state struct (eg: in tzsetwall_unlocked()) you deliberately
set lclptr->ats[0] to some very big positive number (it's 64 bits, so push
it out there...) and lclptr->goback to 1 (anything != 0), and then arrange
for there to be no accessible zonetab files (not even "GMT"), you'll see a
NULL from localtime() too (regardless of what time it is being asked to
convert - well, the time must be less than what you put in ats[0]).
The "cannot happen" exit from localsub() is the one that happens...

kre


