current-users: Re: supervisor trap page fault in lfs

Subject: Re: supervisor trap page fault in lfs_putpages
To: None <current-users@NetBSD.org>
From: Paul Ripke <stix@stix.id.au>
List: current-users
Date: 12/01/2006 22:28:07

The saga continues...

As suggested, I tried a dump and restore, with a
"newfs_lfs -A /dev/rld0g" in between.

I managed to get a hang during the restore... the system would ping,
and I could switch text VTs, but that's all. So I jumped into ddb,
got a backtrace, continued, back to ddb, backtrace, etc., about 20
times over roughly a 10 minute period, during which the system
did not appear to make any forward progress ("systat vm 1" in
another VT didn't budge).

Every backtrace had lfs_writer -> lfs_flush_pchain(), with offsets
between +0x110 and +0x127, which correspond to the following
from lfs_vnops.c according to gdb:

        /*
         * lfs_writevnodes, optimized to clear pageout requests.
         * Only write non-dirop files that are in the pageout queue.
         * We're very conservative about what we write; we want to be
         * fast and async.
         */
        simple_lock(&fs->lfs_interlock);
    top:
0x110   for (ip = TAILQ_FIRST(&fs->lfs_pchainhd); ip != NULL; ip = nip) {
                nip = TAILQ_NEXT(ip, i_lfs_pchain);
                vp = ITOV(ip);

                if (!(ip->i_flags & IN_PAGING))
0x127                   goto top;

                if (vp->v_flag & (VXLOCK|VDIROP))
                        continue;

I don't pretend to understand the guts of this, but is it possible
it was stuck hitting the goto every time? As before, I have the core
and netbsd.gdb. Oh, and I tried this twice, so it wasn't a fluke.
And nada from Google, and no PR that I could find.
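
To make the suspicion concrete, here is a minimal userland sketch of
the same loop shape. It is not the kernel code: the struct layout, the
IN_PAGING value, the scan_pchain() name and the max_restarts cap are all
made up for illustration, and there is no locking. It only shows that
if an entry without IN_PAGING stays on the queue and nothing removes it
or sets the flag, "goto top" restarts the scan from the head forever.

```c
#include <stddef.h>
#include <sys/queue.h>

#define IN_PAGING 0x1   /* hypothetical value, for this sketch only */

/* Hypothetical stand-in for the in-core inode; only what the loop needs. */
struct inode {
	int i_flags;
	TAILQ_ENTRY(inode) i_lfs_pchain;
};
TAILQ_HEAD(pchain, inode);

/*
 * Walk the queue the same way the excerpt does. Returns how many times
 * "goto top" was taken, or -1 once max_restarts is exceeded, which is
 * where the uncapped loop would keep spinning.
 */
int
scan_pchain(struct pchain *head, int max_restarts)
{
	struct inode *ip, *nip;
	int restarts = 0;

top:
	for (ip = TAILQ_FIRST(head); ip != NULL; ip = nip) {
		nip = TAILQ_NEXT(ip, i_lfs_pchain);

		if (!(ip->i_flags & IN_PAGING)) {
			/* Nothing here changes the queue or the flag. */
			if (++restarts > max_restarts)
				return -1;
			goto top;
		}
	}
	return restarts;
}
```

With every entry flagged IN_PAGING the scan finishes with zero
restarts; append a single unflagged entry and it hits the cap. Whether
the real queue can actually get into that state is exactly what I
can't tell from here.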

-- 
stix