Source-Changes-HG archive


[src/trunk]: src/sys/ufs * Remove PGO_RECLAIM during lfs_putpages()'s call to ...



details:   https://anonhg.NetBSD.org/src/rev/4ba15dcbd102
branches:  trunk
changeset: 772414:4ba15dcbd102
user:      perseant <perseant%NetBSD.org@localhost>
date:      Mon Jan 02 22:10:44 2012 +0000

description:
* Remove PGO_RECLAIM during lfs_putpages()'s call to genfs_putpages(),
  to avoid a livelock in the latter when reclaiming a vnode with
  dirty pages.

* Add a new segment flag, SEGM_RECLAIM, to note when a segment is
  being written for vnode reclamation, and record which inode is being
  reclaimed, to aid in forensic debugging.

* Add a new segment flag, SEGM_SINGLE, so that opportunistic writes
  can write a single segment's worth of blocks and then stop, rather
  than writing all the way up to the cleaner's reserved number of
  segments.

* Add assertions to check that mutex ownership is as expected,
  mostly in lfs_putpages; fix the problems these uncovered.

* Don't clear VU_DIROP until the inode actually makes its way to disk,
  avoiding a problem where dirop inodes could become separated
  (uncovered by a modified version of the "ckckp" forensic regression
  test).

* Move the vfs_getopsbyname() call into lfs_writerd.  Prepare code to
  make lfs_writerd notice when there are no more LFSs and exit,
  dropping the reference, so that, in theory, the module can be
  unloaded.  This code is not enabled, since it causes a crash on exit.

* Set IN_MODIFIED on inodes flushed by lfs_flush_dirops.  Really we
  only need to set IN_MODIFIED if we are going to write them again
  (e.g., to write pages); need to think about this more.
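
For illustration only, the SEGM_SINGLE behaviour described above can be
sketched in userland C.  The flag value comes from the lfs.h hunk in
this change; struct toy_fs, toy_write_batch() and toy_segwrite() are
hypothetical stand-ins, not kernel code, and the sketch assumes that
each batch of blocks fills exactly one segment.

```c
#include <assert.h>

#define SEGM_SINGLE 0x0100	/* value from the lfs.h hunk in this change */

/* Hypothetical stand-in for the file system state; not struct lfs. */
struct toy_fs {
	int curseg;	/* segment currently being written */
	int startseg;	/* segment where this write began */
};

/* Hypothetical: pretend every batch of blocks fills one whole segment. */
void
toy_write_batch(struct toy_fs *fs)
{
	fs->curseg++;
}

/*
 * Write up to nbatches batches and return how many were written.
 * With SEGM_SINGLE set, stop as soon as the current segment differs
 * from the one we started in, mirroring the
 * "fs->lfs_curseg != fs->lfs_startseg" test added in lfs_segment.c.
 */
int
toy_segwrite(struct toy_fs *fs, int flags, int nbatches)
{
	int written = 0;

	assert(nbatches >= 0);
	fs->startseg = fs->curseg;
	while (written < nbatches) {
		toy_write_batch(fs);
		written++;
		if ((flags & SEGM_SINGLE) && fs->curseg != fs->startseg)
			break;
	}
	return written;
}
```

With five batches queued, toy_segwrite() without flags writes all five;
with SEGM_SINGLE it returns after the first, once curseg has moved past
startseg.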

Finally, several changes to help avoid "no clean segments" panics:

* In lfs_bmapv, note when a vnode is loaded only to discover whether
  its blocks are live, so it can immediately be recycled.  Since the
  cleaner will try to choose ~empty segments over full ones, this
  prevents the cleaner from (1) filling the vnode cache with junk, and
  (2) forcing any pending writes out to disk and running the fs out
  of segments.

* Overestimate by half the amount of metadata that will be required
  to fill the clean segments.  This will make the disk appear smaller,
  but should help avoid a "no clean segments" panic.

* Rearrange lfs_writerd.  In particular, lfs_writerd now pays
  attention to the number of clean segments available, and holds off
  writing until there is room.
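
As a worked example of the "overestimate by half" change, the old and
new forms of LFS_EST_CMETA can be compared in userland C.  The
CM_MAG_NUM/CM_MAG_DEN values and the arithmetic follow the lfs.h hunk
in this change; struct toy_lfs, est_cmeta_old() and est_cmeta_new() are
hypothetical names, and the field types are an assumption made for the
sketch.

```c
#include <assert.h>
#include <stdint.h>

#define CM_MAG_NUM 3	/* from the lfs.h hunk in this change */
#define CM_MAG_DEN 2

/* Hypothetical subset of struct lfs: only the fields the macro reads. */
struct toy_lfs {
	int32_t dmeta;	/* metadata blocks in the dirty segments */
	int32_t nclean;	/* number of clean segments */
	int32_t nseg;	/* total number of segments */
};

/* Old estimate: (dmeta / # dirty segments) * (# clean segments). */
int32_t
est_cmeta_old(const struct toy_lfs *f)
{
	assert(f->nseg > f->nclean);	/* need at least one dirty segment */
	return (int32_t)((f->dmeta * (int64_t)f->nclean) /
	    (f->nseg - f->nclean));
}

/* New estimate: the same quantity scaled by CM_MAG_NUM / CM_MAG_DEN. */
int32_t
est_cmeta_new(const struct toy_lfs *f)
{
	assert(f->nseg > f->nclean);
	return (int32_t)((CM_MAG_NUM * (f->dmeta * (int64_t)f->nclean)) /
	    (CM_MAG_DEN * (f->nseg - f->nclean)));
}
```

With dmeta = 1000, nclean = 40 and nseg = 100, the old estimate is
40000 / 60 = 666 blocks and the new one is 120000 / 120 = 1000 blocks,
so LFS_EST_NONMETA shrinks by the difference.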

diffstat:

 sys/ufs/lfs/lfs.h           |   39 ++++--
 sys/ufs/lfs/lfs_bio.c       |   27 +++-
 sys/ufs/lfs/lfs_extern.h    |    6 +-
 sys/ufs/lfs/lfs_segment.c   |   76 ++++++++++--
 sys/ufs/lfs/lfs_subr.c      |   11 +-
 sys/ufs/lfs/lfs_syscalls.c  |   19 ++-
 sys/ufs/lfs/lfs_vfsops.c    |  245 +++++++++++++++++++++++++++--------------
 sys/ufs/lfs/lfs_vnops.c     |  258 ++++++++++++++++++++++++++++++++-----------
 sys/ufs/ufs/inode.h         |    4 +-
 sys/ufs/ufs/ufs_readwrite.c |    5 +-
 10 files changed, 483 insertions(+), 207 deletions(-)

diffs (truncated from 1604 to 300 lines):

diff -r f62c09f2f9d7 -r 4ba15dcbd102 sys/ufs/lfs/lfs.h
--- a/sys/ufs/lfs/lfs.h Mon Jan 02 22:02:51 2012 +0000
+++ b/sys/ufs/lfs/lfs.h Mon Jan 02 22:10:44 2012 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: lfs.h,v 1.134 2011/07/11 08:27:40 hannken Exp $        */
+/*     $NetBSD: lfs.h,v 1.135 2012/01/02 22:10:44 perseant Exp $       */
 
 /*-
  * Copyright (c) 1999, 2000, 2001, 2002, 2003 The NetBSD Foundation, Inc.
@@ -592,6 +592,7 @@
 #define        SS_CONT         0x02            /* more partials to finish this write*/
 #define        SS_CLEAN        0x04            /* written by the cleaner */
 #define        SS_RFW          0x08            /* written by the roll-forward agent */
+#define        SS_RECLAIM      0x10            /* written by the roll-forward agent */
        u_int16_t ss_flags;             /* 24: used for directory operations */
        u_int16_t ss_pad;               /* 26: extra space */
        /* FINFO's and inode daddr's... */
@@ -608,7 +609,8 @@
        u_int16_t ss_nfinfo;            /* 20: number of file info structures */
        u_int16_t ss_ninos;             /* 22: number of inodes in summary */
        u_int16_t ss_flags;             /* 24: used for directory operations */
-       u_int8_t  ss_pad[6];            /* 26: extra space */
+       u_int8_t  ss_pad[2];            /* 26: extra space */
+       u_int32_t ss_reclino;           /* 28: inode being reclaimed */
        u_int64_t ss_serial;            /* 32: serial number */
        u_int64_t ss_create;            /* 40: time stamp */
        /* FINFO's and inode daddr's... */
@@ -840,6 +842,8 @@
        int lfs_nowrap;                 /* Suspend log wrap */
        int lfs_wrappass;               /* Allow first log wrap requester to pass */
        int lfs_wrapstatus;             /* Wrap status */
+       int lfs_reclino;                /* Inode being reclaimed */
+       int lfs_startseg;               /* Segment we started writing at */
        LIST_HEAD(, segdelta) lfs_segdhd;       /* List of pending trunc accounting events */
 };
 
@@ -945,13 +949,15 @@
        u_int32_t seg_number;           /* number of this segment */
        int32_t *start_lbp;             /* beginning lbn for this set */
 
-#define        SEGM_CKP        0x01            /* doing a checkpoint */
-#define        SEGM_CLEAN      0x02            /* cleaner call; don't sort */
-#define        SEGM_SYNC       0x04            /* wait for segment */
-#define        SEGM_PROT       0x08            /* don't inactivate at segunlock */
-#define SEGM_PAGEDAEMON        0x10            /* pagedaemon called us */
-#define SEGM_WRITERD   0x20            /* LFS writed called us */
-#define SEGM_FORCE_CKP 0x40            /* Force checkpoint right away */
+#define SEGM_CKP       0x0001          /* doing a checkpoint */
+#define SEGM_CLEAN     0x0002          /* cleaner call; don't sort */
+#define SEGM_SYNC      0x0004          /* wait for segment */
+#define SEGM_PROT      0x0008          /* don't inactivate at segunlock */
+#define SEGM_PAGEDAEMON        0x0010          /* pagedaemon called us */
+#define SEGM_WRITERD   0x0020          /* LFS writed called us */
+#define SEGM_FORCE_CKP 0x0040          /* Force checkpoint right away */
+#define SEGM_RECLAIM   0x0080          /* Writing to reclaim vnode */
+#define SEGM_SINGLE    0x0100          /* Opportunistic writevnodes */
        u_int16_t seg_flags;            /* run-time flags for this segment */
        u_int32_t seg_iocount;          /* number of ios pending */
        int       ndupino;              /* number of duplicate inodes */
@@ -992,6 +998,7 @@
 #define LFSI_DELETED      0x02
 #define LFSI_WRAPBLOCK    0x04
 #define LFSI_WRAPWAIT     0x08
+#define LFSI_BMAP         0x10
        u_int32_t lfs_iflags;           /* Inode flags */
        daddr_t   lfs_hiblk;            /* Highest lbn held by inode */
 #ifdef _KERNEL
@@ -1017,10 +1024,16 @@
  * Macros for determining free space on the disk, with the variable metadata
  * of segment summaries and inode blocks taken into account.
  */
-/* Estimate number of clean blocks not available for writing */
-#define LFS_EST_CMETA(F) (int32_t)((((F)->lfs_dmeta *                       \
-                                    (int64_t)(F)->lfs_nclean) /             \
-                                     ((F)->lfs_nseg - (F)->lfs_nclean)))
+/*
+ * Estimate number of clean blocks not available for writing because
+ * they will contain metadata or overhead.  This is calculated as
+ * (dmeta / # dirty segments) * (# clean segments).
+ */
+#define CM_MAG_NUM 3
+#define CM_MAG_DEN 2
+#define LFS_EST_CMETA(F) (int32_t)((                                   \
+                                   (CM_MAG_NUM * ((F)->lfs_dmeta * (int64_t)(F)->lfs_nclean)) / \
+                                   (CM_MAG_DEN * ((F)->lfs_nseg - (F)->lfs_nclean))))
 
 /* Estimate total size of the disk not including metadata */
 #define LFS_EST_NONMETA(F) ((F)->lfs_dsize - (F)->lfs_dmeta - LFS_EST_CMETA(F))
diff -r f62c09f2f9d7 -r 4ba15dcbd102 sys/ufs/lfs/lfs_bio.c
--- a/sys/ufs/lfs/lfs_bio.c     Mon Jan 02 22:02:51 2012 +0000
+++ b/sys/ufs/lfs/lfs_bio.c     Mon Jan 02 22:10:44 2012 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: lfs_bio.c,v 1.120 2011/07/11 08:27:40 hannken Exp $    */
+/*     $NetBSD: lfs_bio.c,v 1.121 2012/01/02 22:10:44 perseant Exp $   */
 
 /*-
  * Copyright (c) 1999, 2000, 2001, 2002, 2003, 2008 The NetBSD Foundation, Inc.
@@ -60,7 +60,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: lfs_bio.c,v 1.120 2011/07/11 08:27:40 hannken Exp $");
+__KERNEL_RCSID(0, "$NetBSD: lfs_bio.c,v 1.121 2012/01/02 22:10:44 perseant Exp $");
 
 #include <sys/param.h>
 #include <sys/systm.h>
@@ -96,6 +96,7 @@
 int    lfs_fs_pagetrip      = 0;       /* # of pages to trip per-fs write */
 int    lfs_writing          = 0;       /* Set if already kicked off a writer
                                           because of buffer space */
+int    locked_queue_waiters = 0;       /* Number of processes waiting on lq */
 
 /* Lock and condition variables for above. */
 kcondvar_t     locked_queue_cv;
@@ -160,8 +161,12 @@
 
                lfs_flush(fs, 0, 0);
 
+               DLOG((DLOG_AVAIL, "lfs_reservebuf: waiting: count=%d, bytes=%ld\n",
+                     locked_queue_count, locked_queue_bytes));
+               ++locked_queue_waiters;
                error = cv_timedwait_sig(&locked_queue_cv, &lfs_lock,
                    hz * LFS_BUFWAIT);
+               --locked_queue_waiters;
                if (error && error != EWOULDBLOCK) {
                        mutex_exit(&lfs_lock);
                        return error;
@@ -171,8 +176,11 @@
        locked_queue_rcount += n;
        locked_queue_rbytes += bytes;
 
-       if (n < 0)
+       if (n < 0 && locked_queue_waiters > 0) {
+               DLOG((DLOG_AVAIL, "lfs_reservebuf: broadcast: count=%d, bytes=%ld\n",
+                     locked_queue_count, locked_queue_bytes));
                cv_broadcast(&locked_queue_cv);
+       }
 
        mutex_exit(&lfs_lock);
 
@@ -461,7 +469,7 @@
         */
        if (fs->lfs_ronly || (fs->lfs_pflags & LFS_PF_CLEAN)) {
                bp->b_oflags &= ~BO_DELWRI;
-               bp->b_flags |= B_READ;
+               bp->b_flags |= B_READ; /* XXX is this right? --ks */
                bp->b_error = 0;
                mutex_enter(&bufcache_lock);
                LFS_UNLOCK_BUF(bp);
@@ -535,6 +543,7 @@
        if (lfs_dostats)
                ++lfs_stats.flush_invoked;
 
+       fs->lfs_pdflush = 0;
        mutex_exit(&lfs_lock);
        lfs_writer_enter(fs, "fldirop");
        lfs_segwrite(fs->lfs_ivnode->v_mount, flags);
@@ -689,10 +698,10 @@
        /* If there are too many pending dirops, we have to flush them. */
        if (fs->lfs_dirvcount > LFS_MAX_FSDIROP(fs) ||
            lfs_dirvcount > LFS_MAX_DIROP || fs->lfs_diropwait > 0) {
-               flags |= SEGM_CKP;
-       }
-
-       if (locked_queue_count + INOCOUNT(fs) > LFS_MAX_BUFS ||
+               mutex_exit(&lfs_lock);
+               lfs_flush_dirops(fs);
+               mutex_enter(&lfs_lock);
+       } else if (locked_queue_count + INOCOUNT(fs) > LFS_MAX_BUFS ||
            locked_queue_bytes + INOBYTES(fs) > LFS_MAX_BYTES ||
            lfs_subsys_pages > LFS_MAX_PAGES ||
            fs->lfs_dirvcount > LFS_MAX_FSDIROP(fs) ||
@@ -717,8 +726,10 @@
                        ++lfs_stats.wait_exceeded;
                DLOG((DLOG_AVAIL, "lfs_check: waiting: count=%d, bytes=%ld\n",
                      locked_queue_count, locked_queue_bytes));
+               ++locked_queue_waiters;
                error = cv_timedwait_sig(&locked_queue_cv, &lfs_lock,
                    hz * LFS_BUFWAIT);
+               --locked_queue_waiters;
                if (error != EWOULDBLOCK)
                        break;
 
diff -r f62c09f2f9d7 -r 4ba15dcbd102 sys/ufs/lfs/lfs_extern.h
--- a/sys/ufs/lfs/lfs_extern.h  Mon Jan 02 22:02:51 2012 +0000
+++ b/sys/ufs/lfs/lfs_extern.h  Mon Jan 02 22:10:44 2012 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: lfs_extern.h,v 1.96 2008/06/28 01:34:05 rumble Exp $   */
+/*     $NetBSD: lfs_extern.h,v 1.97 2012/01/02 22:10:44 perseant Exp $ */
 
 /*-
  * Copyright (c) 1999, 2000, 2001, 2002, 2003 The NetBSD Foundation, Inc.
@@ -240,8 +240,8 @@
 void lfs_gop_size(struct vnode *, off_t, off_t *, int);
 int lfs_putpages_ext(void *, int);
 int lfs_gatherpages(struct vnode *);
-void lfs_flush_dirops(struct lfs *);
-void lfs_flush_pchain(struct lfs *);
+int lfs_flush_dirops(struct lfs *);
+int lfs_flush_pchain(struct lfs *);
 
 int lfs_bwrite  (void *);
 int lfs_fsync   (void *);
diff -r f62c09f2f9d7 -r 4ba15dcbd102 sys/ufs/lfs/lfs_segment.c
--- a/sys/ufs/lfs/lfs_segment.c Mon Jan 02 22:02:51 2012 +0000
+++ b/sys/ufs/lfs/lfs_segment.c Mon Jan 02 22:10:44 2012 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: lfs_segment.c,v 1.222 2011/07/11 08:27:40 hannken Exp $        */
+/*     $NetBSD: lfs_segment.c,v 1.223 2012/01/02 22:10:44 perseant Exp $       */
 
 /*-
  * Copyright (c) 1999, 2000, 2001, 2002, 2003 The NetBSD Foundation, Inc.
@@ -60,7 +60,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: lfs_segment.c,v 1.222 2011/07/11 08:27:40 hannken Exp $");
+__KERNEL_RCSID(0, "$NetBSD: lfs_segment.c,v 1.223 2012/01/02 22:10:44 perseant Exp $");
 
 #ifdef DEBUG
 # define vndebug(vp, str) do {                                         \
@@ -202,6 +202,9 @@
        relock = 0;
 
     top:
+       KASSERT(mutex_owned(vp->v_interlock) == false);
+       KASSERT(mutex_owned(&lfs_lock) == false);
+       KASSERT(mutex_owned(&bufcache_lock) == false);
        ASSERT_NO_SEGLOCK(fs);
        if (ip->i_flag & IN_CLEANING) {
                ivndebug(vp,"vflush/in_cleaning");
@@ -280,7 +283,10 @@
        mutex_exit(vp->v_interlock);
 
        /* Protect against VI_XLOCK deadlock in vinvalbuf() */
-       lfs_seglock(fs, SEGM_SYNC);
+       lfs_seglock(fs, SEGM_SYNC | ((vp->v_iflag & VI_XLOCK) ? SEGM_RECLAIM : 0));
+       if (vp->v_iflag & VI_XLOCK) {
+               fs->lfs_reclino = ip->i_number;
+       }
 
        /* If we're supposed to flush a freed inode, just toss it */
        if (ip->i_lfs_iflags & LFSI_DELETED) {
@@ -380,11 +386,12 @@
                do {
                        if (LIST_FIRST(&vp->v_dirtyblkhd) != NULL) {
                                relock = lfs_writefile(fs, sp, vp);
-                               if (relock) {
+                               if (relock && vp != fs->lfs_ivnode) {
                                        /*
                                         * Might have to wait for the
                                         * cleaner to run; but we're
                                         * still not done with this vnode.
+                                        * XXX we can do better than this.
                                         */
                                        KDASSERT(ip->i_number != LFS_IFILE_INUM);
                                        lfs_writeinode(fs, sp, ip);
@@ -486,9 +493,16 @@
                         * After this, pages might be busy
                         * due to our own previous putpages.
                         * Start actual segment write here to avoid deadlock.
+                        * If we were just writing one segment and we've done
+                        * that, break out.
                         */
                        mutex_exit(&mntvnode_lock);
-                       (void)lfs_writeseg(fs, sp);
+                       if (lfs_writeseg(fs, sp) &&
+                           (sp->seg_flags & SEGM_SINGLE) &&
+                           fs->lfs_curseg != fs->lfs_startseg) {
+                               DLOG((DLOG_VNODE, "lfs_writevnodes: breaking out of segment write at daddr 0x%x\n", fs->lfs_offset));
+                               break;
+                       }
                        goto loop;
                }
 
@@ -626,6 +640,10 @@
         */
        do_ckp = LFS_SHOULD_CHECKPOINT(fs, flags);
 
+       /* We can't do a partial write and checkpoint at the same time. */
+       if (do_ckp)
+               flags &= ~SEGM_SINGLE;
+
        lfs_seglock(fs, flags | (do_ckp ? SEGM_CKP : 0));
        sp = fs->lfs_sp;
        if (sp->seg_flags & (SEGM_CLEAN | SEGM_CKP))
@@ -645,6 +663,11 @@
        else if (!(sp->seg_flags & SEGM_FORCE_CKP)) {
                do {
                        um_error = lfs_writevnodes(fs, mp, sp, VN_REG);
+                       if ((sp->seg_flags & SEGM_SINGLE) &&
+                           fs->lfs_curseg != fs->lfs_startseg) {
+                               DLOG((DLOG_SEG, "lfs_segwrite: breaking out of segment write at daddr 0x%x\n", fs->lfs_offset));
+                               break;
+                       }
 
                        if (do_ckp || fs->lfs_dirops == 0) {
                                if (!writer_set) {
@@ -1025,6 +1048,7 @@
 {


