NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: kern/47739: tmpfs panic: kernel diagnostic assertion "(node)->tn_spec.tn_dir.tn_readdir_lastp == NULL..."



On 9 Oct, 2013, at 16:20 , David Laight <david%l8s.co.uk@localhost> wrote:
> The following reply was made to PR kern/47739; it has been noted by GNATS.
> 
> From: David Laight <david%l8s.co.uk@localhost>
> To: David Holland <dholland-bugs%netbsd.org@localhost>
> Cc: gnats-bugs%NetBSD.org@localhost
> Subject: Re: kern/47739: tmpfs panic: kernel diagnostic assertion 
> "(node)->tn_spec.tn_dir.tn_readdir_lastp == NULL..."
> Date: Wed, 9 Oct 2013 21:27:38 +0100
> 
> On Mon, Oct 07, 2013 at 07:42:23AM +0000, David Holland wrote:
>> On Thu, Aug 08, 2013 at 08:30:01PM +0000, Mindaugas Rasiukevicius wrote:
>>>> Can someone please take a look?
>> 
>> almost certainly the same as 47480 (and 41068)
>> 
>>>>  Thomas
>>> 
>>> Most likely this is due to tmpfs_dircookie() truncation here:
>>> 
>>> http://nxr.netbsd.org/source/xref/src/sys/fs/tmpfs/tmpfs.h#88
>>> 
>>> It is wrong and there are other PRs because of it.  Thomas, you can test
>>> this by replacing the function body with the following:
>>> 
>>>     return (off_t)(uintptr_t)de;
>>> 
>>> This breaks linux32 compat, but we really just need to decide how we want
>>> to fix it (it would be good to avoid penalising the native code, but better
>>> to penalise than fail).
>> 
>> Four and a half years ago (in PR 41068) I asked why tmpfs does this
>> nonsense instead of just assigning sequence numbers to each node.
>> Nobody has ever managed to come up with a coherent justification, just
>> FUD.
> 
> If it assigned sequence numbers it would have to check for already used
> values once it had created 2^32 entries (assuming it needs to generate
> 32bit offsets).
> 
> Something based on the algorithm used to look up process ids might be
> better.

Assuming my mail client doesn't ruin it I've attached a patch which adds
a sequence number to each tmpfs directory entry.  The sequence numbers
are kept sorted in the directory entry list TAILQ order because it doesn't
cost anything much to do that in the current code and having a file offset with
actual ordering semantics fixes some things that the current code can do wrong
when directories are being read, like the EINVAL error that getdents(2) says
only NFS file systems are supposed to return but which appears in the code
here too.

The sequence algorithm is minimally simple.  It tracks the last entry added
and attempts to add the next entry after it with a sequence number 1 higher,
incrementing past anything already using the sequence number until it finds a
free spot and wrapping to zero when it gets to the end of the sequence space.
If the last entry added is removed it backs up and reuses the numbers right
away.  The first time through the sequence space it never has to skip anything,
but you are right that subsequent passes through it cost more.  On the other
hand, the total additional cost for a subsequent full pass through the sequence
space is about the same as a single, unsuccessful name search in the directory
with the O(n) search algorithm it uses now.  If it does a name search like that
before adding a new directory entry (I'm not positive it does since I
don't understand the name caching, but caches often can't help with names
that don't exist) the amortised necessary cost increase would be in the
fractional parts per billion.  The absolute amount of work this involves
is tiny if the number of directory entries is small compared to the
sequence space.  My guess is that the sequence numbers may not matter in
this case, if there are performance issues the O(n) name search is the
long pole in the tent.

It does read sequence numbers out of the bracketing directory entries during
name adds when it could instead easily cache them in the directory inode.  I
didn't do that because I didn't want to increase the inode size, and I suspect
the directory entries are being read by a name search anyway so the data it
needs is likely to be in the processor cache already.  If that isn't true this
could be easily fixed.

This was done after I got a crash like this PR while trying to see if I
could speed up builds this way.  I ran it for a while with TMPFS_DIRCOOKIE_EOF
set much smaller, it is actually quite difficult to get the 2^31
sequence space to roll over under normal use (it takes a long time for
a C program explicitly designed to make it roll over to accomplish that).
I didn't like the memory address cookie thing because it was too easy to
think up things it could do that are wrong, even if they probably wouldn't
happen.  The code below may not be bug free, but it does the same few
things over and over without the possibility of exceptions that are out
of your control so it might be possible to get the bugs out of it.

Dennis Ferguson


Index: sys/fs/tmpfs/tmpfs.h
===================================================================
RCS file: /cvsroot/src/sys/fs/tmpfs/tmpfs.h,v
retrieving revision 1.45
diff -u -r1.45 tmpfs.h
--- sys/fs/tmpfs/tmpfs.h        27 Sep 2011 01:10:43 -0000      1.45
+++ sys/fs/tmpfs/tmpfs.h        11 Oct 2013 18:37:27 -0000
@@ -44,6 +44,11 @@
 #include <sys/vnode.h>
 
 /*
+ * Type of a directory entry sequence number.  See TMPFS_DIRCOOKIE_EOF.
+ */
+typedef        uint32_t        tmpfs_seq_t;
+
+/*
  * Internal representation of a tmpfs directory entry.
  *
  * All fields are protected by vnode lock.
@@ -54,9 +59,12 @@
        /* Pointer to the inode this entry refers to. */
        struct tmpfs_node *             td_node;
 
+       /* Sequence in directory */
+       tmpfs_seq_t                     td_seq;
+
        /* Name and its length. */
-       char *                          td_name;
        uint16_t                        td_namelen;
+       char *                          td_name;
 } tmpfs_dirent_t;
 
 TAILQ_HEAD(tmpfs_dir, tmpfs_dirent);
@@ -67,36 +75,29 @@
 /* Validate maximum td_namelen length. */
 CTASSERT(TMPFS_MAXNAMLEN < UINT16_MAX);
 
-#define        TMPFS_DIRCOOKIE_DOT     0
-#define        TMPFS_DIRCOOKIE_DOTDOT  1
-#define        TMPFS_DIRCOOKIE_EOF     2
-
 /*
- * Each entry in a directory has a cookie that identifies it.  Cookies
- * supersede offsets within directories, as tmpfs has no offsets as such.
+ * Each entry in a directory has a sequence number that identifies it.
+ * The directory entries are kept sorted into ascending sequence order.
+ * This acts as a stand-in for offset; it provides ordering for sequential
+ * directory reads.
  *
- * The '.', '..' and the end of directory markers have fixed cookies,
- * which cannot collide with the cookies generated by other entries.
+ * The '.' and '..' entries have fixed sequence numbers (and no actual
+ * directory entries).  Directory entries have a sequence number greater
+ * than those but less than TMPFS_DIRCOOKIE_EOF.  The latter is set
+ * to 2^31-1 to avoid Linux compat problems, see PR32034, and the type
+ * of tmpfs_seq_t is set to uint32_t to match.  For no Linux compatibility
+ * and huge directories make tmpfs_seq_t an off_t and _EOF a much larger
+ * number.
  *
- * The cookies for the other entries are generated based on the memory
- * address of their representative meta-data structure.
- *
- * XXX: Truncating directory cookies to 31 bits now - workaround for
- * problem with Linux compat, see PR/32034.
+ * We call the sequence numbers "cookies" since the old code used the name
+ * and they are used to fill in the cookie fields for NFS directory reads.
  */
-static inline off_t
-tmpfs_dircookie(tmpfs_dirent_t *de)
-{
-       off_t cookie;
-
-       cookie = ((off_t)(uintptr_t)de >> 1) & 0x7FFFFFFF;
-       KASSERT(cookie != TMPFS_DIRCOOKIE_DOT);
-       KASSERT(cookie != TMPFS_DIRCOOKIE_DOTDOT);
-       KASSERT(cookie != TMPFS_DIRCOOKIE_EOF);
+#define        TMPFS_DIRCOOKIE_DOT     0
+#define        TMPFS_DIRCOOKIE_DOTDOT  1
+#define        TMPFS_DIRCOOKIE_MIN     2               /* min td_seq */
+#define        TMPFS_DIRCOOKIE_EOF     0x7fffffff      /* max td_seq */
 
-       return cookie;
-}
-#endif
+#endif /* defined(_KERNEL) */
 
 /*
  * Internal representation of a tmpfs file system node -- inode.
@@ -169,12 +170,14 @@
                        /* List of directory entries. */
                        struct tmpfs_dir        tn_dir;
 
+                       /* Pointer to insertion point for new entries */
+                       struct tmpfs_dirent *   tn_insert;
+
                        /*
-                        * Number and pointer of the last directory entry
+                        * Pointer to the last directory entry
                         * returned by the readdir(3) operation.
                         */
-                       off_t                   tn_readdir_lastn;
-                       struct tmpfs_dirent *   tn_readdir_lastp;
+                       struct tmpfs_dirent *   tn_readdir_last;
                } tn_dir;
 
                /* Type case: VLNK. */
@@ -278,7 +281,7 @@
 
 int            tmpfs_dir_getdotdent(tmpfs_node_t *, struct uio *);
 int            tmpfs_dir_getdotdotdent(tmpfs_node_t *, struct uio *);
-tmpfs_dirent_t *tmpfs_dir_lookupbycookie(tmpfs_node_t *, off_t);
+tmpfs_dirent_t *tmpfs_dir_getnext(tmpfs_node_t *, off_t);
 int            tmpfs_dir_getdents(tmpfs_node_t *, struct uio *, off_t *);
 
 int            tmpfs_reg_resize(vnode_t *, off_t);
@@ -324,9 +327,6 @@
 #define TMPFS_VALIDATE_DIR(node) \
     KASSERT((node)->tn_type == VDIR); \
     KASSERT((node)->tn_size % sizeof(tmpfs_dirent_t) == 0); \
-    KASSERT((node)->tn_spec.tn_dir.tn_readdir_lastp == NULL || \
-        tmpfs_dircookie((node)->tn_spec.tn_dir.tn_readdir_lastp) == \
-        (node)->tn_spec.tn_dir.tn_readdir_lastn);
 
 /*
  * Memory management stuff.
Index: sys/fs/tmpfs/tmpfs_subr.c
===================================================================
RCS file: /cvsroot/src/sys/fs/tmpfs/tmpfs_subr.c,v
retrieving revision 1.80
diff -u -r1.80 tmpfs_subr.c
--- sys/fs/tmpfs/tmpfs_subr.c   4 Oct 2013 15:14:11 -0000       1.80
+++ sys/fs/tmpfs/tmpfs_subr.c   11 Oct 2013 18:37:27 -0000
@@ -155,8 +155,8 @@
                /* Directory. */
                TAILQ_INIT(&nnode->tn_spec.tn_dir.tn_dir);
                nnode->tn_spec.tn_dir.tn_parent = NULL;
-               nnode->tn_spec.tn_dir.tn_readdir_lastn = 0;
-               nnode->tn_spec.tn_dir.tn_readdir_lastp = NULL;
+               nnode->tn_spec.tn_dir.tn_insert = NULL;
+               nnode->tn_spec.tn_dir.tn_readdir_last = NULL;
 
                /* Extra link count for the virtual '.' entry. */
                nnode->tn_links++;
@@ -439,6 +439,53 @@
 }
 
 /*
+ * tmpfs_dir_newinsert: find a spot in the directory where
+ * there is free sequence number space so that nodes can be
+ * inserted.
+ *
+ * => Scan the directory entries forward from the current
+ *    insert point looking for an unused sequence number.
+ * => Return the predecessor node to the unused spot.  This
+ *    might be NULL if TMPFS_DIRCOOKIE_MIN is not in use.
+ * => Stop searching as soon as we find even a single empty
+ *    spot.  If our sequence space is size S and the number
+ *    of directory entries is N then the cost of doing (S - N)
+ *    tmpfs_dir_attach() operations will be scanning N directory
+ *    entries (or about the same as two name lookups given the
+ *    O(N) algorithm used here).
+ */
+static tmpfs_dirent_t *
+tmpfs_dir_newinsert(tmpfs_node_t *node)
+{
+       tmpfs_dirent_t *de, *de_next, *de_first;
+       tmpfs_seq_t seq, seq_next;
+
+       /* If we're called there's no space here, so seq 1 past first */
+       de_first = node->tn_spec.tn_dir.tn_insert;
+       seq_next = de_first->td_seq + 1;
+       de_next = TAILQ_NEXT(de_first, td_entries);
+
+       do {
+               KASSERT(de_next != de_first);   /* all seqno's used?!? */
+               de = de_next;
+               if (de == NULL) {
+                       seq = TMPFS_DIRCOOKIE_MIN + 1;
+                       de_next = TAILQ_FIRST(&node->tn_spec.tn_dir.tn_dir);
+               } else {
+                       seq = seq_next;
+                       de_next = TAILQ_NEXT(de, td_entries);
+               }
+               if (de_next == NULL) {
+                       seq_next = TMPFS_DIRCOOKIE_EOF - 1;
+               } else {
+                       seq_next = de_next->td_seq;
+               }
+       } while (seq >= seq_next);
+
+       return de;
+}
+
+/*
  * tmpfs_dir_attach: associate directory entry with a specified inode,
  * and attach the entry into the directory, specified by vnode.
  *
@@ -451,6 +498,8 @@
 tmpfs_dir_attach(vnode_t *dvp, tmpfs_dirent_t *de, tmpfs_node_t *node)
 {
        tmpfs_node_t *dnode = VP_TO_TMPFS_DIR(dvp);
+       tmpfs_dirent_t *de_prev, *de_next;
+       tmpfs_seq_t new_seq;
        int events = NOTE_WRITE;
 
        KASSERT(VOP_ISLOCKED(dvp));
@@ -465,8 +514,35 @@
                node->tn_dirent_hint = de;
        }
 
-       /* Insert the entry to the directory (parent of inode). */
-       TAILQ_INSERT_TAIL(&dnode->tn_spec.tn_dir.tn_dir, de, td_entries);
+       /* Find the insertion point.  Make sure we have a sequence space. */
+       de_prev = dnode->tn_spec.tn_dir.tn_insert;
+       if (de_prev == NULL) {
+               new_seq = TMPFS_DIRCOOKIE_MIN;
+       } else {
+               new_seq = de_prev->td_seq + 1;
+               de_next = TAILQ_NEXT(de_prev, td_entries);
+               if (new_seq >= TMPFS_DIRCOOKIE_EOF ||
+                   (de_next != NULL && new_seq >= de_next->td_seq)) {
+                       de_prev = tmpfs_dir_newinsert(dnode);
+                       if (de_prev == NULL) {
+                               new_seq = TMPFS_DIRCOOKIE_MIN;
+                       } else {
+                               new_seq = de_prev->td_seq + 1;
+                       }
+               }
+       }
+
+       /* Insert the entry into directory after de_prev (or at head) */
+       de->td_seq = new_seq;
+       if (de_prev) {
+               TAILQ_INSERT_AFTER(&dnode->tn_spec.tn_dir.tn_dir,
+                                  de_prev, de, td_entries);
+       } else {
+               TAILQ_INSERT_HEAD(&dnode->tn_spec.tn_dir.tn_dir,
+                                 de, td_entries);
+       }
+       dnode->tn_spec.tn_dir.tn_insert = de;
+
        dnode->tn_size += sizeof(tmpfs_dirent_t);
        dnode->tn_status |= TMPFS_NODE_STATUSALL;
        uvm_vnp_setsize(dvp, dnode->tn_size);
@@ -528,9 +604,12 @@
        }
 
        /* Remove the entry from the directory. */
-       if (dnode->tn_spec.tn_dir.tn_readdir_lastp == de) {
-               dnode->tn_spec.tn_dir.tn_readdir_lastn = 0;
-               dnode->tn_spec.tn_dir.tn_readdir_lastp = NULL;
+       if (dnode->tn_spec.tn_dir.tn_readdir_last == de) {
+               dnode->tn_spec.tn_dir.tn_readdir_last = NULL;
+       }
+       if (dnode->tn_spec.tn_dir.tn_insert == de) {
+               dnode->tn_spec.tn_dir.tn_insert =
+                   TAILQ_PREV(de, tmpfs_dir, td_entries);
        }
        TAILQ_REMOVE(&dnode->tn_spec.tn_dir.tn_dir, de, td_entries);
 
@@ -620,7 +699,7 @@
        else {
                error = uiomove(dentp, dentp->d_reclen, uio);
                if (error == 0)
-                       uio->uio_offset = TMPFS_DIRCOOKIE_DOTDOT;
+                       uio->uio_offset = TMPFS_DIRCOOKIE_DOT + 1;
        }
        node->tn_status |= TMPFS_NODE_ACCESSED;
        kmem_free(dentp, sizeof(struct dirent));
@@ -654,13 +733,7 @@
        else {
                error = uiomove(dentp, dentp->d_reclen, uio);
                if (error == 0) {
-                       tmpfs_dirent_t *de;
-
-                       de = TAILQ_FIRST(&node->tn_spec.tn_dir.tn_dir);
-                       if (de == NULL)
-                               uio->uio_offset = TMPFS_DIRCOOKIE_EOF;
-                       else
-                               uio->uio_offset = tmpfs_dircookie(de);
+                       uio->uio_offset = TMPFS_DIRCOOKIE_DOTDOT + 1;
                }
        }
        node->tn_status |= TMPFS_NODE_ACCESSED;
@@ -669,23 +742,31 @@
 }
 
 /*
- * tmpfs_dir_lookupbycookie: lookup a directory entry by associated cookie.
+ * tmpfs_dir_getnext: find an entry with a sequence >= cookie
  */
 tmpfs_dirent_t *
-tmpfs_dir_lookupbycookie(tmpfs_node_t *node, off_t cookie)
+tmpfs_dir_getnext(tmpfs_node_t *node, off_t cookie)
 {
        tmpfs_dirent_t *de;
+       tmpfs_seq_t next_seq;
 
        KASSERT(VOP_ISLOCKED(node->tn_vnode));
 
-       if (cookie == node->tn_spec.tn_dir.tn_readdir_lastn &&
-           node->tn_spec.tn_dir.tn_readdir_lastp != NULL) {
-               return node->tn_spec.tn_dir.tn_readdir_lastp;
+       if (cookie >= TMPFS_DIRCOOKIE_EOF || cookie < TMPFS_DIRCOOKIE_DOT) {
+               return NULL;
        }
-       TAILQ_FOREACH(de, &node->tn_spec.tn_dir.tn_dir, td_entries) {
-               if (tmpfs_dircookie(de) == cookie) {
+       next_seq = (tmpfs_seq_t) cookie;
+
+       de = node->tn_spec.tn_dir.tn_readdir_last;
+       if (de == NULL || de->td_seq > next_seq) {
+               de = TAILQ_FIRST(&node->tn_spec.tn_dir.tn_dir);
+       }
+
+       while (de) {
+               if (de->td_seq >= next_seq) {
                        break;
                }
+               de = TAILQ_NEXT(de, td_entries);
        }
        return de;
 }
@@ -699,7 +780,7 @@
 int
 tmpfs_dir_getdents(tmpfs_node_t *node, struct uio *uio, off_t *cntp)
 {
-       tmpfs_dirent_t *de;
+       tmpfs_dirent_t *de, *last_de;
        struct dirent *dentp;
        off_t startcookie;
        int error;
@@ -715,17 +796,14 @@
        startcookie = uio->uio_offset;
        KASSERT(startcookie != TMPFS_DIRCOOKIE_DOT);
        KASSERT(startcookie != TMPFS_DIRCOOKIE_DOTDOT);
-       if (startcookie == TMPFS_DIRCOOKIE_EOF) {
-               return 0;
-       } else {
-               de = tmpfs_dir_lookupbycookie(node, startcookie);
-       }
+       de = tmpfs_dir_getnext(node, startcookie);
        if (de == NULL) {
-               return EINVAL;
+               return 0;
        }
+       last_de = NULL;         /* track last entry written */
 
        /*
-        * Read as much entries as possible; i.e., until we reach the end
+        * Read as many entries as possible; i.e., until we reach the end
         * of the directory or we exhaust uio space.
         */
        dentp = kmem_alloc(sizeof(struct dirent), KM_SLEEP);
@@ -783,20 +861,26 @@
                 * advance pointers.
                 */
                error = uiomove(dentp, dentp->d_reclen, uio);
+               if (error != 0) {
+                       break;
+               }
 
+               /*
+                * At this point we have successfully written de.  Keep
+                * track the last successfully written entry in last_de.
+                */
                (*cntp)++;
-               de = TAILQ_NEXT(de, td_entries);
-       } while (error == 0 && uio->uio_resid > 0 && de != NULL);
+               last_de = de;
+       } while (uio->uio_resid > 0 &&
+                (de = TAILQ_NEXT(de, td_entries)) != NULL);
 
        /* Update the offset and cache. */
        if (de == NULL) {
                uio->uio_offset = TMPFS_DIRCOOKIE_EOF;
-               node->tn_spec.tn_dir.tn_readdir_lastn = 0;
-               node->tn_spec.tn_dir.tn_readdir_lastp = NULL;
-       } else {
-               node->tn_spec.tn_dir.tn_readdir_lastn = uio->uio_offset =
-                   tmpfs_dircookie(de);
-               node->tn_spec.tn_dir.tn_readdir_lastp = de;
+               node->tn_spec.tn_dir.tn_readdir_last = NULL;
+       } else if (last_de) {
+               uio->uio_offset = last_de->td_seq + 1;
+               node->tn_spec.tn_dir.tn_readdir_last = last_de;
        }
        node->tn_status |= TMPFS_NODE_ACCESSED;
        kmem_free(dentp, sizeof(struct dirent));
Index: sys/fs/tmpfs/tmpfs_vnops.c
===================================================================
RCS file: /cvsroot/src/sys/fs/tmpfs/tmpfs_vnops.c,v
retrieving revision 1.103
diff -u -r1.103 tmpfs_vnops.c
--- sys/fs/tmpfs/tmpfs_vnops.c  4 Oct 2013 15:14:11 -0000       1.103
+++ sys/fs/tmpfs/tmpfs_vnops.c  11 Oct 2013 18:37:27 -0000
@@ -995,21 +995,21 @@
        *ncookies = cnt;
 
        for (i = 0; i < cnt; i++) {
-               KASSERT(off != TMPFS_DIRCOOKIE_EOF);
+               KASSERT(off < uio->uio_offset);
                if (off != TMPFS_DIRCOOKIE_DOT) {
                        if (off == TMPFS_DIRCOOKIE_DOTDOT) {
                                de = TAILQ_FIRST(&node->tn_spec.tn_dir.tn_dir);
                        } else if (de != NULL) {
                                de = TAILQ_NEXT(de, td_entries);
                        } else {
-                               de = tmpfs_dir_lookupbycookie(node, off);
+                               de = tmpfs_dir_getnext(node, off);
                                KASSERT(de != NULL);
                                de = TAILQ_NEXT(de, td_entries);
                        }
                        if (de == NULL) {
-                               off = TMPFS_DIRCOOKIE_EOF;
+                               off = uio->uio_offset;
                        } else {
-                               off = tmpfs_dircookie(de);
+                               off = de->td_seq;
                        }
                } else {
                        off = TMPFS_DIRCOOKIE_DOTDOT;



Home | Main Index | Thread Index | Old Index