Subject: tmpfs: Sharing pools across filesystems or not?
To: None <tech-kern@netbsd.org>
From: Julio M. Merino Vidal <jmmv84@gmail.com>
List: tech-kern
Date: 08/16/2005 17:52:07
Hi all,

up until now, tmpfs has been using two pools to allocate file-system
metadata: one for nodes and one for directory entries.  These pools
are shared among all mounted file-systems, because I saw no reason
not to do so.  But now I'm starting to have doubts about this
approach; in fact, I tend to think that giving each file-system its
own pair of pools is better.

I'm writing this mail because, during the initial stages of tmpfs
development, other developers suggested that I share these resources
among multiple instances of tmpfs.  I want to make sure I'm not
overlooking anything, given that they probably know better than I do
what goes on behind pools.

The disadvantage I can see in keeping the pools separate is that we
need to store two pools in each file-system's mount structure.
However, I don't think this is serious, because the cost of doing so
is small.  There may be other disadvantages, but I'm not aware of
them; if so, please share.  A minimal sketch of how this could look
follows:
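(The structure and field names below are invented for the example,
not taken from the current tmpfs code, and the pool_init() calls
follow the pool(9) interface as I understand it.)

/* A pool plus a back pointer to the mount that owns it (illustrative). */
struct tmpfs_pool {
	struct pool		tp_pool;	/* must come first */
	struct tmpfs_mount	*tp_mount;	/* owning mount point */
};

/* Hypothetical per-mount data; field names are made up. */
struct tmpfs_mount {
	/* ... size limits, node counts and other mount-wide data ... */

	struct tmpfs_pool	tm_node_pool;	/* backs struct tmpfs_node */
	struct tmpfs_pool	tm_dirent_pool;	/* backs struct tmpfs_dirent */

	size_t			tm_pages_used;	/* pages really held */
	size_t			tm_pages_max;	/* limit given at mount time */
};

/* Called at mount time: each file-system gets its own pools. */
static void
tmpfs_pools_init(struct tmpfs_mount *tmp)
{
	tmp->tm_node_pool.tp_mount = tmp;
	tmp->tm_dirent_pool.tp_mount = tmp;
	pool_init(&tmp->tm_node_pool.tp_pool, sizeof(struct tmpfs_node),
	    0, 0, 0, "tmfsnode", &tmpfs_pool_allocator);
	pool_init(&tmp->tm_dirent_pool.tp_pool, sizeof(struct tmpfs_dirent),
	    0, 0, 0, "tmfsdire", &tmpfs_pool_allocator);
}

(tmpfs_pool_allocator is the custom page-level allocator[1] sketched
after the list of advantages below.)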

And now come the advantages that make me think that keeping them
separate is better.  For the calculations in the examples below, I'll
use the node pool and assume that each memory page can hold at most
four entries.

- It becomes easy to control the real memory usage of each
  file-system by counting the pages it allocates and frees.  This can
  be done accurately from within the functions of the custom
  allocator[1]; see the sketch after this list of advantages.

  At the moment, the file-system counts its space usage by adding or
  subtracting the amount of memory taken by each node, not by the
  number of pages actually allocated.  This is dangerous, because it
  means tmpfs cannot correctly enforce the size limits of the mount
  point.

  Suppose you have two file-systems mounted.  After the creation of
  several nodes, you could end up with every page holding one node
  from the first file-system and three nodes from the second one.  At
  this point, each file-system reports a memory usage that is
  consistent with what has been allocated: the second file-system
  accounts for three times the space of the first one, and the sum of
  the two, rounded up to a page boundary, matches the number of pages
  really allocated.

  Now, you remove all the nodes from the second file-system, and it
  will report a memory usage of 0 bytes.  The first one, instead,
  still accounts only for the node size * the number of its nodes,
  even though each of those nodes keeps a whole page allocated.  Why
  is this wrong?  Because 75% of the space taken by the allocated
  memory pages is not accounted to either of the two file-systems.
  When the file-system has to decide whether it has enough free space
  according to its limits, it can make the wrong decision and use
  more memory than the user initially expected.

- The entries of a single file-system are kept together in memory
  (i.e., a given page only holds items created by that file-system).

  Continuing the previous example, a file-system could be using only
  25% of each allocated page.  If we need to access several of its
  nodes at once (something that happens frequently, especially for
  directory entries), we will need, in the worst case, to touch four
  times the number of pages we'd need if all the entries were packed
  together.  This causes more page faults and cache misses, thus
  reducing performance.
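To make the first advantage more concrete, here is the page-level
allocator sketch referred to above, building on the structures shown
earlier.  The hook signatures follow my reading of pool(9), and I'm
deliberately leaving out how a page is actually obtained and released,
so take this as a rough illustration rather than working code:

static void *
tmpfs_pool_page_alloc(struct pool *pp, int flags)
{
	/* tp_pool is the first member, so this cast recovers the wrapper. */
	struct tmpfs_pool *tpp = (struct tmpfs_pool *)pp;
	struct tmpfs_mount *tmp = tpp->tp_mount;
	void *page;

	/* Enforce the mount's limit on real (page-level) memory usage. */
	if (tmp->tm_pages_used >= tmp->tm_pages_max)
		return NULL;

	page = NULL;	/* placeholder: obtain one page of kernel memory */
	if (page != NULL)
		tmp->tm_pages_used++;
	return page;
}

static void
tmpfs_pool_page_free(struct pool *pp, void *page)
{
	struct tmpfs_pool *tpp = (struct tmpfs_pool *)pp;

	/* placeholder: release the page */
	tpp->tp_mount->tm_pages_used--;
}

static struct pool_allocator tmpfs_pool_allocator = {
	.pa_alloc = tmpfs_pool_page_alloc,
	.pa_free = tmpfs_pool_page_free,
	.pa_pagesz = PAGE_SIZE,		/* hand out whole pages */
};

With something like this in place, tm_pages_used reflects the pages
the file-system really holds, so the 75% accounting gap from the
example above cannot appear, and the limit checks are done against
pages instead of per-node byte counts.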

Do you think these are valid reasons to change the existing code?
Do you see any other advantages or disadvantages?

Thanks,

[1] Pools use wired memory, so I'll have to write a custom memory
allocator anyway to avoid that.

--
Julio M. Merino Vidal <jmmv84@gmail.com>
http://www.livejournal.com/users/jmmv/
The NetBSD Project - http://www.NetBSD.org/