tech-kern archive


exact semantics of union mounts (and TRYEMULROOT)

What precisely are the semantics of directory operations on union
mounts supposed to be? (Note: that's mount -o union, not unionfs,
which is mount -t union.)

As some may remember, the chief goal of the namei rototilling that's
now been going on ~forever was to simplify how directory operations
interact with namei and the vnode interface. Things have gotten to the
point where material progress on that is now possible; however, it's
important to implement all the bits and corner cases correctly.

Union mounts are complicated in this regard because when the directory
involved is a union mount point, some layer of the union mount needs
to be chosen to invoke the filesystem-level operation; and in some
cases it might need to be tried repeatedly or at more than one layer
before giving up.

TRYEMULROOT is similar in that ideally (when the directory in question
exists both in the emulation root and the regular root) it would
behave the same way. (As the implementation is quite different that
may not be practical, but I feel like we shouldn't begin by aiming

The current behavior of these operations on union mount points is not
necessarily relevant because after reviewing things I'm fairly certain
that in at least some cases it's wrong.

Directory operations can be divided into five categories:
 - lookup (ordinary directory traversal, operations like stat, open
      without O_CREAT, etc.)
 - nonexclusive create (open with O_CREAT but without O_EXCL)
 - exclusive create (mkdir, symlink, open with O_CREAT|O_EXCL, etc.)
 - remove (rmdir, unlink)
 - rename

So I think these should behave as follows:

For lookup, we should start at the top of the union stack and try
looking up the target name, and descend until either we find it in
some layer or run out of layers. This much is pretty clear.
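
A minimal sketch of that lookup rule, as a toy model rather than
anything resembling the real namei/vnode code: the union stack is
assumed to be a Python list of dicts, topmost layer first, and
union_lookup is a hypothetical name.

```python
# Toy model (assumption): a union stack is a list of layers, topmost first.
# Each layer is a plain dict mapping names to objects.

def union_lookup(stack, name):
    """Return (layer_index, object) for the topmost layer that has name,
    or None if we run out of layers."""
    for index, layer in enumerate(stack):
        if name in layer:
            return index, layer[name]
    return None


stack = [
    {"a": "upper-a"},                   # topmost layer
    {"a": "lower-a", "b": "lower-b"},   # layer underneath
]
print(union_lookup(stack, "a"))  # found in the top layer, which shadows the lower one
print(union_lookup(stack, "b"))  # found only after descending one layer
print(union_lookup(stack, "c"))  # not found anywhere
```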

For nonexclusive create, we should do the same, and if we run out of
layers start at the top again and, knowing that the name doesn't
exist, continue like an exclusive create. This requires not unlocking
the directory in between so that the proposition "the name doesn't
exist" remains true.
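
As a rough sketch of the two-pass idea, again in a toy model: Layer,
UnionDir and open_create are hypothetical names, and a single
threading.Lock stands in for whatever directory locking the kernel
would actually need. The point is only that the lock is held across
both passes.

```python
import errno
import threading
from dataclasses import dataclass, field


@dataclass
class Layer:
    entries: dict = field(default_factory=dict)
    readonly: bool = False


class UnionDir:
    def __init__(self, layers):
        self.layers = layers          # topmost first
        self.lock = threading.Lock()  # stand-in for the directory lock(s)

    def open_create(self, name, new_object):
        """Nonexclusive create. Returns (object, created)."""
        with self.lock:  # not dropped between the two passes
            # Pass 1: ordinary top-down lookup.
            for layer in self.layers:
                if name in layer.entries:
                    return layer.entries[name], False
            # Pass 2: the name is known to be absent, so continue like an
            # exclusive create and use the topmost writable layer.
            for layer in self.layers:
                if not layer.readonly:
                    layer.entries[name] = new_object
                    return new_object, True
            raise OSError(errno.EROFS, "no writable layer in union stack")
```

Because "the name doesn't exist" was established under the same lock
that covers the create, no other thread in this model can slip the name
in between the passes.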

For an exclusive create, however, we need to ascertain that the name
doesn't exist before we try creating anything. Various security
properties depend on exclusive create actually being exclusive, and I
don't think having union mounts weaken this is healthy. So I think we
need to test all layers before creating anything. (It also means we
need to lock all layers, not just one at a time, which we don't
currently do and is currently problematic, but that's a topic for

Once we've ascertained that the name doesn't exist, we use the topmost
read-write layer; that is, start at the top and descend skipping
layers that are tagged readonly. (But: does this strictly mean
readonly as in EROFS, or do we skip layers that are chmod -w for the
current user?)
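
A sketch of the check-everything-then-create order, under the same toy
assumptions (layers are dicts with hypothetical "entries" and
"readonly" keys; "readonly" here means the EROFS sense only, which is
itself one of the open questions above). The model is single-threaded;
a real implementation would have to hold locks on all layers across
both steps.

```python
import errno


def exclusive_create(stack, name, new_object):
    """Create name only if no layer has it. Returns the index of the
    layer that received the new object."""
    # Step 1: test every layer before creating anything.
    for layer in stack:
        if name in layer["entries"]:
            raise FileExistsError(errno.EEXIST, "name exists in some layer")
    # Step 2: descend from the top, skipping layers tagged readonly.
    for index, layer in enumerate(stack):
        if not layer["readonly"]:
            layer["entries"][name] = new_object
            return index
    raise OSError(errno.EROFS, "every layer is read-only")
```

Note that a name present only in a bottom layer still makes the create
fail, even though the layer that would have been written to does not
contain it.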

For remove, I think the correct thing to do is to descend until we
find the topmost layer where the target name exists, if any, and then
operate at that layer.
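
In the same toy model (list of dicts, topmost first; union_remove is a
hypothetical name), that rule might look like:

```python
import errno


def union_remove(stack, name):
    """Remove name from the topmost layer that has it and return that
    layer's index."""
    for index, layer in enumerate(stack):
        if name in layer:
            del layer[name]
            return index
    raise FileNotFoundError(errno.ENOENT, "name not found in any layer")
```

In this model a same-named entry in a lower layer is left untouched, so
a later lookup would find it.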

And for rename, I think the correct thing is to make like remove on
the first (from-dir) argument, then find whatever layer in the second
(to-dir) argument is the same volume, regardless of stack order. This
can result in moving a file under another file, but that's what Plan 9
does and I guess it's ok. (I guess if there's more than one instance
of the same volume in the to-dir union stack, which is not impossible
with rebind mounts if we ever implement that, it should use the
topmost one.)
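
A sketch of that rename rule, assuming each layer carries a hypothetical
"volume" tag. Failing with EXDEV when no to-dir layer is on the source
volume is an assumption of this sketch, borrowed from ordinary
cross-device rename; the text above does not specify that case.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class Layer:
    volume: str
    entries: dict = field(default_factory=dict)


def union_rename(from_stack, from_name, to_stack, to_name):
    """Move from_name to to_name. Returns the index of the to-dir layer
    that was used."""
    # Source side: like remove, use the topmost layer that has the name.
    for src in from_stack:
        if from_name in src.entries:
            break
    else:
        raise FileNotFoundError(errno.ENOENT, "source name not found")
    # Target side: topmost layer on the same volume, regardless of where
    # it sits in the stack.
    for index, dst in enumerate(to_stack):
        if dst.volume == src.volume:
            dst.entries[to_name] = src.entries.pop(from_name)
            return index
    raise OSError(errno.EXDEV, "no to-dir layer on the source volume")
```

If the chosen to-dir layer sits below a layer that already has to_name,
the renamed file ends up underneath the other one, which is the "moving
a file under another file" case.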

Plan 9 has a mount flag (mount -c) that it uses to pick the layer
where new objects get created, rather than going by readonly vs.
read-write; we don't have that but could implement it.
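
To make the difference between the two policies concrete, here is a
small sketch. The "allow_create" key is a hypothetical stand-in for a
per-layer create flag in the spirit of Plan 9's mount -c; nothing like
it exists in our mount code today.

```python
def pick_layer_by_readonly(stack):
    """Topmost layer that is not tagged readonly, or None."""
    for index, layer in enumerate(stack):
        if not layer["readonly"]:
            return index
    return None


def pick_layer_by_create_flag(stack):
    """Topmost layer explicitly flagged for creation, or None."""
    for index, layer in enumerate(stack):
        if layer["allow_create"]:
            return index
    return None


stack = [
    {"readonly": False, "allow_create": False},  # writable, but not for new objects
    {"readonly": False, "allow_create": True},   # designated creation layer
]
print(pick_layer_by_readonly(stack), pick_layer_by_create_flag(stack))
```

With an explicit flag the administrator can keep a writable top layer
from accumulating new objects, which the readonly-based rule cannot
express.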

Does this seem reasonable?

David A. Holland
