Subject: Re: Support for ACLs
To: Todd Vierling <tv@wasabisystems.com>
From: Robert Elz <kre@munnari.OZ.AU>
List: tech-kern
Date: 03/09/2001 17:58:31
    Date:        Thu, 8 Mar 2001 16:26:02 -0500 (Eastern Standard Time)
    From:        Todd Vierling <tv@wasabisystems.com>
    Message-ID:  <Pine.WNT.4.32.0103081608150.-31893-100000@xpc.int.duh.org>

  | There are many ways to address ACLs; one possibility is an ACL file, stored
  | in the fs much like a quota data file.  There's the attached files
  | suggestion.  And there's the prior art in direct inode attachment, which
  | could result in a bare minimum of code to manage their allocation without
  | full ACL support.  All are possible scenarios, and I haven't personally
  | ruled any out.

I am going to ignore all the noise from the past day on this issue,
and concentrate only on the technical issues.

The first is to notice that there are two separate problems to solve for
ACLs - first there's some data that has to be associated with a file
(either just with the files that need them, or with all files).

Then there's the actual ACL implementation itself, assuming that it has
some place to put its data (get it most often).

In the past day there have been messages that relate to both of those
issues - the messages relating to whether or not applications would need
to be updated, whether groups are good enough, all relate to the second
issue (and impact the first only in that if ACLs aren't implemented at
all then clearly there is no data to save).

That part of the subject isn't all that interesting to me personally, I
don't much care whether ACLs exist or not, I know that they're never
going to bother me if I don't want them (and you never know if they exist
I might just use them one day, perhaps).

The messages about file systems used by kernels with and without ACLs,
and ones like Todd's above concern the first issue.   That's more
interesting to me anyway...

Then of course there have been the messages (quite a lot of them) that
aren't about anything at all, those are the ones I'm ignoring.

I must admit I had never really considered storing ACLs in a file like
the quota file, the format doesn't really make that all that appealing,
though it does have some obvious advantages.  The problem would seem to
be that ACLs are not of any fixed length (usually) and can become
quite large.   And the indexing method into a big file would seen to have
to be by inode number (there isn't really anything else suitable), which
would make a simple implementation cause it to be quite difficult to
share ACLs between many files (which is usually what you want to do,
so you can update it once and everything related gets updated).

So, the file would probably need to be a file of pointers - which solves
both the variable length problem (indexing directly into an array of variable
length data structs is a challenging operation), and the sharing problem,
as multiple pointers could point to the same data.   Now, to keep it in
one file, you also now need storage allocation mechanisms, garbage
collection of some form, access control, .... basically you have invented
a new filesystem in a file.   It can certainly be done, but it seems
like a lot of work.

Direct inode attachment is the easy way.   The problem with this is that
it solves exactly one problem, and doesn't help at all with any other
problems - start doing direct inode attachment for all the data that it
might be nice to be able to associate with a file, and you soon have
jumbo inodes, and rather a messy construction.

I prefer something that will generalise better, and also I'm dead lazy,
there's no way I'm going to create a whole new filesystem like structure
if I don't have to.

Jason mentioned work that has been done with portals to implement ACLs.

As a method for working on the first problem (and there are lots of
issues there that need solutions) that sounds great.  It would certainly
be the quick way to get ACLs to exist, and be able to test real running
code with them.

As a long term solution, I don't think it really works all that well.
The problem is that the layered filesystem needs somewhere to store its
data (the portal needs somewhere to get the ACL data).  In what follows
I will use ACL as the example - but the same applies to other kinds of
associated data, like the Dreamcast "Visual Memory" that Marcus mentioned.

For prototyping that's not a problem, there are lots of things that can
be done that will work well enough.   But for a real use, the solution
needs to make sure that a file and its ACL (as well as its inode data) all
get backed up together, and can all be restored together (it is no good
having the file here on this tape/CD if its ACL is over there somewhere
else).  You could put the data all in a big file, but that's then just the
same problem as the "quota file" issue above - the only difference is
exactly where in the kernel the ACL checks are done (and that's part of
the problem I prefer to leave to someone else...) or you could set up a
whole parallel tree of files, that contain the ACLs, and have the ACL
layer keep track or where you are in the real tree and where you are n the
ACL tree.

That's almost where I started when I decided that "associated files"
was probably the most general, and reasonable, implementation method
(though it certainly only works on filesystems whose definitions we
can change - it might be possible to layer it into the CD filesystems,
the same way Joliett and Rockridge are done, but attempting to stick
it into an NTFS, or a MSDOSFS, or a ... would be close to impossible).

That is, the "parallel tree of files" is pretty easy to implement,
and uses the existing filesystem code to do just about everything,
which is great, but it means lots of additional overhead to keep
track of all the kinds of associated data that might exist, there
isn't just one possible thing (ACLs) but lots - that means a lot of
trees of data to keep.

Also, if you're keeping things in sync in the trees by name (which is
the obvious, and perhaps almost only reasonable way to do it), then
you get two immediate problems - first, applying an ACL to a
directory gets hard - you have to invent a new reserved name in the
directory to contain its ACL, which means there's a reserved name for
a file that can't be created in the ordinary tree, which is bad.
And second, it gets to be quite hard to apply ACLs to the ACL files.
That also isn't great.

All of that is what led me to think that linking these additional
files to the files that need them is the better way to associate
additional data with files.   The overhead to find the data is then
quite small, there's no problem at all with file naming, it is easy to
provide associated data to an associated data file, for example, an
ACL file can easily have an ACL saying who is allowed to change it,
what's more, the ACL that the ACL file has can easily be itself...

The implementation of ACLs could still be done in a layered filesystem,
that implementation could even use different methods for finding
(and altering) its data, depending on whether or not the associated
data implementation exists on the filesystem in question or not
(I suspect that without it, the implementation will be more difficult,
and more limited).

That's why in my previous message I said that I had been considering an
implementation of a thing that might be useful for ACLs, it wouldn't
be limited to that, and certainly isn't primarily concerned with ACLs.

To give a few more details (much of this is still just air work at
the minute, I'm sure there are holes).

The pointer to on-disc associated data would be 64 bits.  That 
contains 2 two byte fields, and one 4 byte field.

Most of the time, the 4 byte field is an inode number.   The two
byte fields give the type of data stored at that inode (as it
applies to the file containing the reference)  (Why there are two
of those I might get to later if I don't run out of time).

The inode number causes the link count of the referenced inode to
increase.

If a file contains just one of these associated files, then the
pointer would go in the spare 64 bits we have left in FFS inodes.
If the file contains more than one, then it would grow an
indirect associated file pointer page (one disk frag/block)
which would contain the 8 byte pointers (part of the reason that
there are 2 2 byte fields is to distinguish these cases).  I suspect
that 1024 associated files would be enough for almost anyone (an 8K
block, with 8 byte pointers) - even 512 if the blocksise is just 4K
(of course, if fewer are needed - 99% of the files that have associated
data) then fragments can be used (so just 512 bytes overhead in those
comparatively rare files that have more than 1 kind of associated data).
But we could have associated data files (a whole file, whose purpose is just
to contain pointers to other associated data files as its data) if
we want to allow unlmited numbers of associated data files.   They're
use in the obvious way (and would usually be the sole associated data
for the file to which they're attached).

Then there'd be a sys call for taking a file and attaching it to
another file as associated data of type X.   That would be root
only.  Setuid programs would exist for different kinds of X, would
validate the file being attached, make sure it is of the correct
type to be an X (an ACL would probably be a new inode type completely
which only privileged programs could write in, to make sure its
data format stayed correct for the kernel to use, or it could just
be a file owned by root and 600 in the "traditional" permissions)
and then perform the sys call.

Then there'd be another sys call, "give me the associated data
of type X associated with this file handle (descriptor)", which would
do an open on the associated file (after checking its permissions
etc, all the usual stuff) and return a file descriptor.   And there'd
be a "link the associated file of type X of file Y to name Z" so
if all names for the associated file have been removed, it can be
recovered into the filesystem to be operated upon.

Now, questions of compatability with "non-ACL" kernels were raised.
That's a very valid question.   First, a non-ACL kernel (a kernel
that has no idea about associated data) isn't going to care in the
slightest, the filesystem will still look like one of its old
filesystems, nothing different at all.   It would have some inodes
marked allocated that it will never use, and some blocks missing from
the freelist that it will never be able to find, but that's harmless.

Running an old FSCK on such a filesystem is a different question
though - and one simple answer is "don't do that".  It would be
simple enough to make a new fsck that could run on any old system
and would understand the new formats.

But if it were done, the effects would be that any associated data
files that were linked into the filesystem would have their link
counts "corrected" to count only the filesystem links.  Running
a new fsck later would "uncorrect" that harmlessly.

Any associated data files that had no filesystem links left would
be recovered into lost+found.   Of itself that's harmless, the
new fsck would then notice that the link counts of those files were
all wrong, and correct them, just as in the previous case (except
that the files would have been linked back into the filesystem
as well, in lost+found).

The harder problem is the blocks that are used as the indirect blocks
for the pointers - an old fsck would simply see them as missing
unallocated blocks, and would reclaim them into the free block bit maps
and then potentially the filesystem would reallocate them).

That would mean that a new fsck of the filesys would find lots of
duplicate blocks between indirect pointer blocks, and data blocks
in other files (or between indirect pointer blocks, and free blocks).
That would not be nice - it would at least have to zap all the
indirect pointer blocks (even if they're still listed as free there's
no way to know that they haven't been used and freed again since last
containing pointer data - though a checksum might perhaps avoid some
of that).  It might also have to zap all the files which used those
blocks, because knowing the real cause of a duplicate block is never
easy.

One solution to this might be to have all the pointer blocks (in
reality, on most filesystems, there wouldn't be many...) linked into
a magic file, which exists just so the blocks don't look to be free
to an old fsck.   Whether that's better than "don't do that" I'm
not sure...   It raises other problems of its own (not the least being
that one file can only have one (group of) frgments, at its end, and only
when it is a small file - so the extra data code would probably need its
own fragmentation handling code (always allocate blocks, and then hand
out fragments as pointers as needed).   Messy.

That is it for associated data in general I think.

For ACLs, assuming there's a new ACL inode type, then an old FSCK
would also like to clear that inode, and zap it, so, running it
would tend to destroy all the old ACL's in the system.   Another
reason for "don't do that".

Keeping backwards compatability this way (I mean, so you can take
new stuff and use it with old systems) is always hard, and there
always gets to be a point where it simply isn't worth doing any more.

In this case, assuming any of this was to actually happen, the problem
could be lessened by creating the new fsck in advance, and shipping it
with systems on which none of the data will ever actually exist, so
it is there and available, on systems that don't need it - but so that
if later, a new filesystem is brought in contact with that system,
then its fsck wouldn't mangle the world.

For example, such a new fsck could be shipped with 1.6, 1.5.2, 1.4.4
(assuming that those releases ever get made, I assume there will never
be a 1.3.4 now...) and the associated data filesystem only shipped
with 1.7.

And now I have to stop...

Anyway, this is a kind of snapshot of what I have been thinking about.
Please feel free to tear all the technical holes in it that you can
find.

On the other hand "not the BSD way" and noise like that I will simply
ignore (it isn't as if that comment is even correct).

kre