tech-kern: RE: Google Summer of Code 2006

Subject: RE: Google Summer of Code 2006
To: Jan Schaumann <jschauma@netmeister.org>
From: Gordon Waidhofer <gww@traakan.com>
List: tech-kern
Date: 04/15/2006 13:27:47

For the Google Summer of Code 2006, I'd like to suggest
a project to add subfiles to NetBSD somewhat like Solaris
has done (under the name Extended Attributes). Subfiles
are important for supporting the NT file model and will
enhance Samba support. They are also important for NFSv4
(called Named Attributes) and are already supported by
Sun Microsystems, Network Appliance, and EMC.

I can't volunteer to do the project, but perhaps this
proposal can be posted under the Filesystems category
and somebody might take interest.

Regards,
        -gww



================================================================


Google SOC Project: UFS/UFS2 subfile support




Introduction
============

Subfiles are an old idea and there are several
mature implementations. Yet they are not fully
understood. What should the APIs be? What should
the access control model be? What are the
implications for other file technologies like
utilities, backup, browsers, email, ftp, nfs,
tar, etc? What influence do subfiles have on
other file features, file interfaces, and file
systems? There is an obvious benefit of having
compatibility with the Windows NT file model.
Are there other benefits? Are subfiles a whizzy
solution looking for a problem? Or are they
a fantastic idea long overdue for deployment?
Or something in between?

What is needed is a pilot design and implementation
of subfiles for NetBSD, and the BSD world in general.
This will serve as a laboratory for evaluating
designs, exploring issues, and participating in the
formation of subfile semantics.




Background
==========

Subfiles appear in a number of operating systems
and other technologies and will have growing
importance:

        * NTFS has ADS (Alternate Data Streams)
        * Solaris has Extended Attributes, a
          misnomer and not to be confused with
          BSD/Linux EAs
        * Apple Mac has resource/data forks
        * NFSv4 has Named Attributes

The Samba folks are looking to support ADS using
subfiles. It is an active area of interest in
Linux kernel development. Although NFSv4
originally envisioned supporting Named Attributes
using Linux/BSD/OS2 style EAs (sometimes called
weenie EAs) the course of events has led to NFSv4
Named Attributes being supported by subfiles.
Implementations include Solaris, NetApp, EMC, and
Hummingbird (NFS on NT).

The semantics of subfiles are well represented by
the Solaris EAs. However, the Samba and Linux
folks have found subtle problems with the Solaris
API. Solaris marks the APIs as experimental.

Still, Solaris has done a good and thorough job
and can be used as a design model. See fsattr(5)
in the Solaris 10 reference manual.

Please see http://nasconf.com/pres04/waidhofer.pdf
for further summary and references.




Suggested APIs
==============

The APIs on Solaris have met with criticism. The APIs
of NT are probably right out. The Linux APIs are
being contemplated. Here is a suggested API intended
to be simple and clear but not permanent.

Trond Myklebust [trond.myklebust@fys.uio.no] is
also actively contemplating the APIs for Linux.
The Solaris folks are also reevaluating their API.
Expect subfile APIs to be a fluid topic
for some time.

Please see the "What's wrong with Solaris...." section, below.

        int
        subfile_open (
          char *        basefilename,
          char *        subfilename,
          int           open_flags);

        int
        subfile_fdopen (
          int           basefilefd,
          char *        subfilename,
          int           open_flags);

These return normal file descriptors (fd) that can
participate in close(), fstat(), read(), write(),
lseek(), and all other normal file APIs.

Subfile_fdopen() cannot be used to create subfiles
on subfiles or on subfile directories.

Subfile_open (basefile, ".", O_RDONLY) can be used
to open the subfile directory.

Variants of opendir(3) and fopen(3) can be built
around these primitives. It's also worth implementing
the Solaris attropen(3) library interface
for compatibility.

        int
        subfile_stat (
          char *        basefilename,
          char *        subfilename,
          struct stat * sb);

        int
        subfile_fdstat (
          int           basefilefd,
          char *        subfilename,
          struct stat * sb);

        int
        subfile_remove (
          char *        basefilename,
          char *        subfilename);

        int
        subfile_fdremove (
          int           basefilefd,
          char *        subfilename);
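
To get a feel for the calling convention, the
path-based calls above can be approximated in user
space. The sketch below is an illustration only, not
the proposed kernel implementation. It assumes a
hypothetical sidecar directory named
"<basefile>.subfiles", takes const char * arguments
for convenience, and omits the fd-based variants.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical layout: subfiles of "f" live in the directory "f.subfiles". */
static int
sidecar_path(char *buf, size_t len, const char *base, const char *sub)
{
        assert(buf != NULL && base != NULL && sub != NULL);
        /* Subfile names form a flat namespace: reject "", "..", and any '/'. */
        if (sub[0] == '\0' || strcmp(sub, "..") == 0 || strchr(sub, '/')) {
                errno = EINVAL;
                return -1;
        }
        int n = snprintf(buf, len, "%s.subfiles/%s", base, sub);
        if (n < 0 || (size_t)n >= len) {
                errno = ENAMETOOLONG;
                return -1;
        }
        return 0;
}

int
subfile_open(const char *basefilename, const char *subfilename, int open_flags)
{
        char path[PATH_MAX], dir[PATH_MAX];
        struct stat sb;

        /* The base file must already exist. */
        if (stat(basefilename, &sb) < 0)
                return -1;
        if (sidecar_path(path, sizeof path, basefilename, subfilename) < 0)
                return -1;
        if (open_flags & O_CREAT) {
                /* The subfile directory is instantiated on demand. */
                snprintf(dir, sizeof dir, "%s.subfiles", basefilename);
                if (mkdir(dir, 0700) < 0 && errno != EEXIST)
                        return -1;
        }
        /* Access control refers to the base file: reuse its mode bits. */
        return open(path, open_flags, sb.st_mode & 0777);
}

int
subfile_stat(const char *basefilename, const char *subfilename, struct stat *sb)
{
        char path[PATH_MAX];

        if (sidecar_path(path, sizeof path, basefilename, subfilename) < 0)
                return -1;
        return stat(path, sb);
}

int
subfile_remove(const char *basefilename, const char *subfilename)
{
        char path[PATH_MAX];

        /* "." names the subfile directory itself, which is not removable here. */
        if (strcmp(subfilename, ".") == 0) {
                errno = EINVAL;
                return -1;
        }
        if (sidecar_path(path, sizeof path, basefilename, subfilename) < 0)
                return -1;
        return unlink(path);
}
```

With this in place, subfile_open("report", "primary",
O_CREAT | O_RDWR) yields an ordinary descriptor, and
subfile_open("report", ".", O_RDONLY) opens the
sidecar directory once it exists. Unlike real
subfiles, the sidecar directory is not removed along
with the base file.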




File system changes
===================

This approach is about what Solaris UFS did.
It leverages the inode-centric model which
makes things a lot easier. The alternative
-- to invent whole new data structures, allocation
methods, directory/index methods, etc --
is very difficult and offers no advantage. However,
an inode currently provides more than subfiles
need, so some inode fields are redundant.
That may change in time.

Add two inode types:

        IFSFDIR         subfile directory
        IFSFREG         subfile proper

An eligible file system object may have subfiles.
Its inode will be called the base inode. Eligible file
system objects are IFREG and IFDIR. In future, more
types (IFCHR, IFBLK, etc) may become eligible.
Add a field to the inode structure to reference the
IFSFDIR inode; 0 means there isn't one.

An IFSFDIR:
        * is referenced by exactly one base inode;
          add a back pointer to the inode struct
        * contents are the same as a normal
          directory (struct direct)
        * contains a "." entry that references
          the IFSFDIR
        * contains a ".." entry that references
          the base inode
        * entries may only reference IFSFREG;
          there may be no subdirectories or
          device nodes
        * is instantiated on demand
        * is removed when the base inode is removed
        * does not have independent access control
          (uid, gid, mode); refer to the base inode

An IFSFREG:
        * is just like an IFREG
        * could (should?) contain a back pointer to the IFSFDIR
        * can only have a link count of 1
        * can be deleted explicitly (see subfile_remove(), above)
        * is deleted when the base inode is deleted
        * does not have independent access control
          (uid, gid, mode); refer to the base inode
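
The rules in the two lists above can be written down
as checks of the kind fsck(8) might perform. This is
a rough sketch only: the struct, the field names
(sfdir_ino, parent_ino), and the SF_ type codes are
hypothetical and are not taken from the UFS headers.
It also assumes that the optional IFSFREG back
pointer is present, and that ordinary directories
never name subfile objects directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative type codes; real values would sit beside IFREG, IFDIR, etc. */
enum sf_type {
        SF_IFREG, SF_IFDIR, SF_IFCHR, SF_IFBLK,
        SF_IFSFDIR,             /* subfile directory */
        SF_IFSFREG              /* subfile proper */
};

struct sf_inode {
        enum sf_type type;
        uint16_t nlink;
        uint64_t sfdir_ino;     /* base inode -> its IFSFDIR, 0 means none */
        uint64_t parent_ino;    /* IFSFDIR -> base inode, IFSFREG -> IFSFDIR */
};

/* Only IFREG and IFDIR are eligible base objects for now. */
bool
sf_can_be_base(const struct sf_inode *ip)
{
        return ip->type == SF_IFREG || ip->type == SF_IFDIR;
}

/* May a directory entry in 'dir' reference 'target'? */
bool
sf_entry_allowed(const struct sf_inode *dir, const struct sf_inode *target)
{
        bool target_is_sf =
            target->type == SF_IFSFDIR || target->type == SF_IFSFREG;

        if (dir->type == SF_IFSFDIR)    /* no subdirectories, no device nodes */
                return target->type == SF_IFSFREG;
        if (dir->type == SF_IFDIR)      /* assumption: subfiles stay hidden */
                return !target_is_sf;
        return false;
}

/* Per-inode invariants from the lists above. */
bool
sf_inode_consistent(const struct sf_inode *ip)
{
        assert(ip != NULL);
        /* No subfiles on subfiles, subfile directories, or ineligible types. */
        if (ip->sfdir_ino != 0 && !sf_can_be_base(ip))
                return false;
        switch (ip->type) {
        case SF_IFSFDIR:
                return ip->parent_ino != 0;
        case SF_IFSFREG:
                return ip->nlink == 1 && ip->parent_ino != 0;
        default:
                return ip->parent_ino == 0;     /* field unused otherwise */
        }
}
```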

When a base inode is deleted, the subfile tree has
to be deleted also. This makes remove() a much more
complicated operation. There are likely to be tricky
locking issues.

VFS/vnode operations will have to be defined and implemented.




File system utilities
=====================

fsck(8) and dump(8) will have to be modified to
recognize and audit the new inode fields and inode
types. This is the genius of the Solaris approach:
it reuses the existing utilities by building on
the existing inode-centric mechanism.




User utilities
==============

The list of user file utilities is fairly obvious:
ls(1), cp(1), tar(1). Solaris defines a new '-@'
command line option to these utilities to specify
that subfiles should be operated upon. See Solaris
10 fsattr(5) manual page for a more complete list.

Solaris has already modified the tar(1) format
for subfiles. Interoperability with the Solaris
tar(1) should be a project objective.




NFSv4 support
=============

Adding NFSv4 support to NetBSD would be a project
in itself. Conferring with the developers of NFSv4
for BSD would be advisable.




Samba support
=============

This is far more feasible than NFSv4 support.
Perhaps even more interesting and important.

Samba has a file switch (like the VFS layer)
that would allow straightforward support
of NT ADS using NetBSD subfiles.

NT uses subfiles (ADS) to hang thumbnails
on image files. Put a fresh copy of a thousand
largeish images in an NT directory (folder).
The first time you access with "View Thumbnails"
it takes a while. But after that it is quite
fast. Now do the same on a Samba share. It's
slow every time.

With NetBSD subfiles and the right tweaks to
Samba, a Samba share should perform as quickly
as NTFS on a local disk.

At least that's the theory. It's never been
put to the test. It's real braving-the-frontier
stuff!

(Samba on Solaris may have this support by now).




Subfile access control
======================

Access control of subfiles is a hot issue. There
are security vulnerabilities in NT based on ADS.
What does it mean to have an r/w base file and an
r/o subfile? What does it mean to have an r/o
base file and an r/w subfile? Should a subfile have
an independent ACL? There are lots of opinions but
nobody really knows. Although NFSv4 is prepared
for whatever access control model emerges, the
spec and the community have no recommendations.
The opportunity to explore these issues alone
justifies a pilot implementation.

For the pilot implementation, the suggested access
control is to refer to the base file. It is
simplest (from a design point of view) and
there is a logic to it -- at least at first.

Suppose a file is r/w. To organize content within
the file all sorts of data structures and indexes
could be used. Or well known, sparse seek offsets
could be used. For example, offset 0 is the content,
offset 100g is the primary index, offset 200g is
the secondary index, offset 300g contains backup
advice. Complex data structures can be awkward.
Sparse files can also be awkward. Whatever approach
is taken to organize data in a monolithic file, there
is still exactly one access control path.

Subfiles allow a convenient way to organize data.
They are -- in one view -- an alternative to
well known seek addresses. The base file has
the content. Subfile "primary" contains the primary
index. Subfile "secondary" contains the secondary
index. Subfile "dirtymap" contains backup advice.
It is far less awkward. Putting all of that under
a single access control path is entirely consistent
with the lseek() or data structure model.

In future the access control issues will be
better understood and conventions will emerge.
Indeed, the NetBSD community can influence that.

Start with simple.




What is wrong with the Solaris openat() API, et al?
===================================================

The Solaris openat() API, et al, predates Solaris
work on subfiles (EAs). These interfaces were
originally done to save pathname traversal. For
example, when walking a directory tree (think
find(1)), wouldn't it be nice to stat() entries by
giving the file descriptor of the enclosing
directory followed by the name? Rather than
stat()ing x/y/z/a/b/c/f, use statat() passing the
file descriptor for directory x/y/z/a/b/c and the
name "f". Openat() was extended with the O_XATTR
flag to support subfiles.
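
The pathname-traversal saving can be sketched with
the openat()-family calls as later standardized in
POSIX.1-2008 (fstatat() and fdopendir()). This
assumes a system that provides those calls. The
function name count_regular_files() is made up for
the example, and no subfile support (O_XATTR) is
involved.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Count the regular files directly inside dirpath. The directory
 * path is resolved once; each entry is then stat'ed relative to the
 * directory's file descriptor. Returns -1 on error.
 */
long
count_regular_files(const char *dirpath)
{
        assert(dirpath != NULL);

        int dfd = open(dirpath, O_RDONLY);
        if (dfd < 0)
                return -1;

        DIR *dp = fdopendir(dfd);       /* fails if dfd is not a directory */
        if (dp == NULL) {
                close(dfd);
                return -1;
        }

        long count = 0;
        struct dirent *de;
        while ((de = readdir(dp)) != NULL) {
                struct stat sb;
                /* Only the final name component is looked up here. */
                if (fstatat(dfd, de->d_name, &sb, AT_SYMLINK_NOFOLLOW) == 0 &&
                    S_ISREG(sb.st_mode))
                        count++;
        }
        closedir(dp);                   /* also closes dfd */
        return count;
}
```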

The favored Solaris interface is attropen(3). It
takes the name of the base file and the name of
the subfile as arguments. It is implemented as a
library that first open()s the base file, then
openat()s the subfile, then close()s the fd of the
base file. That's where a subtle flaw appears. By
close()ing the fd, any POSIX locks held by the
process on the base file -- even through a
different file descriptor -- are lost.
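
The lock loss can be reproduced without any subfile
support, because it follows from the POSIX rule that
closing any descriptor for a file releases all of the
process's fcntl() locks on that file. The sketch
below mimics what attropen() does to the base file:
lock through one descriptor, then open and close a
second one. The function names are made up for the
example. The lock is probed from a child process,
since F_GETLK never reports the caller's own locks.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the calling (parent) process holds a lock that would
 * block a write lock on path, 0 if not, -1 on error. fcntl() locks
 * are not inherited across fork(), so the child sees the parent's
 * lock as belonging to another process.
 */
static int
parent_holds_lock(const char *path)
{
        pid_t pid = fork();
        if (pid < 0)
                return -1;
        if (pid == 0) {
                struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
                int fd = open(path, O_RDWR);
                if (fd < 0 || fcntl(fd, F_GETLK, &fl) < 0)
                        _exit(2);
                _exit(fl.l_type == F_UNLCK ? 0 : 1);
        }

        int status;
        if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
            WEXITSTATUS(status) == 2)
                return -1;
        return WEXITSTATUS(status);
}

/* Lock via fd1, then open and close an unrelated fd2 on the same file. */
int
demo_lock_loss(const char *path, int *held_before, int *held_after)
{
        assert(held_before != NULL && held_after != NULL);

        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd1 = open(path, O_RDWR | O_CREAT, 0600);
        if (fd1 < 0)
                return -1;
        if (fcntl(fd1, F_SETLK, &fl) < 0) {     /* whole-file write lock */
                close(fd1);
                return -1;
        }
        *held_before = parent_holds_lock(path);

        int fd2 = open(path, O_RDONLY);         /* like attropen()'s open() */
        if (fd2 < 0) {
                close(fd1);
                return -1;
        }
        close(fd2);                             /* ...and its close() */
        *held_after = parent_holds_lock(path);

        close(fd1);
        return 0;
}
```

On a POSIX-conforming system, held_before should
come back as 1 and held_after as 0, even though fd1
was never closed or unlocked in between.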

Openat() and friends were once proposed on the
NetBSD list for the benefits of reducing pathname
traversal. However, there were coherent objections
based on security concerns. These have not been
fully explored.




==END====END====END====END====END====END====END==


> -----Original Message-----
> From: netbsd-advocacy-owner@NetBSD.org
> [mailto:netbsd-advocacy-owner@NetBSD.org]On Behalf Of Jan Schaumann
> Sent: Saturday, April 15, 2006 6:55 AM
> To: netbsd-advocacy@netbsd.org; tech-misc@netbsd.org;
> pkgsrc-users@netbsd.org; netbsd-users@netbsd.org
> Subject: Google Summer of Code 2006
> 
> 
> Hello,
> 
> My apologies for cross posting, but this concerns a fairly wide area of
> NetBSD aficionados:
> 
> The NetBSD Project is pleased to once again participate in Google's
> Summer of Code 2006 [http://code.google.com/soc/] as a mentoring
> organization.  A list of possible projects is available from
> http://www.netbsd.org/contrib/projects.html.  If you are interested in
> any of these projects or have other suggestions, please either contact
> me in private, post to the relevant mailing list, or post to
> netbsd-advocacy@netbsd.org.
> 
> Please remember that the list of projects is not complete (by far), and
> that we will be happy to add your suggestions and will accept applicants
> with their own ideas as well.
> 
> (Note: Please do not group-reply to all lists.  Instead, pick an
> appropriate list to follow up to on-topic.  General feedback should go
> to netbsd-advocacy@netbsd.org.  You have been warned.  And I apologize
> for cross posting.  Again.)
> 
> -Jan
> 
> -- 
> I'm not even supposed to be here today!
>