tech-kern archive


the bouyer-quota2 branch



Hi,
I think the code in the bouyer-quota2 branch is now stable and
ready to be merged to HEAD. Unless there are objections, I'll merge it
in about 2 weeks.

To get a diff:
cvs -d anoncvs@anoncvs.netbsd.org:/cvsroot rdiff -kk -u -r bouyer-quota2 -r bouyer-quota2-base src

This branch is for the development of a modernized disk quota system.
The 2 main changes are: a new quotactl(2) interface and a new on-disk
format, compatible with journaled ffs.

The new quotactl(2) uses a plist format to send commands and exchange data
with the kernel. Using plists for this has several advantages:
- the plist format can change without the need to version the syscall;
  only the plist parser needs to be changed, and backward compatibility
  can be handled at the parser level.
- the plist format can easily be extended to fit filesystems other than
  ufs.
- it is easy to pass it back to puffs servers.
- it is easy to use in scripts.

The format used is documented in quotactl(2). A new quotactl(8) command
has been added, which allows sending and receiving plists from userland;
the idea is to make it easier to manage quotas from scripts.
The branch has code under COMPAT_50 to deal with the old syscall.

The in-tree quota commands quota(1), edquota(8), repquota(8), rpc.rquotad(8),
quotacheck(8), quotaon(8) have been updated to use the new syscall interface.
I also took this opportunity to change the semantics of the values reported
by these utilities (which are also the values used in plists): 0
is "nothing allowed" (instead of 1 as it is currently), and "no limit" is
represented by the string "-" or "unlimited" (in the plist as well as the new
on-disk format this is UQUAD_MAX, i.e. 0xffffffffffffffff). The old disk format
still uses 0 as unlimited and 1 as nothing allowed; the semantic difference
is handled in kernel and userland conversion utilities (see quota1_subr.c).
repquota gains a -x option, which exports the quotas as a "set" plist command
which can be fed directly to quotactl(8). This is one way to move limits
from one fs to another (or convert to the new on-disk format).
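
To make the change of convention concrete, here is a minimal Python sketch
of the conversion between the two representations. The function names are
hypothetical, and shifting every old value other than 0 down by one (not
only the value 1) is an assumption of the sketch; quota1_subr.c is the
reference for what the branch really does.

```python
# Sketch of the limit-value conversion between the old (quota1) convention
# and the new (plist / quota2) convention. Names are hypothetical.

UQUAD_MAX = 0xFFFFFFFFFFFFFFFF  # "no limit" in plists and the quota2 format


def q1_to_q2_limit(old: int) -> int:
    """Convert an old-format limit to the new convention."""
    if old == 0:
        return UQUAD_MAX  # old 0 meant "unlimited"
    # Old 1 meant "nothing allowed", which is now 0. Treating all other
    # values as off by one too is an assumption of this sketch.
    return old - 1


def q2_to_q1_limit(new: int) -> int:
    """Convert a new-convention limit back to the old format."""
    if new == UQUAD_MAX:
        return 0  # "unlimited" in the old format
    return new + 1


def format_limit(new: int) -> str:
    """Render a new-convention limit roughly as the utilities would."""
    return "unlimited" if new == UQUAD_MAX else str(new)
```

With these helpers, an old limit of 1 becomes 0 ("nothing allowed") and an
old limit of 0 is printed as "unlimited".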

A new on-disk format has been added (called quota2, see quota2.h).
The usages and limits are stored in unlinked inodes (one for user quotas and
one for group quotas); they can no longer be stored outside of the filesystem.
This ensures that quotas are covered by the filesystem clean flag or journal.
A quota file has a header containing some persistent parameters, a default
quota entry, and the free list and hash list of quota entries. The quota
file is not sparse; quota entries are held in hash lists. The kernel keeps
a cache of quota entries, which keeps each entry's offset in the file to
avoid walking the list on each lookup.
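
As an illustration of the layout described above, here is a small in-memory
model in Python. It is only a sketch: the class and field names, the number
of hash buckets and the use of offset 0 as list terminator are all
assumptions made for the example, not what quota2.h defines.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

HASH_SIZE = 8  # hypothetical number of hash buckets
NIL = 0        # offset 0 acts as "end of list" in this sketch


@dataclass
class Entry:
    ident: int = 0   # uid or gid
    limit: int = 0
    usage: int = 0
    next: int = NIL  # offset of the next entry on the same list


class QuotaFile:
    def __init__(self, nentries: int) -> None:
        # Slot 0 stands in for the header, so real entries start at 1.
        self.entries: List[Entry] = [Entry() for _ in range(nentries + 1)]
        self.buckets: List[int] = [NIL] * HASH_SIZE
        self.free = NIL
        for off in range(nentries, 0, -1):  # chain all slots on the free list
            self.entries[off].next = self.free
            self.free = off
        self.cache: Dict[int, int] = {}     # ident -> offset in the file
        self.walks = 0                      # list steps taken, for illustration

    def allocate(self, ident: int, limit: int) -> Entry:
        off = self.free
        if off == NIL:
            raise RuntimeError("no free quota entry")
        entry = self.entries[off]
        self.free = entry.next              # unlink from the free list
        bucket = ident % HASH_SIZE
        entry.ident, entry.limit, entry.usage = ident, limit, 0
        entry.next = self.buckets[bucket]   # push on the hash list
        self.buckets[bucket] = off
        self.cache[ident] = off
        return entry

    def lookup(self, ident: int) -> Optional[Entry]:
        off = self.cache.get(ident)
        if off is None:                     # cache miss: walk the hash list
            off = self.buckets[ident % HASH_SIZE]
            while off != NIL and self.entries[off].ident != ident:
                self.walks += 1
                off = self.entries[off].next
            if off == NIL:
                return None
            self.walks += 1
            self.cache[ident] = off
        return self.entries[off]
```

The point of the cached offset is visible in the walks counter: the first
lookup of an id walks its hash list, later lookups go straight to the slot.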

This new format has gained 64-bit limits and usages (32 bits is not enough
for modern storage sizes), and 2 new features:
- a default quota entry is used as a template for new quota entries allocated
  when a new uid/gid shows up on the filesystem. This template is configurable,
  so that a sysadmin can decide what to allow to unknown users.
- per-user/group grace time.
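
The two features above can be pictured with a short Python sketch. The
field and method names are hypothetical and only illustrate the idea of a
configurable template plus a grace time stored in each entry.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class QuotaEntry:
    soft_limit: int
    hard_limit: int
    grace_seconds: int  # grace time is stored per entry
    usage: int = 0


class QuotaTable:
    def __init__(self, default: QuotaEntry) -> None:
        self.default = default              # template set by the sysadmin
        self.entries: Dict[int, QuotaEntry] = {}

    def entry_for(self, ident: int) -> QuotaEntry:
        # A uid/gid seen for the first time gets a copy of the template.
        if ident not in self.entries:
            self.entries[ident] = replace(self.default, usage=0)
        return self.entries[ident]

    def set_grace(self, ident: int, seconds: int) -> None:
        # Changing one entry's grace time leaves all other entries alone.
        self.entries[ident] = replace(self.entry_for(ident),
                                      grace_seconds=seconds)
```

For example, with a default of 7 days, giving uid 1234 a one-hour grace
time does not change what a later, unknown uid receives.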

Quotas are enabled with tunefs -q user and/or -q group (and disabled with
-q nouser -q nogroup), or at newfs time with the same -q option.
After a tunefs -q, a fsck of the filesystem is required.

There is no quotacheck/quotaon anymore for quota version 2. Quota usages
are checked in fsck_ffs(8) at the same time as other filesystem metadata.
Usages are computed in phase1 (and adjusted in other phases if fsck needs to
create or delete files, or change block allocations) and checked against
recorded usages in phase6. phase6 will also do other consistency checks
against the quota inodes, or even create them if none exists (e.g. just
after a tunefs). While doing this I discovered some pieces missing in
fsck_ffs about block accounting when allocating inodes and blocks,
which I fixed (this is why ffs_clusteracct() moved to ffs_subr.c;
as a bonus it's one less function replicated in makefs(8)).
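
The compute-then-compare scheme can be sketched in a few lines of Python.
This is a simplified model, not the fsck_ffs code: the function names are
hypothetical and an inode is reduced to an (owner, blocks) pair.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Usage = Tuple[int, int]  # (blocks, inodes)


def compute_usage(inodes: Iterable[Tuple[int, int]]) -> Dict[int, Usage]:
    """Phase1-style pass: sum blocks and inode counts per owner.

    Each item is (uid, blocks) for one allocated inode.
    """
    totals = defaultdict(lambda: [0, 0])
    for uid, blocks in inodes:
        totals[uid][0] += blocks
        totals[uid][1] += 1
    return {uid: (blocks, count) for uid, (blocks, count) in totals.items()}


def check_usage(computed: Dict[int, Usage],
                recorded: Dict[int, Usage]) -> List[Tuple[int, Usage, Usage]]:
    """Phase6-style pass: list (uid, recorded, computed) for each mismatch."""
    problems = []
    for uid in sorted(set(computed) | set(recorded)):
        want = computed.get(uid, (0, 0))
        have = recorded.get(uid, (0, 0))
        if want != have:
            problems.append((uid, have, want))
    return problems
```

An id present on only one side counts as a mismatch against (0, 0), which
covers both stale entries and entries missing from the quota inode.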

Instead of keeping usages in memory, synced to disk on sync or
at umount time, quota usages are now updated as other metadata in
real time (or delayed write, depending on mount options). This way,
quota usages are also covered by the journal (usage update is in the same
WAPBL transaction as the one allocating/freeing inodes or blocks),
and so usages should be accurate after a log replay (quotacheck(8) is
basically a pass 1 fsck, and the time required for today's storage sizes
is just not acceptable).
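
Why putting the usage update in the same transaction keeps usages accurate
after a replay can be shown with a toy model in Python. This is not WAPBL:
the Journal class, the key names and the "new value per key" log records
are assumptions made to keep the sketch short.

```python
from typing import Dict, Hashable, List, Optional


class Journal:
    """Toy journal: only transactions that reached end() are replayed."""

    def __init__(self) -> None:
        self.log: List[Dict[Hashable, int]] = []
        self.current: Optional[Dict[Hashable, int]] = None

    def begin(self) -> None:
        self.current = {}

    def write(self, key: Hashable, value: int) -> None:
        self.current[key] = value  # record the new value of this metadata

    def end(self) -> None:
        self.log.append(self.current)
        self.current = None


def allocate_blocks(journal: Journal, mem: Dict[Hashable, int],
                    uid: int, nblocks: int) -> None:
    # The free-block count and the owner's usage change in ONE transaction.
    journal.begin()
    mem["free_blocks"] -= nblocks
    journal.write("free_blocks", mem["free_blocks"])
    mem[("usage", uid)] = mem.get(("usage", uid), 0) + nblocks
    journal.write(("usage", uid), mem[("usage", uid)])
    journal.end()


def replay(journal: Journal, disk: Dict[Hashable, int]) -> Dict[Hashable, int]:
    # Apply complete transactions in order; an unfinished one is ignored.
    for txn in journal.log:
        disk.update(txn)
    return disk
```

After a replay, either both the allocation and the usage update are on
disk or neither is, so the sum of free blocks and usages stays constant.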

This code has been tested in several ways. In addition to the atf
tests in the branch testing basic functionality (as well as some
corruption scenarios for fsck_ffs), I did stress tests on a XEN3_DOMU
with 256MB RAM as well as on a dual-core i5 (with hyperthreading, so the
kernel sees 4 CPUs) with 2GB RAM. One of the stress tests was
to run 5 bonnie++ in a loop under 5 different uids, while at the same time
running quota(1), repquota(8), quotactl(8) commands in loops, on both
logged and non-logged filesystems.
I also ran a bonnie++ in a loop while taking and deleting snapshots
of the filesystems, also in loops. All issues discovered this way have been
fixed.
In order to have fsck_ffs against a snapshot report no errors, I had to make
a wider change. I added a per-inode flag, "SF_SNAPINVAL", used to mark a
snapshot inode as invalid. Right now, a snapshot inode shows up as a
0-size regular file in the snapshot, and userland tools don't know it is
a snapshot inode. The result is that quota usages are miscomputed by
fsck_ffs as snapshot inodes are not included in usage. Now snapshot inodes
in the snapshot are marked SF_SNAPSHOT | SF_SNAPINVAL, so userland tools
know it's a snapshot (as a bonus, dump can ignore them as well), while
the kernel can refuse to use it as a snapshot.
I believe this flag can also be used to speed up snapshot creations, but
this won't be investigated as part of the branch.
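
The way the two flags split the work between userland and kernel can be
sketched as two predicates in Python. The numeric values below are
placeholders picked for the example, not copied from the headers, and the
function names are hypothetical.

```python
# Placeholder bit values for this sketch only; see the kernel headers
# for the real definitions.
SF_SNAPSHOT = 0x00200000
SF_SNAPINVAL = 0x00800000


def is_snapshot_inode(flags: int) -> bool:
    """What a userland tool (fsck_ffs, dump) needs to know."""
    return bool(flags & SF_SNAPSHOT)


def usable_as_snapshot(flags: int) -> bool:
    """What the kernel needs to know before using the inode as a snapshot."""
    return bool(flags & SF_SNAPSHOT) and not (flags & SF_SNAPINVAL)
```

A snapshot inode seen inside a snapshot carries both bits, so the first
predicate is true and the second is false.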

Finally, here are some bonnie++ results on the Core i5 above (I'll add that
the disk system is a 500GB WDC WD5000AADS-00S9B0 on an ahcisata controller)
used for tests. "plain" is HEAD with plain ffs, "log" the same mounted
with -o log.
"quota1" is "plain" with user quota1 enabled (the quota file is at the
root of the test filesystem), "quota2" is "plain" with the new quota
enabled for user. "quota2log" is "quota2" mounted -o log (quota1 and log
are mutually exclusive).
As you can see there is no measurable performance impact.

Version  1.03e      ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
plain            4G 71199  43 71717  11 30440   5 73216  77 74573   9 183.4   0
plain            4G 71972  44 71906  12 30446   6 73959  77 74637   9 178.4   0
plain            4G 71922  44 71800  11 30438   6 73756  77 74669   9 177.8   0
log              4G 69776  43 71641  13 30732   6 73709  77 74653   9 176.1   0
log              4G 71254  44 71404  12 30548   6 73968  77 74653   9 176.5   0
log              4G 71183  44 71581  13 30499   6 73400  77 74812   9 176.5   0
quota1           4G 70320  43 71792  12 30694   6 73787  77 74637   9 180.3   0
quota1           4G 71567  43 71772  12 30781   6 73774  77 74541   9 178.8   0
quota1           4G 71829  44 71669  12 30393   5 73324  77 74796   9 179.1   0
quota2           4G 70349  43 71311  12 30502   5 71670  75 74636   9 181.2   0
quota2           4G 72125  44 71486  12 30560   6 73385  77 74621   9 178.0   0
quota2           4G 71411  43 71379  12 30606   6 73772  77 74621   9 179.9   0
quota2log        4G 69453  43 71947  13 30700   6 73554  77 74748   9 177.7   0
quota2log        4G 70718  44 71635  13 30433   6 74192  78 74716   9 174.3   0
quota2log        4G 72394  45 71641  13 30681   6 73601  77 74684   9 177.7   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
plain            16  1693  25 +++++ +++  5220  14  1835  27 12712  99  3718  28
plain            16  1707  25 +++++ +++  5225  14  1783  26 12647  99  3665  28
plain            16  1800  26 +++++ +++  5127  15  1830  27 12697  99  3402  26
log              16  8687  88 +++++ +++ +++++ +++ 10006  99 12608  99 23303  99
log              16  9051  91 +++++ +++ +++++ +++ 10014  99 12652  99 23148  99
log              16  9868  99 +++++ +++ +++++ +++ 10027  99 12675  99 23300 100
quota1           16  1639  24 +++++ +++  5220  14  1704  25 12713 100  3614  27
quota1           16  1718  25 +++++ +++  5222  14  1628  24 12744 100  3659  28
quota1           16  1742  25 +++++ +++  4535  13  1854  27 12643  99  3720  28
quota2           16  1729  25 +++++ +++  5188  15  1940  28 12626  99  3743  29
quota2           16  1839  27 +++++ +++  5178  15  1750  25 12699  99  3647  28
quota2           16  1755  26 +++++ +++  5208  15  1739  25 12570  99  3581  27
quota2log        16  9227  94 +++++ +++ +++++ +++  9957  99 12686  99 23035 100
quota2log        16  9807  99 +++++ +++ +++++ +++  9252  92 12649  99 23301  99
quota2log        16  9789  99 +++++ +++ +++++ +++  9263  93 12682  99 23032  99
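
To put a number on the comparison, here is a small Python sketch that
averages the three runs of the sequential block output column (K/sec) from
the first table and compares quota2 against its baseline. The choice of
column and the helper name are mine; the figures are copied from the table.

```python
from statistics import mean

# Sequential Output / Block, K/sec, three runs each (from the table above).
block_write = {
    "plain": [71717, 71906, 71800],
    "log": [71641, 71404, 71581],
    "quota2": [71311, 71486, 71379],
    "quota2log": [71947, 71635, 71641],
}


def relative_change(baseline, candidate):
    """Change of the candidate's mean relative to the baseline's mean."""
    return (mean(candidate) - mean(baseline)) / mean(baseline)


for base, cand in (("plain", "quota2"), ("log", "quota2log")):
    change = relative_change(block_write[base], block_write[cand])
    print(f"{cand} vs {base}: {change:+.2%}")
```

Both differences come out below one percent.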

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 years of experience will always make the difference
--

