Subject: Custom FIFO filesystem from userspace
To: None <tech-kern@netbsd.org>
From: Matthew Mondor <mm_lists@pulsar-zone.net>
List: tech-kern
Date: 11/28/2005 09:27:47
Hi all,

I have been considering using partition devices directly, with custom
recovery log support, for a project: a simple FIFO object store where
the only operations would be the following:

- load/save device configuration (to a special area intended for this)
- storage of new arbitrary-sized files (using a FIFO buffer mechanism
  with random-access I/O functions); on reaching the end of the buffer,
  writing would wrap around to the start and older objects would be
  discarded.  Each file object will be time-stamped via a special index
- query of past objects at specific times or within time ranges (using
  a sort of cursor running through the index).  There will be no
  name-based queries; the time stamp will be used instead.
- fast recovery in case of system crash/restart to minimize boot
  time so that the device may immediately resume its tasks
- there would be no support for concurrent operations.  All commands
  would be executed sequentially.  File additions will occur far more
  frequently than read and query operations.  Between four and sixteen
  files would be written per second, and their sizes would vary between
  30KB and 250KB each.
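
For illustration, the wrap-around and time-range query behaviour listed
above could be modelled roughly as follows.  This is only an in-memory
sketch in Python, not a device implementation: the names FifoStore,
append and query are made up, objects are assumed to be stored
contiguously (never split across the end of the buffer), and timestamps
are assumed to never decrease.

```python
import bisect
from collections import deque


class FifoStore:
    """In-memory model of a fixed-size circular object store (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity   # size of the data area, in bytes
        self.head = 0              # offset where the next object is written
        self.index = deque()       # (timestamp, offset, length), oldest first

    def _evict(self, start, end):
        # The oldest objects are always the first ones in the writer's
        # path, so discarding from the front of the index is sufficient.
        while self.index:
            _, off, length = self.index[0]
            if off < end and start < off + length:
                self.index.popleft()
            else:
                break

    def append(self, timestamp, length):
        """Reserve space for a new object and return its offset."""
        if length > self.capacity:
            raise ValueError("object larger than the buffer")
        if self.head + length > self.capacity:
            # Not enough room before the end: drop whatever still lives
            # in the tail and wrap around to the start of the buffer.
            self._evict(self.head, self.capacity)
            self.head = 0
        self._evict(self.head, self.head + length)
        offset = self.head
        self.index.append((timestamp, offset, length))
        self.head += length
        return offset

    def query(self, t_from, t_to):
        """Return index entries with t_from <= timestamp <= t_to."""
        stamps = [entry[0] for entry in self.index]
        lo = bisect.bisect_left(stamps, t_from)
        hi = bisect.bisect_right(stamps, t_to)
        return list(self.index)[lo:hi]
```

A real implementation would keep the index on the device as well; the
model only shows which objects get discarded when the writer wraps.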

RAID could potentially be used for reliability, but it is not yet clear
whether we'll instead use a custom mirroring implementation with a
second device.

While searching on the subject, i.e. for RDBMSs which support using
partitions for storage, I came across documents recommending raw
character devices over block devices (apparently because block
operations are buffered internally).  Does anyone know if this is the
case with NetBSD, and whether using the raw character device would
really ensure unbuffered I/O?  Also, if using write(2), should fsync(2)
or fdatasync(2) still be used to ensure synchronization to a raw device?

Other questions which have been raised relate to the geometry: would
it be advisable to use the BIOS and/or NetBSD-provided geometries
obtained via ioctl(2), for performance and/or reliability reasons?
Or can the system simply be designed around, say, 64KB blocks, with
everything implemented using such aligned, fixed-size blocks?
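
To make the fixed-block idea concrete, here is a small sketch of the
arithmetic involved.  It assumes 64KB means 65536 bytes and that every
object starts on a block boundary; the helper names are hypothetical.

```python
BLOCK_SIZE = 64 * 1024  # assumed block size, in bytes


def blocks_needed(length):
    """Number of whole blocks required to hold `length` bytes."""
    return (length + BLOCK_SIZE - 1) // BLOCK_SIZE


def align_up(length):
    """Round `length` up to the next multiple of BLOCK_SIZE."""
    return blocks_needed(length) * BLOCK_SIZE


def padding(length):
    """Bytes wasted in the last block of an object of `length` bytes."""
    return align_up(length) - length
```

With these assumptions a 30KB object occupies one block (34KB of
padding) and a 250KB object occupies four blocks (6KB of padding), so
the space overhead depends heavily on the size distribution.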

Other considerations I have to take into account relate to reliability
and fast crash recovery.  I am familiar with custom logging techniques
that achieve fast recovery by discarding any partial transactions, and
have implemented them on various occasions for projects, but those used
a system of log files with automatic rotation, and an explicit
fdatasync(2) after a number of elements were written or a certain amount
of idle time had elapsed.  They were built exclusively on top of Unix
filesystems.
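
The "sync after N elements or after some idle time" policy mentioned
above might look roughly like this.  It is a sketch only: SyncPolicy and
its parameters are hypothetical names, the thresholds are arbitrary, and
os.fsync is used as a stand-in for whichever synchronization call is
appropriate.

```python
import os
import time


class SyncPolicy:
    """Decide when to flush, based on write count and idle time."""

    def __init__(self, fd, max_pending=32, max_idle=0.5,
                 sync=os.fsync, clock=time.monotonic):
        self.fd = fd
        self.max_pending = max_pending  # flush after this many elements
        self.max_idle = max_idle        # or after this many idle seconds
        self.sync = sync
        self.clock = clock
        self.pending = 0                # elements written since last flush
        self.last_write = clock()

    def wrote(self):
        """Call after each log element has been written."""
        self.pending += 1
        self.last_write = self.clock()
        if self.pending >= self.max_pending:
            self.flush()

    def tick(self):
        """Call periodically, e.g. from the main loop."""
        if self.pending and self.clock() - self.last_write >= self.max_idle:
            self.flush()

    def flush(self):
        self.sync(self.fd)
        self.pending = 0
```

The sync and clock arguments are injectable only so that the policy can
be exercised without a real descriptor or real delays.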

This time I'll have to deal with a raw device instead.  I can imagine
using a special FIFO buffer for log entries, and possibly still using
fsync(2) or fdatasync(2) to ensure synchronization to the device (if
needed), but are there any other special considerations I should be
aware of?  Should there, for instance, be a duplicate backup copy of the
log?  Would it be possible for a block to be only partially written to
disk if the system crashes?  Should I use a double-write technique?
Sequence numbers would make it possible to determine the most recent log
entry, but is there any way to guarantee that a transaction log block is
truly committed?
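
As a sketch of what is meant by sequence numbers here, combined with a
checksum so that a partially written block can at least be detected at
recovery time: the record layout, the 512-byte block size and the
function names below are all assumptions made for the example.  It does
not answer whether the device itself guarantees anything; it only lets
recovery ignore blocks that do not verify.

```python
import struct
import zlib

BLOCK_SIZE = 512                 # assumed log block size, in bytes
HEADER = struct.Struct("<QI")    # sequence number, payload length
TRAILER = struct.Struct("<I")    # CRC32 over header + payload


def encode_record(seq, payload):
    """Build one fixed-size log block for (seq, payload)."""
    body = HEADER.pack(seq, len(payload)) + payload
    record = body + TRAILER.pack(zlib.crc32(body))
    if len(record) > BLOCK_SIZE:
        raise ValueError("payload too large for one log block")
    return record.ljust(BLOCK_SIZE, b"\0")


def decode_record(block):
    """Return (seq, payload), or None if the block does not verify."""
    if len(block) < HEADER.size + TRAILER.size:
        return None
    seq, length = HEADER.unpack_from(block, 0)
    end = HEADER.size + length
    if end + TRAILER.size > len(block):
        return None
    (crc,) = TRAILER.unpack_from(block, end)
    if crc != zlib.crc32(block[:end]):
        return None
    return seq, bytes(block[HEADER.size:end])


def latest_valid(blocks):
    """Pick the verified record with the highest sequence number."""
    best = None
    for block in blocks:
        rec = decode_record(block)
        if rec is not None and (best is None or rec[0] > best[0]):
            best = rec
    return best
```

With two or more log slots written alternately, recovery would read all
of them and call latest_valid(); a torn newest record then falls back to
the previous one.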

Any details about these aspects would be much appreciated.  I am about
to set up a test box to develop a sample implementation and perform
thorough testing, especially of the recovery part; it's likely that I'll
need to power off the device during writes and evaluate recovery
efficiency, etc.  Other filesystems would either be in memory only
or mounted read-only so that fsck will not be necessary.

Losing a few seconds' worth of data in the event of an unexpected crash
isn't a problem, but the system should be clean, unfragmented and
reliable after a very quick recovery (except obviously in case of
hardware problems).

Thanks,
Matt

-- 
Note: Please only reply on the list, other mail is blocked by default.
Private messages from your address can be allowed by first asking.