Subject: Re: MTD devices in NetBSD
To: Bill Studenmund <wrstuden@netbsd.org>
From: marty fouts <mf.danger@gmail.com>
List: tech-kern
Date: 03/23/2006 13:29:19
For performance and space reasons, NAND needs to have the concept of
two different size "blocks".  The literature calls these various
things, but the easiest terminology to use is erase-block and
write-block.  The write-block is small, either 512 bytes or 2 KB in
most NAND parts, and is the unit of I/O.  You can't read or write
except in write-block increments, and once you've written a
write-block, you can't rewrite it until you erase the erase-block
containing it.  Erase-block sizes are typically much larger, often
128 KB.
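
To make those rules concrete, here is a toy model in Python. It is only
a sketch: the class name, the method names and the default sizes (2 KB
write-blocks, 128 KB erase-blocks) are assumptions for illustration,
not any real driver interface.

```python
class ToyNand:
    """Toy model of a NAND part; names and sizes are illustrative assumptions."""

    def __init__(self, write_block=2048, erase_block=128 * 1024, erase_blocks=4):
        self.write_block = write_block
        # Number of write-blocks contained in one erase-block.
        self.per_erase = erase_block // write_block
        # None means "erased and not yet written".
        self.pages = [None] * (self.per_erase * erase_blocks)

    def write(self, page, data):
        if len(data) != self.write_block:
            raise ValueError("I/O must be exactly one write-block")
        if self.pages[page] is not None:
            raise OSError("write-block already written; erase its erase-block first")
        self.pages[page] = bytes(data)

    def read(self, page):
        data = self.pages[page]
        # Erased NAND reads back as all ones.
        return data if data is not None else b"\xff" * self.write_block

    def erase(self, erase_block):
        # Erasing always clears every write-block in the erase-block.
        start = erase_block * self.per_erase
        for p in range(start, start + self.per_erase):
            self.pages[p] = None
```

With these defaults there are 64 write-blocks per erase-block, so
changing one write-block in place means erasing all 64.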

If you try to match file system block sizes to the erase block size,
then you need to do all sorts of clever stuff in the file system for
putting data from more than one file into a single data block, or you
end up wasting a lot of space.
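
A quick back-of-the-envelope check of the wasted space, as a Python
sketch. The helper name and the 1 KB file size are made up for the
example; it assumes one file per file system block, with no packing of
several files into a block.

```python
def allocated_bytes(file_size, block_size):
    """Bytes consumed on flash when a file is stored in whole blocks."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


KB = 1024
# A 1 KB file in a file system whose block size equals the erase-block size.
big = allocated_bytes(1 * KB, 128 * KB)
# The same file with a block size equal to a 2 KB write-block.
small = allocated_bytes(1 * KB, 2 * KB)
print(big - 1 * KB, small - 1 * KB)  # bytes wasted in each case
```

In the first case 127 KB of the block is wasted; in the second, 1 KB.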

Typically this problem is solved by using a file system block size
that matches the write-block size and modifying the file system so
that it doesn't write zero fill blocks.  But that leads to a need for
garbage collection.
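
The garbage collection need comes from stale write-blocks piling up
inside erase-blocks that still hold live data. A minimal sketch of one
possible victim-selection step follows. The greedy "most stale
write-blocks wins" policy, the function name and the state labels are
my assumptions for illustration; real file systems weigh wear and copy
cost as well.

```python
def pick_victim(page_state, pages_per_erase_block):
    """Pick the erase-block with the most stale write-blocks.

    page_state holds one of "free", "live" or "stale" per write-block.
    Returns (erase_block_index, live_pages_to_copy), or None if no
    erase-block contains anything stale.
    """
    best = None
    for start in range(0, len(page_state), pages_per_erase_block):
        group = page_state[start:start + pages_per_erase_block]
        stale = group.count("stale")
        if stale and (best is None or stale > best[1]):
            best = (start // pages_per_erase_block, stale, group.count("live"))
    if best is None:
        return None
    # Live write-blocks must be copied elsewhere before the erase.
    return best[0], best[2]
```

The second value returned is the cost side of the trade-off: every live
write-block in the victim has to be rewritten somewhere else before the
erase-block can be reclaimed.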

Managing the data structures needed for garbage collection is what
drives the design trade-offs in NAND-aware file systems, and it is the
reason so many of the commercial NAND file systems use an intermediate
layer to handle block management.

Mike Chen, at PalmSource, modified LFS to work with NAND.
Unfortunately, that's PalmSource's IP, and was in PalmOS 6, so we'd
have to do the (estimated 18 man month) project over again to make LFS
work for NAND. Mike's work was pretty carefully tuned to the specific
NAND parts we were working with and took a lot of advantage of
geometry.

JFFS2 tries to work for NAND and NOR by not taking advantage of NAND
OOB storage. It is extremely slow as a result of the overhead that
introduces, and JFFS3 is being written to replace it.

YAFFS2 takes advantage of the NAND OOB data and is much faster than
JFFS2 but, of course, won't work as-is on NOR.  (We've discussed ideas
for modifying it so that a region of NOR could be made to serve as if
it were the NAND OOB. This seems doable.)

I am familiar with three commercial NAND file systems: M-Systems' MDOC,
Datalight's RelianceFS, and Samsung's RFS. MDOC and RFS are optimized to
specific hardware, and all three use a "flash translation layer"
between the FS and the flash part.

You can't directly map a "read-only" file system on top of NAND if
that file system uses block numbers as part of its data structures,
unless you have a remapping layer to deal with bad blocks or are
willing to work only with NAND parts that have no bad blocks. (All NAND
parts have bad blocks...)
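
The simplest remapping layer I can think of is a skip-bad-blocks table,
sketched here in Python. The function name is hypothetical, and the
sketch assumes the bad-block list is known up front (for example from
the factory markings); blocks that go bad later would need more than
this.

```python
def build_remap(total_blocks, bad_blocks):
    """Return a table where table[logical_block] is a good physical block."""
    bad = set(bad_blocks)
    return [block for block in range(total_blocks) if block not in bad]


# An 8-block part with physical blocks 2 and 5 marked bad.
remap = build_remap(8, [2, 5])
print(remap)  # logical block 2 lands on physical block 3, and so on
```

The file system keeps using logical block numbers 0 through 5, and
the usable capacity shrinks by the number of bad blocks.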

There really are about two ways to go with NAND:

1) export a very simple block device model that uses ioctls for
getting/setting configuration, getting/setting OOB data, handling ECC
and erasing and does read/write at the write-block size

or

2) the above plus a "flash translation layer" that does garbage
collection, bad block management, and presents a simple block device
interface.
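
To show how the two options layer, here is a toy Python model. All
class and method names are assumptions for illustration, not a proposed
NetBSD interface. RawNand stands in for option 1 (OOB, ECC and
configuration calls are left out), and SimpleFtl stands in for option 2
with out-of-place writes and bad-block skipping; garbage collection is
deliberately omitted.

```python
class RawNand:
    """Option 1 stand-in: write-block I/O plus erase."""

    def __init__(self, pages_per_block=4, blocks=4, bad_blocks=()):
        self.pages_per_block = pages_per_block
        self.blocks = blocks
        self.bad_blocks = set(bad_blocks)
        self.pages = {}  # physical page number -> data

    def write_page(self, page, data):
        if page in self.pages:
            raise OSError("page already written; erase first")
        self.pages[page] = data

    def read_page(self, page):
        return self.pages.get(page)

    def erase_block(self, block):
        first = block * self.pages_per_block
        for page in range(first, first + self.pages_per_block):
            self.pages.pop(page, None)


class SimpleFtl:
    """Option 2 stand-in: rewritable logical blocks on top of RawNand."""

    def __init__(self, nand):
        self.nand = nand
        self.map = {}  # logical block number -> physical page
        # Usable pages, skipping every page in a bad erase-block.
        self.free = [p for p in range(nand.blocks * nand.pages_per_block)
                     if p // nand.pages_per_block not in nand.bad_blocks]

    def write(self, lbn, data):
        if not self.free:
            raise OSError("no free pages; a real FTL would garbage-collect here")
        page = self.free.pop(0)
        self.nand.write_page(page, data)
        self.map[lbn] = page  # any previous page for lbn is now stale

    def read(self, lbn):
        page = self.map.get(lbn)
        return None if page is None else self.nand.read_page(page)
```

Rewriting a logical block through SimpleFtl never touches the old
physical page; it just moves the mapping, which is what lets an
ordinary block-oriented file system sit on top.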

Marty