Subject: Re: MTD devices in NetBSD
To: Garrett D'Amore <garrett_damore@tadpole.com>
From: Garrett D'Amore <garrett_damore@tadpole.com>
List: tech-kern
Date: 03/23/2006 12:38:32
Bill Studenmund wrote:
> On Thu, Mar 23, 2006 at 10:26:31AM -0800, Garrett D'Amore wrote:
>
>> Bill Studenmund wrote:
>>
>>>
>>> We can do this even within a block device.
>>>
>>> Well-chosen calls to your strategy routine will work smoothly, and you have
>>> an ioctl interface for things like erase and whatever other calls you
>>> need.
>>>
>>> I guess a way to put it is to think of using one interface in two
>>> different ways as opposed to an interface "below" another one.
>>>
>> I've been thinking about this as well. I think this idea implies that
>> the "block" size of these things would match that native sector size.
>>
>
> Yes & no. We can look at how cd9660 handles this, as it has the same
> issue (2k sectors != 512 byte sectors).
>
Thanks for the reference. But of course, 2k/512 (4:1) is a *very*
different ratio from 64k/512 (128:1). Especially since the media size
for cd9660 is usually hundreds of MB, and for flash it is likely to be
~10MB.
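To put numbers on those ratios, and on the file-count figures in the
quoted text further down (32, 128, and 256 files), here is a minimal
sketch of the arithmetic. The function names are made up for
illustration, and the file-count bound assumes each file occupies at
least one whole sector and ignores filesystem metadata.

```c
#include <assert.h>
#include <stddef.h>

/* Number of logical blocks that share one erase sector.
 * Hypothetical helper; assumes the sector is a whole multiple of the block. */
size_t blocks_per_sector(size_t sector_bytes, size_t block_bytes)
{
    assert(block_bytes != 0 && sector_bytes % block_bytes == 0);
    return sector_bytes / block_bytes;
}

/* Upper bound on the number of files when every file consumes at least
 * one full sector (1:1 block/sector mapping, metadata ignored). */
size_t max_files_one_per_sector(size_t flash_bytes, size_t sector_bytes)
{
    assert(sector_bytes != 0);
    return flash_bytes / sector_bytes;
}
```

With these, blocks_per_sector(2048, 512) is 4 for the cd9660 case and
blocks_per_sector(65536, 512) is 128 for 64K NOR sectors.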
>
>> Mapping blocks to sectors 1:1 also means that for a lot of filesystems,
>> you are going to have a lot of waste (e.g. does the filesystem allow for
>> files to use less than a full device block) -- and this could be very,
>> very undesirable on some systems. (E.g. 128K minimum file size on 4MB
>> flash limits you to only 32 files. 16MB only gives 128 files.) 128K
>> sector sizes are rare, but 64K sector sizes are *very* common. So you
>> get 256 files in a 16MB "common" case.
>>
>> Hence, I think 1:1 block/sector mapping is a poor (even unworkable) choice.
>>
>
> Can you read less than a block in these things?
>
For NOR, absolutely. Many NOR systems are actually mapped *directly*
into system memory. I presume this to be true (that you can read less
than a sector, not the mapping bit) for NAND, but I confess I'm still
largely ignorant of NAND.
>
>> So, if the abstraction is going to use a smaller block size -- say 512
>> bytes -- to get good allocation, we have other problems:
>>
>> For the rest of the discussion, let's assume a 64K sector size (the most
>> common NOR flash size, I think):
>>
>> A naive implementation would make updating a sector an erase/modify
>> cycle. Obviously this is bad, because writing (or updating) a 64K file
>> now requires 128 erase cycles. Erase takes a long time, and wears down
>> flash. This is unworkable.
>>
>
> Wait, I'm now confused. I thought we had one of three cases:
>
> 1) we have a flash-unaware file system sitting on a flash. This would be
> intended as a r/o kinda thing to help with bring-up.
>
Yes.
> 2) We have a flash-unaware file system on top of a wear-leveling layer on
> the flash. This should work r/w.
>
I'm not necessarily proposing this. Others may be, but not me.
> 3) We have a flash-aware file system sitting on a flash.
>
> The case above isn't one of those three, so why do we care?
>
I think we're misunderstanding each other. Updating a sector (for any
r/w case) where you modify less than the whole sector at once creates
the problem. This happens in case #2 above. (And also case #3 if the
flash-aware system uses a block size != sector size, and wants to update
a large file. My understanding of strategy is that you only get one
block at a time, not a list for the entire file.)
>
>> So a non-naive implementation means you have to look at the bits you are
>> updating to decide whether or not an erase is necessary. This means
>> knowing the "set/clear" behavior of the bits, which isn't a problem.
>> (The devices I've seen are all "set" on erase, and you can only clear
>> individual bits.)
>>
>> But now, when I'm writing a 64K file I'm going to have to do 128 reads
>> and 128 writes. And, if the sector unfortunately has a single bit clear
>> near the end, I won't have detected this case, and I wind up having to
>> do a read-modify-write even after I've done all the work to try to
>> avoid it.
>>
>
> I'm still confused. :-) 1) I don't think a file system will really use
> 512-byte blocks internally. You'd have to specifically set it, and I'm not
> sure it'd be worth it.
>
You need small blocks in any case (64k is too wasteful). Even at 8K
you still have 8 read/modify cycles.
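To illustrate the counts (128 cycles at 512 bytes, 8 at 8K), here is a
rough simulation of the naive erase/modify approach against a single
in-memory 64K sector. Everything here (struct sim_sector, sim_erase,
naive_block_write, and so on) is a hypothetical model for illustration,
not an existing NetBSD interface. It assumes erase sets all bits and
programming can only clear them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE (64u * 1024u)   /* assumed NOR erase-sector size */

/* Hypothetical in-memory model of one flash sector. */
struct sim_sector {
    unsigned char data[SECTOR_SIZE];
    unsigned long erase_count;
};

/* Erase sets every bit in the sector and costs one erase cycle. */
void sim_erase(struct sim_sector *s)
{
    memset(s->data, 0xff, SECTOR_SIZE);
    s->erase_count++;
}

/* Programming can only clear bits, so new data is ANDed into the cells. */
void sim_program(struct sim_sector *s, size_t off,
                 const unsigned char *src, size_t len)
{
    assert(off <= SECTOR_SIZE && len <= SECTOR_SIZE - off);
    for (size_t i = 0; i < len; i++)
        s->data[off + i] &= src[i];
}

/* Naive block update: copy the sector to RAM, patch the block there,
 * erase the sector, then program the whole sector back. */
void naive_block_write(struct sim_sector *s, size_t off,
                       const unsigned char *src, size_t len)
{
    static unsigned char shadow[SECTOR_SIZE];

    assert(off <= SECTOR_SIZE && len <= SECTOR_SIZE - off);
    memcpy(shadow, s->data, SECTOR_SIZE);
    memcpy(shadow + off, src, len);
    sim_erase(s);
    sim_program(s, 0, shadow, SECTOR_SIZE);
}

/* Write one sector-sized file in block_bytes chunks and return the
 * number of erase cycles that the naive approach incurred. */
unsigned long naive_file_write(struct sim_sector *s,
                               const unsigned char *file, size_t block_bytes)
{
    unsigned long before = s->erase_count;

    assert(block_bytes != 0 && SECTOR_SIZE % block_bytes == 0);
    for (size_t off = 0; off < SECTOR_SIZE; off += block_bytes)
        naive_block_write(s, off, file + off, block_bytes);
    return s->erase_count - before;
}
```

Calling naive_file_write() with block_bytes of 512 and then 8192 lets
you compare the erase counts for the two block sizes on the same data.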
> 2) If you're writing a 64k file, you aren't going to have 512-byte writes
> coming in unless you've mis-configured dd. ;-) stdio will do 8k i/o, and
> you'll get better performance with large block sizes in dd...
>
Okay.
>
>> If I operate on sectors natively, and expose that to the filesystem,
>> then the filesystem can do an upfront check, erase the sector as needed,
>> and *then* do the write, all at once. (Assuming again we are writing a
>> 64k file.) Since the filesystem knows it's a 64k write, it can do "the
>> right thing".
>>
>> I think this means that the filesystem should *really* have a lot more
>> direct control over the device, and be able to operate on sectors rather
>> than blocks. (And we've already ruled out a 1:1 sector/block mapping,
>> at least if you are going to want to be able to put any other kind of
>> ordinary filesystem down on these for a readonly filesystem.)
>>
>> Therefore, I'm coming to the conclusion that we need to expose *sectors*
>> to a flash-aware filesystem, and the block abstraction is poor for these
>> filesystems.
>>
>> Am I missing something here?
>>
>
> I think you're painting yourself into corners we don't need to be trapped
> in.
>
Possibly. But exposing the details rather than hiding them seems to me
to be *avoiding* corners. If I hide sector and wear-leveling details
behind some kind of meta-device or flash translation layer, then I fear
it will limit choices that we can make otherwise.
> If the flash-unaware fs is only used in r/o mode, why do we need to worry
> about its write performance?
>
We don't. But the flash-aware filesystem needs to have access to
something other than blocks/strategy, I think.
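As a sketch of what sector-level access could buy, the following models
the "upfront check, erase as needed, then write" sequence from the
quoted text against an in-memory sector. The names (struct flash_sector,
flash_needs_erase, flash_write_sector) are invented for illustration and
are not a proposed or existing kernel API. The model assumes
set-on-erase cells where programming can only clear bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FLASH_SECTOR_SIZE (64u * 1024u)   /* assumed erase-sector size */

/* Hypothetical in-memory model of one flash sector. */
struct flash_sector {
    unsigned char data[FLASH_SECTOR_SIZE];
    unsigned long erase_count;
};

/* An erase is needed only if some bit that is currently 0 must become 1;
 * any change that merely clears bits can be programmed in place. */
bool flash_needs_erase(const unsigned char *cur,
                       const unsigned char *want, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if ((cur[i] & want[i]) != want[i])
            return true;
    return false;
}

/* Erase sets every bit in the sector and costs one erase cycle. */
void flash_erase(struct flash_sector *s)
{
    memset(s->data, 0xff, FLASH_SECTOR_SIZE);
    s->erase_count++;
}

/* Whole-sector write: one upfront check over the full sector, at most
 * one erase, then program (AND in) the new contents. */
void flash_write_sector(struct flash_sector *s, const unsigned char *want)
{
    assert(s != NULL && want != NULL);
    if (flash_needs_erase(s->data, want, FLASH_SECTOR_SIZE))
        flash_erase(s);
    for (size_t i = 0; i < FLASH_SECTOR_SIZE; i++)
        s->data[i] &= want[i];
}
```

Because the caller hands over the whole 64K at once, the check sees a
stray cleared bit near the end of the sector before anything is
programmed, and the write costs at most one erase.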
> The, "It's HARD to solve the problem," reason is quite reasonable at
> times, and this may well be one.
>
Heh. Maybe. I'm trying to make the problem tractable, because I need
to implement *something*, and soon.
-- Garrett
> Take care,
>
> Bill
>
--
Garrett D'Amore, Principal Software Engineer
Tadpole Computer / Computing Technologies Division,
General Dynamics C4 Systems
http://www.tadpolecomputer.com/
Phone: 951 325-2134 Fax: 951 325-2191