Subject: Re: MTD devices in NetBSD
To: Garrett D'Amore <garrett_damore@tadpole.com>
From: Garrett D'Amore <garrett_damore@tadpole.com>
List: tech-kern
Date: 03/23/2006 12:38:32
Bill Studenmund wrote:
> On Thu, Mar 23, 2006 at 10:26:31AM -0800, Garrett D'Amore wrote:
>   
>> Bill Studenmund wrote:
>>     
>>>   
>>> We can do this even within a block device.
>>>
>>> Well-chosen calls to your strategy routine will work smoothly, and you have 
>>> an ioctl interface for things like erase and whatever other calls you 
>>> need.
>>>
>>> I guess a way to put it is to think of using one interface in two 
>>> different ways as opposed to an interface "below" another one.
>>>       
>> I've been thinking about this as well.  I think this idea implies that
>> the "block" size of these things would match that native sector size. 
>>     
>
> Yes & no. We can look at how cd9660 handles this, as it has the same 
> issue (2k sectors != 512 byte sectors).
>   

Thanks for the reference.  But of course, 2k/512 is a *very* different
ratio from 64k/512, especially since the media size for cd9660 is
usually hundreds of MB, while for flash it is likely to be ~10MB.
>   
>> Mapping blocks to sectors 1:1 also means that for a lot of filesystems
>> you are going to have a lot of waste (i.e. unless the filesystem allows
>> files to occupy less than a full device block) -- and this could be very,
>> very undesirable on some systems.  (E.g. a 128K minimum file size on 4MB
>> of flash limits you to only 32 files; 16MB only gives 128 files.)  128K
>> sector sizes are rare, but 64K sector sizes are *very* common, so you
>> get 256 files in the 16MB "common" case.
>>
>> Hence, I think 1:1 block/sector mapping is a poor (even unworkable) choice.
>>     
>
> Can you read less than a block in these things?
>   

For NOR, absolutely.  Many NOR systems are actually mapped *directly*
into system memory.  I presume the same is true for NAND (that you can
read less than a sector, not the direct mapping), but I confess I'm
still largely ignorant of NAND.
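
(To make the sub-sector-read point concrete, here's a rough user-level
sketch.  The struct and function names are made up for illustration;
this is not an existing NetBSD interface.  It just shows why a
memory-mapped NOR part lets you read any byte range you like:)

#include <stdint.h>
#include <stddef.h>

/* Hypothetical view of a NOR array mapped straight into the address space. */
struct nor_map {
	const uint8_t	*base;		/* start of the mapped flash array */
	size_t		 size;		/* total array size in bytes */
};

/* Read an arbitrary byte range; no 64K-sector granularity is involved. */
static int
nor_read(const struct nor_map *nm, size_t off, uint8_t *buf, size_t len)
{
	size_t i;

	if (off > nm->size || len > nm->size - off)
		return -1;			/* out of range */
	for (i = 0; i < len; i++)
		buf[i] = nm->base[off + i];	/* plain loads from the array */
	return 0;
}
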
>   
>> So, if the abstraction is going to use a smaller block size -- say 512
>> bytes -- to get good allocation, we have other problems:
>>
>> For the rest of the discussion, lets assume a 64K sector size (the most
>> common NOR flash size, I think):
>>
>> A naive implementation would make updating a sector an erase/modify
>> cycle.  Obviously this is bad, because writing (or updating) a 64K file
>> now requires 128 erase cycles.  Erase takes a long time, and wears down
>> flash.  This is unworkable.
>>     
>
> Wait, I'm now confused. I thought we had one of three cases:
>
> 1) we have a flash-unaware file system sitting on a flash. This would be 
> intended as a r/o kinda thing to help with bring-up.
>   
Yes.
> 2) We have a flash-unaware file system on top of a wear-leveling layer on 
> the flash. This should work r/w.
>   

I'm not necessarily proposing this.  Others may be, but not me.

> 3) We have a flash-aware file system sitting on a flash.
>
> The case above isn't one of those three, so why do we care?
>   

I think we're misunderstanding each other.  Updating a sector (in any
r/w case) where you modify less than the whole sector at once creates
the problem.  This happens in case #2 above.  (And also in case #3, if
the flash-aware filesystem uses a block size != sector size and wants to
update a large file.  My understanding of strategy is that you only get
one block at a time, not a list for the entire file.)
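
(For what it's worth, here's a rough sketch of what I mean; the flash_*
calls are placeholders for whatever the real driver would provide, not
existing functions.  With 512-byte blocks on 64K erase sectors, a naive
driver turns every small write into a whole-sector read/erase/rewrite:)

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE	(64 * 1024)
#define BLOCK_SIZE	512

/* Placeholder prototypes; stand-ins for the real driver entry points. */
int flash_read_sector(unsigned secno, uint8_t *buf);
int flash_erase_sector(unsigned secno);
int flash_program_sector(unsigned secno, const uint8_t *buf);

/* Naive 512-byte block write on top of 64K erase sectors. */
static int
naive_write_block(unsigned blkno, const uint8_t *data)
{
	static uint8_t sector[SECTOR_SIZE];
	unsigned blks_per_sec = SECTOR_SIZE / BLOCK_SIZE;	/* 128 */
	unsigned secno = blkno / blks_per_sec;
	size_t off = (size_t)(blkno % blks_per_sec) * BLOCK_SIZE;

	flash_read_sector(secno, sector);	/* read back all 64K */
	memcpy(sector + off, data, BLOCK_SIZE);	/* patch in 512 bytes */
	flash_erase_sector(secno);		/* erase the whole sector */
	return flash_program_sector(secno, sector); /* rewrite all 64K */
}

Writing a 64K file one 512-byte block at a time through something like
that is exactly the 128-erase-cycle case I was describing.
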
>   
>> So a non-naive implementation means you have to look at the bits you are
>> updating to decide whether or not an erase is necessary.  This means
>> knowing the "set/clear" behavior of the bits, which isn't a problem. 
>> (The devices I've seen are all "set" on erase, and you can only clear
>> individual bits.)
>>
>> But now, when I'm writing a 64K file I'm going to have to do 128 reads
>> and writes.  And if the sector unfortunately has a single bit clear near
>> the end and I haven't detected that case, I wind up having to do a
>> read-modify-write even after I've done all the work to try to avoid it.
>>     
>
> I'm still confused. :-) 1) I don't think a file system will really use 
> 512-byte blocks internally. You'd have to specifically set it, and I'm not 
> sure it'd be worth it.
>   
You need small blocks in any case (64k is too wasteful).  Even at 8K
you still have 8 read/modify cycles per 64K sector.
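
(As an aside, the "look at the bits" check from the quoted text above is
cheap to state, even if the bookkeeping around it isn't.  Assuming NOR
semantics where erase sets every bit to 1 and programming can only clear
bits, new data can overwrite old data without an erase only if it never
needs to turn a 0 back into a 1.  A toy sketch, with made-up names:)

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/*
 * Return true if writing 'new' over 'old' would require an erase first,
 * i.e. if any bit that is currently 0 would have to become 1 again.
 */
static bool
needs_erase(const uint8_t *old, const uint8_t *new, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if ((old[i] & new[i]) != new[i])
			return true;
	return false;
}

The catch, as noted above, is that a single stray cleared bit late in
the sector still forces the full erase/rewrite.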

> 2) If you're writing a 64k file, you aren't going to have 512-byte writes 
> coming in unless you've mis-configured dd. ;-) stdio will do 8k i/o, and 
> you'll get better performance with large block sizes in dd...
>   

Okay.
>   
>> If I operate on sectors natively, and expose that to the filesystem,
>> then the filesystem can do an upfront check, erase the sector as needed,
>> and *then* do the write, all at once.  (Assuming again we are writing a
>> 64k file.)  Since the filesystem knows it's a 64k write, it can do "the
>> right thing".
>>
>> I think this means that the filesystem should *really* have a lot more
>> direct control over the device, and be able to operate on sectors rather
>> than blocks.  (And we've already ruled out a 1:1 sector/block mapping,
>> at least if you are going to want to be able to put any other kind of
>> ordinary filesystem down on these for a readonly filesystem.)
>>
>> Therefore, I'm coming to the conclusion that we need to expose *sectors*
>> to a flash-aware filesystem, and the block abstraction is poor for these
>> filesystems.
>>
>> Am I missing something here?
>>     
>
> I think you're painting yourself into corners we don't need to be trapped 
> in.
>   

Possibly.  But exposing the details rather than hiding them seems to me
to be *avoiding* corners.  If I hide sector and wear-leveling details
behind some kind of meta-device or flash translation layer, then I fear
it will limit choices that we can make otherwise.

> If the flash-unaware fs is only used in r/o mode, why do we need to worry 
> about its write performance?
>   

We don't.  But the flash-aware filesystem needs to have access to
something other than blocks/strategy, I think.
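
(Purely as a strawman, the kind of thing I have in mind is a small
sector-level interface alongside the usual block entry points.  None of
these names exist today; they are just illustrative:)

#include <sys/types.h>
#include <stddef.h>

/* Hypothetical sector-level ops a flash-aware filesystem could call. */
struct flash_ops {
	size_t	fo_sector_size;		/* erase-unit size, e.g. 64K */
	int	(*fo_read)(void *dev, off_t off, void *buf, size_t len);
	int	(*fo_write)(void *dev, off_t off, const void *buf, size_t len);
	int	(*fo_erase)(void *dev, unsigned secno);	/* whole sector only */
};

With something like that, the filesystem can check the sector, erase it
only when it actually needs to, and program the whole thing in one pass,
which is the "upfront check" case I described above.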

> The, "It's HARD to solve the problem," reason is quite reasonable at 
> times, and this may well be one.
>   

Heh.  Maybe.  I'm trying to make the problem tractable, because I need
to implement *something*, and soon.

    -- Garrett
> Take care,
>
> Bill
>   


-- 
Garrett D'Amore, Principal Software Engineer
Tadpole Computer / Computing Technologies Division,
General Dynamics C4 Systems
http://www.tadpolecomputer.com/
Phone: 951 325-2134  Fax: 951 325-2191