Re: adding linux syscall fallocate

On Mon, Nov 18, 2019, 23:34 Jason Thorpe <thorpej%me.com@localhost> wrote:

> On Nov 17, 2019, at 11:21 AM, HRISHIKESH GOYAL <hrishi.goyal%gmail.com@localhost> wrote:
>
> Questions:
> 1. As what I follow from the above stackoverflow answer and truncate man page, even though `truncate` doesn't allocate space for file baz but filesystem should still update the free space by reducing it to 0.3G(otherwise filesystem metadata are not consistent with file metadata). Could anyone please correct me?
>
> 2. Does it mean that `truncate` only updates file vnode (i.e. size) attribute and doesn't update super block (free_space) attribute?
>
> 3. I checked first 100 bytes in both above files using c lang fread() function, all are filled with NULL character ( '\0' ), how file bar (previously fallocate'ed file) got initialised with NULLs(as per my understanding since they are uninitialised, they should be some random bytes.. and not all nulls right?).

I think what you are missing is that many file systems support sparse files. Consider an application that does:

1- Create file "foo".
2- Write a single byte to offset 0.
3- Write a single byte to offset (4GiB-1).

That file will have a logical size of 4GiB; this size is recorded in the inode. However, on FFS, it will only have 2 file system blocks allocated. The direct and indirect block pointers for the whole middle range will not point to any physical space on disk[*], and when an application reads from that range, the file system will return zero-filled pages.

[*] ...a little bit of hand-waving some of the details here; some of the indirect block pointers will in fact be filled in, because they are needed to be able to find the block at the end of the file that's actually allocated, and at 4GiB, you're definitely into indirect block territory.
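
For anyone who wants to see this on their own system, here is a minimal sketch of the three steps above. The helper names make_sparse_file() and read_byte_at() are made up for illustration, and the feature-test macros are an assumption about the build environment. What st_blocks reports depends on the file system, so treat the numbers as indicative only.

```c
#define _XOPEN_SOURCE 700
#define _FILE_OFFSET_BITS 64   /* assumption: ask for a 64-bit off_t on 32-bit hosts */
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create `path`, write one byte at offset 0 and one
 * byte at offset size-1, then report the fstat() results through *st.
 * Returns 0 on success, -1 on error.
 */
int make_sparse_file(const char *path, off_t size, struct stat *st)
{
    assert(size >= 2);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    int rc = -1;
    if (pwrite(fd, "A", 1, 0) == 1 &&          /* first byte of the file */
        pwrite(fd, "Z", 1, size - 1) == 1 &&   /* last byte; the middle stays a hole */
        fstat(fd, st) == 0)
        rc = 0;

    close(fd);
    return rc;
}

/* Hypothetical helper: read a single byte at `off`; returns it, or -1 on error. */
int read_byte_at(const char *path, off_t off)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    unsigned char c;
    ssize_t n = pread(fd, &c, 1, off);
    close(fd);
    return n == 1 ? c : -1;
}
```

Calling make_sparse_file("foo", (off_t)4 << 30, &st) and then comparing st.st_size against st.st_blocks (conventionally counted in 512-byte units) should show a large logical size backed by very little allocated space on a file system that supports holes, and read_byte_at() anywhere in the middle should return 0.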

This is similar to what happens when you call truncate() on a file with a size beyond the current EOF, only in that case, you don't need to write a byte to the end to get the size to change; there's simply no block allocated at the end of the file.
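
A rough sketch of the truncate case, using ftruncate() on an open descriptor. The helper name extend_with_ftruncate() is invented for this example and the feature-test macros are an assumption about the build environment.

```c
#define _XOPEN_SOURCE 700
#define _FILE_OFFSET_BITS 64   /* assumption: ask for a 64-bit off_t on 32-bit hosts */
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an empty file and extend it to `size` bytes
 * with ftruncate(), without writing any data. Reports fstat() results
 * through *st. Returns 0 on success, -1 on error.
 */
int extend_with_ftruncate(const char *path, off_t size, struct stat *st)
{
    assert(size >= 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    int rc = -1;
    if (ftruncate(fd, size) == 0 &&   /* only the recorded size changes */
        fstat(fd, st) == 0)
        rc = 0;

    close(fd);
    return rc;
}
```

After the call, st.st_size equals the requested size, while st.st_blocks will typically stay at or near zero on a file system with sparse-file support, which is why the free space reported for the file system does not drop.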

Now, what happens if you do a posix_fallocate() on "foo" with offset 0 and length 4GiB (the call itself takes a file descriptor, not a name)? The file system will have to allocate all of the necessary space, FILL IT WITH ZEROS, and fill in the direct and indirect block pointers in the inode.
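
For comparison, a minimal sketch of the preallocation case. The helper name preallocate_file() is invented here and the feature-test macros are an assumption about the build environment. Note that posix_fallocate() returns an error number directly rather than setting errno, and that it can fail on file systems that do not support the operation.

```c
#define _XOPEN_SOURCE 700
#define _FILE_OFFSET_BITS 64   /* assumption: ask for a 64-bit off_t on 32-bit hosts */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create `path` and reserve `len` bytes of real
 * storage for it starting at offset 0. Reports fstat() results through *st.
 * Returns 0 on success, or an errno-style error number on failure.
 */
int preallocate_file(const char *path, off_t len, struct stat *st)
{
    assert(len > 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return errno;

    /* posix_fallocate() returns the error number itself, not -1. */
    int err = posix_fallocate(fd, 0, len);
    if (err == 0 && fstat(fd, st) != 0)
        err = errno;

    close(fd);
    return err;
}
```

On success, st.st_size is at least len and, unlike the ftruncate() case, st.st_blocks should now account for roughly the whole range, so the file system's free space goes down accordingly.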

Now, a file system is allowed to make an optimization here. The posix_fallocate() specification does state that if offset+len is beyond the current file size, the file size will be updated, i.e. it behaves like ftruncate() in that regard. However, the file system is allowed to NOT zero out the space SO LONG AS it knows that the space is uninitialized and thus returns zero-filled pages when the space is read. This allows the file system to avoid redundantly filling the space with zeros only to have those zeros overwritten with actual data later. This is good for performance AND for reducing P/E (program/erase) cycles on flash storage. This would require an additional size field in the inode to indicate the end of the initialized space (this information would have to persist across unmounts, and essentially represents an incompatible format change in the case of FFS, since software that does not understand this extra field could not safely mount the file system).
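
To make the "additional size field" idea concrete, here is a toy in-memory model. It is not FFS code and not how any particular file system implements this; struct toy_inode, init_size and the toy_* functions are all invented for illustration. The point is only that reads past init_size return zeros regardless of what stale bytes sit in the allocated storage.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_CAPACITY 64   /* hypothetical: total storage behind one toy file */

struct toy_inode {
    size_t size;        /* logical file size (EOF) */
    size_t init_size;   /* end of initialized data; always <= size */
    unsigned char blocks[TOY_CAPACITY];   /* allocated storage, may hold stale bytes */
};

/* Allocate [0, len) WITHOUT zeroing it: only the logical size grows. */
void toy_fallocate(struct toy_inode *ip, size_t len)
{
    assert(len <= TOY_CAPACITY);
    if (len > ip->size)
        ip->size = len;
}

/* Read up to n bytes at off; anything at or past init_size reads as zero. */
size_t toy_read(const struct toy_inode *ip, size_t off, unsigned char *buf, size_t n)
{
    if (off >= ip->size)
        return 0;
    if (n > ip->size - off)
        n = ip->size - off;
    for (size_t i = 0; i < n; i++) {
        size_t pos = off + i;
        buf[i] = pos < ip->init_size ? ip->blocks[pos] : 0;
    }
    return n;
}

/* Write n bytes at off, zeroing any gap between init_size and off first. */
size_t toy_write(struct toy_inode *ip, size_t off, const unsigned char *buf, size_t n)
{
    if (off >= TOY_CAPACITY)
        return 0;
    if (n > TOY_CAPACITY - off)
        n = TOY_CAPACITY - off;
    if (n == 0)
        return 0;
    if (off > ip->init_size)   /* the gap becomes initialized, so it must be zeroed now */
        memset(ip->blocks + ip->init_size, 0, off - ip->init_size);
    memcpy(ip->blocks + off, buf, n);
    if (off + n > ip->init_size)
        ip->init_size = off + n;
    if (ip->init_size > ip->size)
        ip->size = ip->init_size;
    return n;
}
```

In this model the cost of zeroing is deferred until a write lands beyond init_size, and a purely sequential writer never pays it at all, which is the case the optimization is aimed at.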

Technically, a file system is allowed to make that optimization for the "allocate to fill in a sparse hole" case as well, but it would require a bunch of extra metadata to track the valid ranges of the file, and so probably isn't worth it.

-- thorpej