tech-kern: Re: RAW access to files

Subject: Re: RAW access to files
To: Chuck Silvers <chuq@chuq.com>
From: Wojciech Puchar <wojtek@chylonia.3miasto.net>
List: tech-kern
Date: 12/12/2001 22:55:55

> one situation where unbuffered i/o is useful is when the application is
> doing random i/o on a data set much larger than memory (eg. a database).
> in this case, read ahead is not useful, and the application is usually doing
> its own caching, so caching the data again in the kernel wastes memory.
good example
> the other benefit from doing the i/o straight into the application's
> memory is that it saves copying the data from the cache to the
exactly. with modern disks, skipping the memcpy is a big saving

> application's memory.  obviously this isn't something that most applications
> would be interested in, but for databases it's a big improvement.
>
> for applications that do large runs of sequential i/o, the same logic
> applies as well, if the application isn't going to access the data multiple
> times and it uses i/os of at least 64k.  read ahead doesn't gain you all
> that much when you're doing large i/os, especially on modern disks that
> do read ahead into the cache in the disk.  in this case there's basically
> no difference to the disk driver between doing the i/o to filesystem cache
> pages vs. doing the i/o straight to the application, and doing it direct saves
> a memory-to-memory copy.  for large i/os, we can also use UVM loaning to
> avoid the memory-to-memory copy in some cases, but in cases where loaning
> is impossible (or undesirable for whatever reason) then direct i/o
> would also be valuable.

what about an algorithm like this?

#define threshold (1<<20)  /* bytes read since open/lseek before bypassing the cache */
#define large_io (1<<16)   /* smallest request size that is read directly */

read_data(params) {
 if (last_read_bytes_after_last_open_or_lseek > threshold) {
  if (io_size_requested >= large_io)
   do_direct_read_to_userspace(...);
  else
   do_normal_read_but_mark_bufferpage_as_first_to_reuse(...);
 } else
  do_normal_buffered_read(...);
}
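
the decision in the pseudocode above can be written as a small user-space C function. this is only a sketch: the enum and function names are hypothetical, not NetBSD kernel interfaces, and it models the choice of read path, not the i/o itself.

```c
#include <stddef.h>

#define THRESHOLD ((size_t)1 << 20) /* bytes read since open/lseek before bypassing the cache */
#define LARGE_IO  ((size_t)1 << 16) /* smallest request size that is read directly */

/* hypothetical names for the three paths in the pseudocode */
enum read_path {
	READ_BUFFERED,             /* normal buffered read */
	READ_BUFFERED_REUSE_FIRST, /* buffered, but the page is first in line for reuse */
	READ_DIRECT                /* straight into the application's memory */
};

/* pick a read path from the bytes read since the last open/lseek and the request size */
enum read_path
choose_read_path(size_t bytes_since_seek, size_t io_size)
{
	if (bytes_since_seek > THRESHOLD)
		return io_size >= LARGE_IO ? READ_DIRECT : READ_BUFFERED_REUSE_FIRST;
	return READ_BUFFERED;
}
```

for example, choose_read_path(9 * 128 * 1024, 128 * 1024) returns READ_DIRECT, while the same 128k request right after an open or lseek returns READ_BUFFERED.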

at least threshold should be configurable: through config, or (better) in
the ffs data, or (best, IMHO) as an additional field in the disklabel, since
it is disk dependent.
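
one way to picture these three places is a lookup that prefers the most specific setting. a minimal sketch in C, with loud assumptions: the struct, its field names, the fallback order, and the convention that 0 means "not set" are all made up for illustration. no such disklabel or ffs fields exist.

```c
#include <stddef.h>

/* compiled-in default, standing in for a value set "through config" */
#define DEFAULT_DIRECT_THRESHOLD ((size_t)1 << 20)

/* hypothetical tunables; 0 means "not set" at that level */
struct direct_io_tunables {
	size_t disklabel_threshold; /* per-disk value (assumed new disklabel field) */
	size_t ffs_threshold;       /* per-filesystem value (assumed ffs field) */
};

/* return the most specific threshold that is set: disklabel, then ffs, then the default */
size_t
direct_io_threshold(const struct direct_io_tunables *t)
{
	if (t->disklabel_threshold != 0)
		return t->disklabel_threshold;
	if (t->ffs_threshold != 0)
		return t->ffs_threshold;
	return DEFAULT_DIRECT_THRESHOLD;
}
```

with both fields left at 0 the function falls back to the compiled-in 1MB default, so a disk without the extra field behaves as in the fixed-constant version.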