Subject: re: NVRAM for NFS
To: Aaron J. Grier <agrier@poofy.goof.com>
From: Jonathan Stone <jonathan@cm-24-142-94-15.cableco-op.ispchannel.com>
List: current-users
Date: 12/28/1999 12:28:45
>(this started on port-i386, but it seemed more appropriate to followup
>to current-users since building your own NetApp could be accomplished on
>multiple platforms...)

Sure.  cc'ed back to the original respondent, just so they know.


>On Mon, Dec 27, 1999 at 11:27:20AM -0800, jonathan@dsg.stanford.edu wrote:

>> The mid-80s advice used to be, get an NVRAM board so the servers can
>> log writes to NVRAM and reply, rather than waiting till NFS write
>> requests go all the way out to disk before replying. I have no idea if
>> the NetApps do that, but if they do, then write latency will be hard
>> to beat.

>This is _exactly_ what they do.  NetApps run (ran) RAID4 [1] and since
>they have NVRAM, can defer writes until they have a full stripe, thus
>avoiding the RAID4 "bang on the single parity disk" problem.  Couple
>that with a log-based filesystem, and they can suck down quite a bit of
>NFS traffic.  NetBSD already has RAIDFrame and LFS.  :)

>The thing that seems to be missing is NVRAM hardware support.  Last I
>asked about this on port-pmax two years ago, I was told that the NFS
>code would have to be significantly restructured to support DEC-style
>PrestoServe hardware.  So I assume the i386-specific advice offered by
>Jonathan is referring to NVRAM directly on the controller card?  Or are
>PrestoServe cards supported under -current?  I'm sure sun (and other
>platforms) have non-controller-connected NVRAM hardware...  this seems
>like something we could leverage across multiple platforms.

There are a couple of distinct issues here.

First (and relevant only to pmax and some VAX users) is that the pmax
PrestoServe NVRAM boards plug directly into the host memory bus,
rather than into an I/O bus the way Sun SBus NVRAM cards (and later DEC
TC NVRAM cards for the TC Alphas) did.  That can affect both throughput
to the NVRAM and how the driver is structured.  (IOW, we could use two
sorts of NVRAM drivers: one for main-memory NVRAM which just does a
bcopy(), and one that uses bus_space to move the data to outboard
NVRAM.  Or one driver with two attachments.  You get the picture.)
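
To make that concrete, here's a rough sketch of the shared write path.
This is not real NetBSD driver code; the nvsoftc layout and nv_write()
are names I've just made up to show the shape of the thing:

/*
 * Rough sketch only, not real NetBSD driver code: the softc layout and
 * nv_write() are invented.  The point is just that one driver can hide
 * the memory-bus vs. I/O-bus difference behind a single write hook,
 * with the two attachments filling in the fields differently.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>	/* memcpy() standing in for the kernel's bcopy() */

struct nvsoftc {
	/* memory-bus attachment (pmax PrestoServe): NVRAM is mapped */
	uint8_t	*sc_vaddr;	/* mapped NVRAM, or NULL for the I/O-bus case */

	/* I/O-bus attachment (SBus/TC cards): go through an accessor,
	   i.e. whatever bus_space hands you at attach time */
	void	(*sc_out)(void *, size_t, const void *, size_t);
	void	*sc_cookie;
};

/* common write path shared by both attachments */
void
nv_write(struct nvsoftc *sc, size_t off, const void *buf, size_t len)
{
	if (sc->sc_vaddr != NULL)
		memcpy(sc->sc_vaddr + off, buf, len);	     /* main-memory NVRAM */
	else
		(*sc->sc_out)(sc->sc_cookie, off, buf, len); /* outboard NVRAM */
}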


Second is that the DEC-style NVRAM drivers are basically a shim
between filesystem code and the underlying disk drivers.  That means
NVRAM benefits both filesystems and applications like DBMSes (Oracle,
Informix, take your pick) which bypass the filesystem and do writes
directly to raw disk devices. Simon Burge can testify to the benefits
of that approach. The downside is twofold: first, it burns twice as
many device numbers for the disks: one for the NVRAM-write-buffered
disk, and one for the `real' disk. Second, all disk drivers have to be
educated to look in the NVRAM for dirty-in-NVRAM blocks when doing
I/O to the non-NVRAM-buffered device. Otherwise you get consistency
errors between the disk and NVRAM.
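
Here's a toy model of the shim idea, mostly to show where that
consistency problem comes from.  None of these names (nvlog,
shim_write(), disk_write()) are real; an actual driver would do this
per-disk, asynchronously, and with a proper index rather than a modulo
hash:

/*
 * Toy model of the shim, not DEC's PrestoServe code: writes land in an
 * NVRAM log and can be acknowledged immediately; any I/O path that goes
 * straight at the disk must first check the log for a dirty copy of the
 * block.  All names here are invented.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLKSIZE   512
#define LOGSLOTS  1024

struct nvslot {
	bool     dirty;
	uint64_t blkno;
	uint8_t  data[BLKSIZE];
};

static struct nvslot nvlog[LOGSLOTS];	/* pretend this array lives in NVRAM */

/* stand-ins for the real disk driver's entry points */
static void disk_write(uint64_t blkno, const void *buf) { (void)blkno; (void)buf; }
static void disk_read(uint64_t blkno, void *buf) { (void)blkno; memset(buf, 0, BLKSIZE); }

/* buffered path: log the write, "reply" without touching the disk */
void
shim_write(uint64_t blkno, const void *buf)
{
	struct nvslot *s = &nvlog[blkno % LOGSLOTS];

	if (s->dirty && s->blkno != blkno)
		disk_write(s->blkno, s->data);	/* evict an older dirty block */
	s->blkno = blkno;
	memcpy(s->data, buf, BLKSIZE);
	s->dirty = true;			/* caller can be acked now */
}

/* any read -- even one via the `real' device -- has to look here first,
   which is exactly the change every disk driver would need */
void
shim_read(uint64_t blkno, void *buf)
{
	struct nvslot *s = &nvlog[blkno % LOGSLOTS];

	if (s->dirty && s->blkno == blkno)
		memcpy(buf, s->data, BLKSIZE);
	else
		disk_read(blkno, buf);
}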


The other approach is to put NVRAM smarts into the VFS layer (sorta
like softdeps).  That means you don't have to change all the disk
drivers, but it does mean that NVRAM only benefits VFS clients, not
databases or hairy humungous data-logging applications.
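
And the VFS-layer version in the same hand-waving style -- both helper
functions below are placeholders I've invented, not anything that
exists in the tree:

/*
 * Sketch of the VFS-layer alternative: the NVRAM hook sits above the
 * disk drivers, in the path every filesystem write takes, so no driver
 * changes are needed -- but anything writing to the raw device never
 * passes through it.
 */
#include <stddef.h>
#include <stdint.h>

/* placeholder: stash the data in NVRAM and remember which block it is */
static void nvram_log_write(uint64_t blkno, const void *buf, size_t len)
{ (void)blkno; (void)buf; (void)len; }

/* placeholder: the normal filesystem/driver write path */
static void fs_write_block(uint64_t blkno, const void *buf, size_t len)
{ (void)blkno; (void)buf; (void)len; }

/* roughly what a VFS-level hook boils down to */
void
vfs_stable_write(uint64_t blkno, const void *buf, size_t len)
{
	nvram_log_write(blkno, buf, len); /* data is now stable; NFS can reply */
	fs_write_block(blkno, buf, len);  /* push it to disk whenever convenient */
}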

I think cgd gave a good summary of the issues a year or two back. It
started on port-pmax and wound into tech-kern.  I don't recall if we
ever reached a conclusion about the `right' way to do this.
Check the archives; they're more accurate than my memory.

Followups on kernel design tradeoffs for nvram should probably go to
tech-kern as well.