Subject: possible UFS filesystem problem
To: None <port-alpha@netbsd.org>
From: Paul H. Anderson <pha@pdq.com>
List: port-alpha
Date: 04/15/1999 11:19:07

I have 8 alphas behind an alpha gateway machine acting as a cluster
computer.

The system as a whole is an experimental NIH-funded supercomputer cluster
designed to provide interactive access to very large social science
datasets, such as the US Census Bureau data.  If we can make this work, we
want to expand the system size by a factor of four to eight.  I want to use
alphas because the problem space is heavily constrained by available VM
space, and 32-bit address spaces just don't cut it anymore.

I installed snapshot 19990122 on each of them, on an internal IDE disk.
The file /etc/shells lives in a UFS filesystem on that IDE disk.  There is
a pair of NFS-mounted filesystems, but they do not appear to be related to
any faults I've seen so far.  None of the cluster machines exports an NFS
filesystem.

Among the configuration changes I made after getting them running were
installing /usr/local/bin/tcsh and updating /etc/shells by doing:
echo "/usr/local/bin/tcsh" >>/etc/shells
on each one as root.

Yesterday, after what appeared to be a normal reboot on one of them, the
string "/usr/local/bin/tcsh\n" (minus the quote marks) had been replaced by
NUL characters ('\0').  This caused inbound ftp to fail, which made the
problem easy to spot, since I was testing ftp performance failures at the
time.
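
A sketch of how the other machines could be checked for the same damage,
in Python.  The function names find_nul_runs and nul_runs_in_file are only
illustrative; the first reports each run of consecutive NUL bytes in a
buffer, and the second applies it to a file read in binary mode.

```python
def find_nul_runs(data):
    """Return a list of (offset, length) for each run of NUL bytes in data."""
    runs = []
    start = None
    for i, b in enumerate(data):
        if b == 0:
            if start is None:
                start = i          # a new run of NULs begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:          # the data ended inside a run
        runs.append((start, len(data) - start))
    return runs


def nul_runs_in_file(path):
    """Read path as raw bytes and report any NUL runs found in it."""
    with open(path, "rb") as f:
        return find_nul_runs(f.read())
```

If the tcsh line was replaced as described, nul_runs_in_file("/etc/shells")
would report a single run of 20 bytes, since "/usr/local/bin/tcsh" plus its
newline is 20 characters.  An empty list means no NUL bytes were found.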

This suggests to me that the mechanism for flushing dirty buffer pool pages
to the IDE disk has a problem.

The buffer pool is reported as being:
real mem = 1073741824 (1941504 reserved for PROM, 1071800320 used by NetBSD)
avail mem = 936312832
using 13083 buffers containing 107175936 bytes of memory

So the buffer pool is somewhat larger than on most machines.
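
A quick check of the arithmetic in those boot messages, as a Python sketch.
The variable names are mine; the numbers are copied from the lines above.
The buffers work out to 8192 bytes each, and the buffer pool comes to just
under 10% of real memory.

```python
# Figures copied from the boot messages quoted above.
real_mem = 1073741824        # bytes, exactly 1 GiB
prom_reserved = 1941504      # bytes reserved for PROM
used_by_netbsd = 1071800320  # bytes used by NetBSD
buffers = 13083              # number of buffers
buffer_bytes = 107175936     # bytes of memory in those buffers

# The two parts of "real mem" add up to the total.
parts_sum = prom_reserved + used_by_netbsd

# Size of each buffer, and the share of real memory given to the pool.
bytes_per_buffer = buffer_bytes // buffers
pool_fraction = buffer_bytes / real_mem

print("bytes per buffer:", bytes_per_buffer)
print("pool as fraction of real memory: %.4f" % pool_fraction)
```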


My question is this: has anyone seen this failure mode before?  So far
I've only seen it once, but it concerns me a lot, as I need pretty strong
guarantees of filesystem coherency.

By strong guarantees I mean that local file services must be 100% robust
under all load conditions short of a kernel panic or hardware failure.
The situation described above does not meet that standard.

Suggestions on how to reproduce this, what to log, and how to report the
problem would be welcome.  Thanks!
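
One possible starting point for repeatability, sketched in Python under
some loud assumptions: the helper names (append_line_durably, verify_lines,
demo) and the scratch file name are hypothetical, and the fsync call is
something the original echo command did not do.  The idea is to append
known lines to a test file on the local disk, then compare the file against
the expected lines later, for example after a reboot.

```python
import os
import tempfile


def append_line_durably(path, line):
    """Append one line, then ask the kernel to write it to disk."""
    with open(path, "ab") as f:
        f.write(line.encode("ascii") + b"\n")
        f.flush()                # push Python's buffer to the kernel
        os.fsync(f.fileno())     # ask the kernel to flush to the disk


def verify_lines(path, expected_lines):
    """Return (index, expected, actual) for every line that differs."""
    with open(path, "rb") as f:
        actual = f.read().split(b"\n")
    if actual and actual[-1] == b"":
        actual.pop()             # drop the empty piece after the last newline
    problems = []
    for i, want in enumerate(expected_lines):
        got = actual[i] if i < len(actual) else None
        if got != want.encode("ascii"):
            problems.append((i, want, got))
    return problems


def demo():
    """Run one write/verify cycle in a scratch directory."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "shells.test")  # hypothetical file name
        lines = ["/bin/sh", "/bin/csh", "/usr/local/bin/tcsh"]
        for line in lines:
            append_line_durably(path, line)
        return verify_lines(path, lines)
```

demo() only exercises the helpers within a single run.  To test the
reported failure, the write step and the verify step would have to be run
on either side of a reboot, against a file on the UFS filesystem in
question.  Running the write step both with and without the fsync call
might also help show whether delayed writes are involved.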

Paul

+------------------------------------------------------+
| Paul Anderson           Public Data Queries, Inc.    |
| pha@pdq.com             734-213-4964(W) 994-3734(H)  |
+------------------------------------------------------+