Subject: bad interactions between FS, UVM, ccd(4), and isp(4)
To: None <tech-kern@netbsd.org>
From: Erik E. Fair <fair@netbsd.org>
List: tech-kern
Date: 04/08/2004 01:42:35
I've spent the last 36 hours fighting limitations in NetBSD 1.6.2. The
problems I've had indicate deeper trouble in NetBSD's design that we need to
deal with.

I'm half-way through a reconfiguration of a large filesystem (103 GB). In
theory, this is simple: dump the FS, newfs the partition as desired, restore
the FS.

The system is a DigitalServer 3305 (nee AlphaServer 800 5/500), with 512MB
RAM, running NetBSD/alpha 1.6.2. The disks are 1x IBM DGHS 9 GB disk (root,
/usr, /var, swap), and 3x IBM DRHS 36 GB ccd'd into one big FS (/n) that is
served up with NFS. These disks are all attached to an on-board Qlogic 1040
UltraSCSI controller, on a 16-bit (wide) SCSI bus.

I added a Qlogic 1280 dual channel Ultra-2 LVD SCSI controller to the system,
and attached a 4x IBM DRHS 36 GB LVD SCSI box to that. I configured it as a
larger ccd(4), to hold the dump file from the other FS while I reconfigured
it. Everything is FFS.

I brought the system down from multi-user to single user mode after about
three weeks of operation as an NFS server. I mounted /n (my 103 GB FS)
read-only, noted that mount still said it was "NFS exported", so I unmounted
it and started dumping it from the raw device to a file on the 4x ccd(4).
After two and a half hours, the dump stalled, waiting on "netio", but still
responding to TTY signals like ^T (SIGINFO) and ^C (SIGINT).

Examination of the system state with top and systat indicated that I'd run
out of RAM for caching file data, and none was being released. UVM trouble?
I aborted the dump, and RAM utilization returned to normal. This was some kind
of interruptible deadlock. Very nasty: the implication is that a full dump of
an FS that big is impossible on a system with only 512MB of RAM.

I rebooted the system into single user mode to clean out any question of
"funny" state lying around, and restarted the dump. This time, it worked.
Just to double check that I'd gotten a dump file with no media errors in it,
I dd'd the dump file to /dev/null. No problems.

	# dump 0u -b 128 -B 134537184 -f /backups/n.dump /dev/rccd0h
	  DUMP: Found /dev/rccd0h on /n in /etc/fstab
	  DUMP: Date of this level 0 dump: Tue Apr  6 22:53:26 2004
	  DUMP: Date of last level 0 dump: the epoch
	  DUMP: Dumping /dev/rccd0h (/n) to /backups/n.dump
	  DUMP: Label: none
	  DUMP: mapping (Pass I) [regular files]
	  DUMP: mapping (Pass II) [directories]
	  DUMP: estimated 96540817 tape blocks on 0.72 tape(s).
	  DUMP: Volume 1 started at: Tue Apr  6 22:55:11 2004
	  DUMP: dumping (Pass III) [directories]
	  DUMP: dumping (Pass IV) [regular files]
[dump progress reports elided -eef]
	  DUMP: 97520575 tape blocks on 1 volume
	  DUMP: Volume 1 completed at: Wed Apr  7 02:16:55 2004
	  DUMP: Volume 1 took 3:21:44
	  DUMP: Volume 1 transfer rate: 8056 KB/s
	  DUMP: Date of this level 0 dump: Tue Apr  6 22:53:26 2004
	  DUMP: Date this dump completed:  Wed Apr  7 02:16:55 2004
	  DUMP: Average transfer rate: 8056 KB/s
	  DUMP: level 0 dump on Tue Apr  6 22:53:26 2004
	  DUMP: Closing /backups/n.dump
	  DUMP: DUMP IS DONE
	404.0u 3893.0s 3:23:29.38 35.1% 0+0k 65+2339io 36pf+1w
	# cd /backups
	# ls -las
	total 97526601
		8 drwxr-xr-x   2 root  wheel          512 Apr  6 22:55 .
		1 drwxr-xr-x  21 root  wheel          512 Mar 17 02:22 ..
	 97526592 -rw-r--r--   1 root  wheel  99861004288 Apr  7 02:16 n.dump
	# dd if=n.dump of=/dev/null bs=1m
	95234+1 records in
	95234+1 records out
	99861004288 bytes transferred in 2839.948 secs (35162969 bytes/sec)
	2.5u 2575.8s 47:20.00 90.7% 0+0k 1+1io 11pf+0w

At this point, I felt safe enough to newfs the 3x ccd. I also remounted the
4x ccd holding the dump file read-only (granted, that wouldn't have saved me
if I'd fat-fingered ccd1 when I meant ccd0 in the newfs, but ...). The object
of the newfs was to change the block/frag sizes to 65536/8192 and to cut down
on the number of inodes (I'd used only 11% of them even with the FS nearly
full).

With that done, it's time to restore. This is where the real trouble begins.

1st attempt: mount /n with async,noatime and start restore. I do this because
I use the MH mail UA, and there are bazillions of small files in amongst the
big files in that 99 GB backup file, and I'd rather not hit the disk for every
metadata update that this restore will entail. I want to get it done sooner
rather than later. The other reason I believe I can get away with this is
that everything is plugged into a 5kVA UPS.

Alas, restore quits after about 30 minutes because it can't allocate enough
RAM for some internal structure. The restore program should unlimit itself
(a sketch of what I mean follows the transcript below).

	# restore rf /backups/n.dump

	no space for string table
	abort? [yn] abort? [yn] y
	dump core? [yn] n
	1080.2u 75.2s 30:14.59 63.6% 0+0k 413+252230io 1pf+0w
	# limit
	cputime         unlimited
	filesize        unlimited
	datasize        131072 kbytes
	stacksize       2048 kbytes
	coredumpsize    unlimited
	memoryuse       481840 kbytes
	memorylocked    160613 kbytes
	maxproc         160
	openfiles       64
	# limit datasize unlimited
	# limit memoryuse unlimited
	# limit
	cputime         unlimited
	filesize        unlimited
	datasize        1048576 kbytes
	stacksize       2048 kbytes
	coredumpsize    unlimited
	memoryuse       unlimited
	memorylocked    160613 kbytes
	maxproc         160
	openfiles       64
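
For what it's worth, restore(8) could raise its own limits at startup
instead of making me do it by hand. This is just a sketch of the idea, with
minimal error handling; it is not a patch against our restore:

	/*
	 * Sketch only: raise the data-size and RSS soft limits to their
	 * hard limits so restore's internal tables can grow.
	 */
	#include <sys/resource.h>
	#include <stdio.h>

	static void
	unlimit(int which, const char *name)
	{
		struct rlimit rl;

		if (getrlimit(which, &rl) == -1) {
			perror(name);
			return;
		}
		rl.rlim_cur = rl.rlim_max;	/* raise soft limit to hard limit */
		if (setrlimit(which, &rl) == -1)
			perror(name);
	}

	int
	main(void)
	{
		unlimit(RLIMIT_DATA, "RLIMIT_DATA");
		unlimit(RLIMIT_RSS, "RLIMIT_RSS");
		/* ... the rest of restore's main() would follow ... */
		return 0;
	}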

2nd attempt: unlimit datasize and memoryuse and restart restore.

It runs for almost three hours, and then the isp(4) driver starts complaining
periodically about resource exhaustion (in retrospect, this was actually
SCSI-PI talking to me), without really telling me whether this was a permanent
error (i.e. the I/O never completed) or just a temporary one. After a dozen or
so of these, coming at about one a minute, ccd(4) starts complaining about
errors on its components, so OK, the transfers did abort. I aborted the
restore.

	# mount -o async,noatime /n
	# cd /n
	# restore rf /backups/n.dump
	load: 0.76  cmd: restore 83 [runnable] 64.35u 61.66s 47% 95760k
	load: 2.83  cmd: restore 83 [runnable] 1231.85u 209.03s 10% 155144k
	load: 2.90  cmd: restore 83 [runnable] 1240.47u 286.51s 36% 155144k
	load: 3.00  cmd: restore 83 [biowait] 1458.30u 2110.18s 41% 99936k
	sd3(isp0:0:3:0): adapter resource shortage
	sd2(isp0:0:2:0): adapter resource shortage
	sd3(isp0:0:3:0): adapter resource shortage
	sd1(isp0:0:1:0): adapter resource shortage
	sd3(isp0:0:3:0): unable to allocate scsipi_xfer
	ccd0: error 12 on component 2
	sd3: not queued, error 12
	sd3(isp0:0:3:0): unable to allocate scsipi_xfer
	ccd0: error 12 on component 2
	sd3: not queued, error 12
	sd3(isp0:0:3:0): adapter resource shortage
	sd1(isp0:0:1:0): unable to allocate scsipi_xfer
	ccd0: error 12 on component 0
	sd1: not queued, error 12
	sd1(isp0:0:1:0): unable to allocate scsipi_xfer
	ccd0: error 12 on component 0
	sd1: not queued, error 12
	sd1(isp0:0:1:0): adapter resource shortage
	sd2(isp0:0:2:0): adapter resource shortage
	sd3(isp0:0:3:0): adapter resource shortage
	^Crestore interrupted, continue? [yn] n
	1515.9u 2625.5s 2:42:12.61 42.5% 0+0k 70553+373033io 0pf+1w

At this point, I'd like to direct your attention to scsipi(9), specifically
the description of scsipi_done(), which an HBA driver must call and may call
with XS_RESOURCE_SHORTAGE; the man page says the mid-layer "will retry the
transfer after a short delay". I'd argue that if an HBA returns this code
after all the capability negotiation that goes on between an HBA driver and
SCSI-PI at initialization, something is Seriously Wrong. The SCSI-PI
mid-layer ought not to be overrunning an HBA, Ever.

Secondly, if an HBA does get overrun like this, I'd argue the mid-layer queue
ought to be frozen until the HBA calls the mid-layer back to indicate that its
"resources" have dropped below their high water mark (or have hit their low
water mark), at which point the queue can be thawed. Pounding on the HBA
driver before then seems pointless to me. Roughly, I mean something like the
sketch below.
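
This is only a sketch against scsipi(9) as I read it, not a patch. The
CHAN_RSRC_FROZEN flag and the point at which the HBA driver calls back are
the hypothetical parts; scsipi_channel_freeze() and scsipi_channel_thaw()
are the existing interfaces I'd build on:

	/*
	 * Sketch: when an HBA completes an xfer with XS_RESOURCE_SHORTAGE,
	 * freeze the whole channel instead of re-poking the adapter, and
	 * have the HBA thaw it once its resources have drained.
	 */
	#include <sys/param.h>
	#include <dev/scsipi/scsipiconf.h>

	#define CHAN_RSRC_FROZEN	0x10000	/* hypothetical flag bit */

	/* In the mid-layer completion path (scsipi_done() or thereabouts): */
	void
	midlayer_handle_shortage(struct scsipi_xfer *xs)
	{
		struct scsipi_channel *chan = xs->xs_periph->periph_channel;

		if (xs->error == XS_RESOURCE_SHORTAGE &&
		    (chan->chan_flags & CHAN_RSRC_FROZEN) == 0) {
			chan->chan_flags |= CHAN_RSRC_FROZEN;
			scsipi_channel_freeze(chan, 1);	/* stop feeding the HBA */
		}
		/* requeue the xfer at the head of the channel queue for retry */
	}

	/* In the HBA driver, once its resources drop below a low water mark: */
	void
	hba_resources_available(struct scsipi_channel *chan)
	{
		if (chan->chan_flags & CHAN_RSRC_FROZEN) {
			chan->chan_flags &= ~CHAN_RSRC_FROZEN;
			scsipi_channel_thaw(chan, 1);	/* let the queue run again */
		}
	}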

Now, note what I was doing: I had an FFS mounted async, and I was plowing 99
GB of data into it. My system has 0.5 GB of RAM. Our wonderful new UVM-backed
filesystem buffer cache lets us cache file data until we run out of RAM, but
when we do run out, we send a thundering herd of data down the SCSI channel,
and while we hope the channel will keep up with the flow, the filesystem code
(and/or the buffer cache flush code) should block/sleep/suspend once we've
queued as much as the SCSI-PI layer and the SCSI HBA will allow.

Why does this backpressure appear to fail in the async mount case?
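
For concreteness, the kind of backpressure I mean looks roughly like the
sketch below. None of these names exist in the tree; only tsleep(9),
wakeup(9), and the spl calls are real, and the water marks are made up:

	/*
	 * Hypothetical throttle for the async flush path: count buffers in
	 * flight and sleep past a high water mark, waking at a low water
	 * mark from the I/O completion path.
	 */
	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/proc.h>

	#define FLUSH_HIWAT	256	/* bufs in flight before we sleep */
	#define FLUSH_LOWAT	64	/* wake the flusher below this */

	static int inflight;		/* protected by splbio() here */

	/* Called by the flush code before issuing each async write. */
	static void
	flush_throttle(void)
	{
		int s = splbio();

		while (inflight >= FLUSH_HIWAT)
			(void)tsleep(&inflight, PRIBIO, "flushwt", 0);
		inflight++;
		splx(s);
	}

	/* Called from the I/O completion path (e.g. via biodone()). */
	static void
	flush_done(void)
	{
		int s = splbio();

		if (--inflight <= FLUSH_LOWAT)
			wakeup(&inflight);
		splx(s);
	}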

I've also seen a variant of this problem before on Sun SS20 class machines
while installing NetBSD binary/sets on async mounted FFS *without* ccd(4).
I'm going to try and set up a repeatable test case for that one, since I'm
not going to put my 99 GB "n.dump" file up for FTP.

3rd attempt: newfs /n (again), this time mounting it softdep,noatime. Same
bad result as before, with more scsipi_xfer allocation failures and, to add
insult to injury, a softdep panic.

	# restore rf /backups/n.dump
	load: 1.05  cmd: restore 89 [runnable] 951.37u 74.26s 99% 110624k
	load: 2.88  cmd: restore 89 [runnable] 1296.50u 326.38s 28% 155144k
	sd1(isp0:0:1:0): unable to allocate scsipi_xfer
	ccd0: error 12 on component 0
	sd1: not queued, error 12
	sd1(isp0:0:1:0): unable to allocate scsipi_xfer
	ccd0: error 12 on component 0
	sd1: not queued, error 12
	./nbsd/tmp/i386-mp/syssrc/sys/arch/i386/conf/CARDBUS: cannot create file: Cannot allocate memory
	sd1(isp0:0:1:0): unable to allocate scsipi_xfer
	ccd0: error 12 on component 0
	sd1: not queued, error 12
	sd1(isp0:0:1:0): unable to allocate scsipi_xfer
	ccd0: error 12 on component 0
	sd1: not queued, error 12
	load: 3.21  cmd: restore 89 [runnable] 1527.95u 2347.37s 88% 147480k
	/n: got error 12 while accessing filesystem
	panic: softdep_deallocate_dependencies: unrecovered I/O error
	syncing disks...

4th attempt: newfs /n (again), mounted with just noatime. So far, so good.
I note, however, that restore has no progress reporting whatsoever, and does
not respond to SIGINFO. For a multi-hour restore like this one, a
percentage-done report every 5 to 10 minutes, or on SIGINFO, would be nice;
something like the sketch below.
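
This is just a sketch, not a patch against restore(8); the blks_done and
blks_total counters are hypothetical, and restore would have to maintain
them as it reads the dump file:

	/*
	 * Sketch of a SIGINFO progress report for a long restore.  The
	 * handler only sets a flag; the main read loop does the printing.
	 */
	#include <signal.h>
	#include <stdio.h>

	static volatile sig_atomic_t want_info;
	static unsigned long long blks_done, blks_total;	/* hypothetical */

	static void
	siginfo_handler(int sig)
	{
		want_info = 1;
	}

	static void
	maybe_report(void)
	{
		if (!want_info)
			return;
		want_info = 0;
		fprintf(stderr, "restore: %llu of %llu blocks (%.1f%%)\n",
		    blks_done, blks_total,
		    blks_total ? 100.0 * blks_done / blks_total : 0.0);
	}

	/*
	 * In main(): signal(SIGINFO, siginfo_handler); then call
	 * maybe_report() from the block-reading loop, bumping blks_done
	 * as blocks are consumed.
	 */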

The thought I want to leave all of you with is that our filesystem code has
some poor behavior at the edges of performance, and some of the
non-deterministic memory management that we're doing is exacerbating the
problem. It would appear that standard FFS commit policy is so well behaved
that we never stress the rest of the I/O code below it in normal usage.

Unfortunately for me, that commit policy is pessimal in the case I've
documented above, but when I changed the policy to cache data aggressively
(async), everything went to hell.  There needs to be definite
blocking/backpressure so that a transient resource shortage does not cause
I/O to fail outright, which means we really have to make this code behave
more deterministically.

Overcommit policies are fine when you're absolutely sure that some of what's
asked for won't actually get used (e.g. a few megs of pages in a multi-gigabyte
sparse RAM allocation), but they're deadly in I/O queues.

	Erik <fair@netbsd.org>