Subject: bin/30816: dump(8) broken for larger values of blocking
To: None <gnats-admin@netbsd.org, netbsd-bugs@netbsd.org>
From: None <blymn@baea.com.au>
List: netbsd-bugs
Date: 07/23/2005 14:21:01
>Number:         30816
>Category:       bin
>Synopsis:       large blocking factors cannot be used with dump
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jul 23 14:21:01 +0000 2005
>Originator:     Brett Lymn (Master of the Siren)
>Release:        NetBSD 3.99.6
>Organization:
Brett Lymn
>Environment:
System: NetBSD siren 3.99.6 NetBSD 3.99.6 (SIREN.ACPI.MP) #10: Sun Jul 17 19:29:12 CST 2005 toor@siren:/usr/src/sys/arch/amd64/compile/SIREN.ACPI.MP amd64
Architecture: x86_64
Machine: amd64
>Description:
	The b option of dump(8) accepts a value between 1 and 1000
according to the usage message from dump.  If a blocksize above about
200 is used, dump misbehaves, either looping indefinitely or quitting
with "master/slave protocol botched" while pass III is running.  The
larger b is, the more likely the "master/slave protocol botched"
message becomes.  Values near 256 result in a hang caused by an
infinite loop in tape.c:doslave(): for some reason p->count is zero,
so the first for loop in doslave() never terminates.
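
	To illustrate the hang described above, here is a minimal sketch of a
loop that advances by each request's count.  It is not the code from tape.c:
struct req_sketch, walk_requests() and the max_steps cap are hypothetical
names invented for this example, and the cap exists only so that the sketch
returns instead of spinning forever.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one request slot; not the real dump(8) struct. */
struct req_sketch {
	int count;	/* number of tape records covered by this request */
};

/*
 * Walk the request array, advancing by the count of each request.
 * Returns the number of iterations taken, or -1 if max_steps iterations
 * pass without reaching ntrec.  The -1 case stands in for the observed
 * infinite loop: a count of zero means trecno never advances.
 */
int
walk_requests(const struct req_sketch *req, int ntrec, int max_steps)
{
	int steps = 0;

	assert(req != NULL && ntrec > 0);
	for (int trecno = 0; trecno < ntrec; trecno += req[trecno].count) {
		if (++steps > max_steps)
			return -1;	/* no forward progress */
	}
	return steps;
}
```

With counts of {2, 0, 1} and ntrec of 3 the walk visits slots 0 and 2 and
finishes.  With counts of {2, 0, 0} it reaches slot 2, finds a count of zero,
and never moves past it.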

>How-To-Repeat:
	I was dumping a 40GB partition to a DLT40 tape drive using a
blocksize of 512.  Dump hung during pass III.  The machine was up
multi-user, but the filesystem in question does fsck clean (i.e. this
problem is not due to attempting to back up a corrupt fs).

>Fix:
	The problem can be worked around by using a lower blocking factor at
the expense of the tape drive not streaming; a blocksize of 128 appears to
work reliably.  I had a look at the code, and the only place where the
request count can become zero is tape.c:flushtape(), where it is
deliberately zeroed with a comment of "Sentinel" next to the statement.
This "sentinel" state does not seem to be checked anywhere in the code.
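
	As a rough idea of what checking for the sentinel could look like,
the sketch below scans a request list and reports the first slot whose count
would stall a count-driven walk.  This is an assumption about one possible
check, not a tested fix for dump; struct req_sketch and first_bad_request()
are hypothetical names and do not appear in tape.c.

```c
#include <stddef.h>

/* Hypothetical stand-in for one request slot; not the real dump(8) struct. */
struct req_sketch {
	int count;	/* number of tape records covered by this request */
};

/*
 * Follow the request list the same way a count-driven loop would.
 * Returns the index of the first slot reached whose count is zero or
 * negative (the "sentinel" case), or -1 if the walk reaches ntrec cleanly.
 */
int
first_bad_request(const struct req_sketch *req, int ntrec)
{
	int trecno = 0;

	if (req == NULL)
		return -1;
	while (trecno < ntrec) {
		if (req[trecno].count <= 0)
			return trecno;	/* sentinel reached before ntrec */
		trecno += req[trecno].count;
	}
	return -1;
}
```

A caller could use the returned index to stop or report an error instead of
looping on a zero count.  Whether stopping at the sentinel is the behaviour
the original author intended is not something I have confirmed.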