Subject: bin/24453: dump(8) hangs
To: None <gnats-bugs@gnats.NetBSD.org>
From: None <apb@cequrux.com>
List: netbsd-bugs
Date: 02/17/2004 12:18:02
>Number:         24453
>Category:       bin
>Synopsis:       dump(8) hangs
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Feb 17 10:20:00 UTC 2004
>Closed-Date:
>Last-Modified:
>Originator:     Alan Barrett
>Release:        NetBSD 1.6ZK
>Organization:
Not much
>Environment:
Architecture: i386
Machine: i386
NetBSD-1.6ZK, kernel built from sources checked out from CVS on 2004-02-15,
userland built from sources about a week older than that.
>Description:
dump(8) often just hangs.  I have marked this PR critical/high
because it is preventing backups from being made.

>How-To-Repeat:
The problem is intermittent; small file systems seem to dump fine most
of the time, while large file systems seem to exhibit the problem much
more often.

Here's an example of a failure.

/dev/raid0i is an FFSv1+softdep file system on a RAID1 disk.  I have not
run "fsck_ffs -c 4" on this file system.  Here is some of the output from
dumpfs:

    file system: /dev/rraid0i
    endian  little-endian
    magic   11954 (UFS1)    time    Tue Feb 17 10:40:42 2004
    cylgrp  dynamic inodes  4.4BSD  sblock  FFSv1   fslevel 3
    nbfree  98352   ndir    119718  nifree  281416  nffree  112751
    ncg     51      ncyl    16384   size    4194304 blocks  4128607
    bsize   16384   shift   14      mask    0xffffc000
    fsize   2048    shift   11      mask    0xfffff800
    frag    8       shift   3       fsbtodb 2
    cpg     324     bpg     10368   fpg     82944   ipg     20352
    minfree 5%      optim   space   maxcontig 4     maxbpg  4096
    rotdelay 0ms    headswitch 0us  trackseek 0us   rps     60
    ntrak   8       nsect   128     npsect  128     spc     1024
    symlinklen 60   trackskew 0     interleave 1    contigsumsize 4
    maxfilesize 0x000400400402ffff
    nindir  4096    inopb   128     nspf    4
    avgfilesize 16384       avgfpdir 64
    sblkno  8       cblkno  16      iblkno  24      dblkno  1296
    sbsize  2048    cgsize  16384   offset  32      mask    0xfffffff8
    csaddr  1296    cssize  2048    shift   10      mask    0xfffffc00
    cgrotor 0       fmod    0       ronly   0       clean   0x02
    flags   soft-updates
    fsmnt   /r0i

When I run "dump 0ushf 1048576 0 - /dev/rraid0i | dd bs=1m progress=1
of=/dev/null", I get the following output within the first 10 minutes
(interspersed with the dots from "dd progress=1", not shown).

  DUMP: Found /dev/rraid0i on /r0i in /etc/fstab
  DUMP: Date of this level 0 dump: Tue Feb 17 10:07:30 2004
  DUMP: Date of last level 0 dump: the epoch
  DUMP: Dumping /dev/rraid0i (/r0i) to standard output
  DUMP: Label: none
  DUMP: mapping (Pass I) [regular files]
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 6566086 tape blocks.
  DUMP: Volume 1 started at: Tue Feb 17 10:07:41 2004
  DUMP: dumping (Pass III) [directories]
  DUMP: 4.73% done, finished in 1:40
  DUMP: dumping (Pass IV) [regular files]
  DUMP: 18.61% done, finished in 0:43

About two minutes after the message about "18.61% done", the dots from
"dd progress=1" stop being printed.  Ten minutes later, "dd" still
hasn't printed another dot.  At this point, pressing control-T causes dd
and dump to report the following:

  0+145365 records in
  0+145365 records out
  1488537600 bytes transferred in 1413.092 secs (1053390 bytes/sec)
  22.14% done at 1036 KB/s, finished in 1:22

Also around this time, "pstree -ap | grep dump" shows the following:

  |           |       `-dump,9699) 0ushf 1048576 0 - /dev/rraid0i
  |           |           `-dump,8577) 0ushf 1048576 0 - /dev/rraid0i
  |           |               |-dump,21473) 0ushf 1048576 0 - /dev/rraid0i
  |           |               |-dump,4474) 0ushf 1048576 0 - /dev/rraid0i
  |           |               `-dump,19390) 0ushf 1048576 0 - /dev/rraid0i

and "ps -axlww | grep dump | grep -v grep" shows the following:

   0  4474  8577  27  18  0   672      4 pause    SW+  pd  0:45.36 dump 0ushf 1048576 0 - /dev/rraid0i
   0  8577  9699   0   2  0   728      4 netio    SW+  pd  0:15.62 dump 0ushf 1048576 0 - /dev/rraid0i
   0  9699 14660  26  10  0   672      4 wait     SW+  pd  0:03.09 dump 0ushf 1048576 0 - /dev/rraid0i
   0 19390  8577  30  18  0   672      4 pause    SW+  pd  0:44.49 dump 0ushf 1048576 0 - /dev/rraid0i
   0 21473  8577  23  18  0   672      4 pause    SW+  pd  0:45.03 dump 0ushf 1048576 0 - /dev/rraid0i

I tried to use "ktruss -d -i -p 9699" to see what was happening, but
ktruss printed nothing for more than an hour.  Eventually, I pressed
control-T again, and got the following output from ktruss:

    SIGINFO caught handler=0x804d974 mask=0x0 code=0x0
      8577 dump     gettimeofday(0xbfbfee18, 0)        = 0
      8577 dump     emul(netbsd)
      8577 dump     write(0x2, 0xbfbfee80, 0x2a)       = 42
	   "22.14% done at 229 KB/s, finished in 6:11\n"
      8577 dump     compat_16___sigreturn14(0xbfbfef2c) JUSTRETURN

along with the following from dd and dump:

    0+145365 records in
    0+145365 records out
    1488537600 bytes transferred in 6348.986 secs (234452 bytes/sec)
    22.14% done at 229 KB/s, finished in 6:11

>Fix:
Not known.
>Release-Note:
>Audit-Trail:
>Unformatted: