kern/47030: Installation stalls during set extraction

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/47030: Installation stalls during set extraction
From: gson%gson.org@localhost (Andreas Gustafsson)
Date: Wed, 3 Oct 2012 16:20:00 +0000 (UTC)

>Number:         47030
>Category:       kern
>Synopsis:       Installation stalls during set extraction
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Oct 03 16:20:00 +0000 2012
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current
>Organization:
>Environment:
System: NetBSD
Architecture: sparc
Machine: sparc
>Description:

When doing automated NetBSD installs in a qemu VM using anita
(pkgsrc: misc/py-anita), the installation sometimes fails because the 
set extraction never completes.  When this happens, the sysinst progress
bar displays the message "- stalled -" and stays that way indefinitely:

     Status: Running
    Command: progress -zf /mnt2//sparc/binary/sets/base.tgz tar --chroot -xhepf

-
--------------------------------------------------------------------------------
  9% |***                                |  8552 KiB    5.06 KiB/s  - stalled -

For example, when qemu was upgraded from 0.12 to 0.15 on the TNF
automated testbed babylon5.netbsd.org, this problem started hitting
every SPARC install run.  Increasing the amount of memory in the
emulated SPARC system from 32M to 64M made the problem go away.

On an earlier occasion involving an i386 VM, I managed to attach to
the VM with gdb for remote kernel debugging and found that the "tar"
process extracting the sets was stuck waiting for a buffer:

  (gdb) where
  #0  0xc064e1ff in cv_timedwait (cv=0xc537a560, mtx=0x0, timo=-985761876) at 
/bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/lwp.h:377
  #1  0xc0681c86 in sleepq_block (timo=0, catch=false) at 
/bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/lwp.h:377
  #2  0xc064e276 in cv_timedwait (cv=0xc13b5120, mtx=0xc13f73fc, timo=0) at 
/bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/lwp.h:377
  #3  0xc0a83dc3 in getnewbuf (slpflag=0, slptimeo=0, from_bufq=0) at 
/bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/wapbl.h:209
  #4  0xc0a8366e in getblk (vp=0xc54e9380, blkno=0, size=2048, slpflag=0, 
slptimeo=0) at 
/bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/wapbl.h:209
  #5  0xc0387f35 in ffs_getblk (vp=0xc54e9380, lblkno=0, blkno=3389092, 
size=2048, clearbuf=true, bpp=0xc53e7ad0) at 
/bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/ufs/ffs/ffs_subr.c:122
  #6  0xc037d57f in ffs_balloc_ufs2 (vp=0xc54e9380, off=0, size=512, 
cred=0xc460bf00, flags=1, bpp=0xc53e7ad0) at 
/bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/ufs/ffs/ffs_balloc.c:92
  #7  0xc037b7e8 in ffs_balloc (vp=0xc54e9380, off=0, size=512, 
cred=0xc460bf00, flags=1, bpp=0xc53e7ad0) at 
/bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/ufs/ffs/ffs_balloc.c:88
  #8  0xc09f8297 in ufs_mkdir (v=0xc53e7b58) at 
/bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/vnode.h:391
  #9  0xc0abbad9 in VOP_MKDIR (dvp=0xc54c78fc, vpp=0xc53e7c34, cnp=0xc53e7c48, 
vap=0xc53e7b9c) at 
/bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/kern/vnode_if.c:86
  #10 0xc0a98cff in do_sys_mkdir (path=0x8235d30 <Address 0x8235d30 out of 
bounds>, mode=493, seg=UIO_USERSPACE) at 
/bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/kern/vfs_syscalls.c:204
  #11 0xc0a98bb0 in sys_mkdir (l=dwarf2_read_address: Corrupted DWARF 
expression.
  ) at 
/bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/kern/vfs_syscalls.c:204
  #12 0xc09614f7 in sy_call (sy=0xc0e978c0, l=0xc537a560, uap=0xc53e7cf0, 
rval=0xc53e7d18) at 
/bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/syscallvar.h:61
  #13 0xc0961872 in syscall (frame=0xc53e7d48) at 
/bracket/i386-debug/work/2011.12.02.18.07.26/src/sys/sys/userret.h:84
  #14 0xc01005d6 in syscall1 ()

One one occasion when I had interactive terminal access to the sysinst
session, I was able to recover from the stall by suspending sysinst
with control-Z, maybe running some random shell commands, and then
continuing the the stopped sysinst job.

The following factors all seem to make it more likely for the problem to occur:

 - Extracting a large install set, such as "base"
 - Installing a release built with debug symbols (which makes the sets even 
larger)
 - A limited amount of memory in the qemu guest (e.g., 32M rather than 64M)
 - Installing on sparc rather than i386
 - Using qemu 0.15 rather than 0.12

My best guess as to why this happens is that the kernel somehow gets
into a deadlocked state where all the disk buffers are simultaneously
locked, leaving none available for allocation; this might be triggered
by some combination of a small memory size, lots of new files being
created in a short amount of time, changes in disk caching policy
between qemu 0.12 and 0.15, and a lack of other system activity, a
combination that is unlikely to occur during normal system operation
but likely during installation into a small-memory VM.

>How-To-Repeat:

The following reliably reproduces the problem for me; your mileage may vary.

  - Install qemu-0.15.1nb5 from the emulators/qemu0 package (not qemu-1
    from emulators/qemu, because it has unrelated bugs that may prevent
    you from reaching the set extraction stage)

  - Install anita-1.29 from the misc/py-anita package

  - Run the command

      anita --memory-size 32M install 
http://ftp.netbsd.org/pub/NetBSD/iso/6.0_RC2/NetBSD-6.0_RC2-sparc.iso

>Fix:

Prev by Date: Re: install/47021: 6.0 RC2 sysinst not find any hard disks on Powermac G3 B&W
Next by Date: kern/47031: bwi(4) firmware not obtainable
Previous by Thread: kern/47029: We need shm_open() and shm_unlink()
Next by Thread: Re: kern/47030: Installation stalls during set extraction
Indexes:

Home | Main Index | Thread Index | Old Index