kern/42661: Linux-emulated Veritas NetBackup fails to work in 5.0

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/42661: Linux-emulated Veritas NetBackup fails to work in 5.0
From: he%nordu.net@localhost
Date: Fri, 22 Jan 2010 16:25:00 +0000 (UTC)
>Number:         42661
>Category:       kern
>Synopsis:       Linux-emulated Veritas NetBackup fails to work in 5.0
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jan 22 16:25:00 +0000 2010
>Originator:     Havard Eidnes
>Release:        NetBSD 5.0.1_PATCH
>Organization:
        NORDUnet AS
>Environment:
        
        
System: 
Architecture: i386
Machine: i386
>Description:
        Well, the basic problem is that Veritas NetBackup (which is
        only available in binary form, and we use the Linux version)
        fails to work in NetBSD 5.0.  It works fine in 4.0.

        Because we run a Linux binary, we need to take special steps
        to ensure that the entire /usr gets backed up, such that the
        backup of /usr/lib ends up with the NetBSD libraries and not
        the Linux-emulation libraries in /emul/linux/usr/lib instead.

        So...  Since we want to have all the file systems we should
        back up under a common root, we need to re-mount the relevant
        file systems somewhere, using some method.

        We have tried two methods:

        1) null mounts
        2) NFS mounts

        With null mounts in 4.0, we encountered a problem that after a
        few days of run-time, all kernel memory was consumed, and if
        my recollection is correct, it would basically seize up, so
        that manual intervention via DDB was required to bring it back
        to life.  We therefore looked at alternatives, and ended up
        with NFS mounts.

        We have re-tried the null mounts, but the unidentified memory
        leak appears to still be there in 5.0, so that's not a usable
        method.

        The NFS mount method has worked well in 4.0, but is giving us
        problems in 5.0.  After some debugging, we have found that one
        of the two "bpbkar" processes ends up in "uvn_fp2" wait, most
        probably while holding a lock, and fails to make any progress
        beyond that point.  New bpbkar processes (the backup server
        initiates new ones on a schedule) then end up in "tstile"
        state, as do "df" processes, whether native or
        Linux-emulated.

        Our most recent attempt at rebooting also got stuck in tstile
        while unmounting one of the file systems, and here is some
        selected output from the console log:

Jan 22 16:29:10 mail-server shutdown: reboot by he: new kernel 
Jan 22 16:29:24 mail-server syslogd: Exiting on signal 15
syncing disks... 1 done
unmounting file systems...
unmounting /usr/pkg/emul/linux/netbackup/home (localhost:/home)...[halt sent]
fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip c05b2ecc cs 8 eflags 202 cr2 bb906538 ilevel 8
Stopped in pid 0.2 (system) at  netbsd:breakpoint+0x4:  popl    %ebp
db{0}: ps      
PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
20756    1 3   1         4           e80c0d40             reboot tstile
6695     1 3   2   9020004           e4807580             bpbkar tstile
3952     1 3   2   9020004           e46bd280             bpbkar tstile
3081     1 3   2   9020004           e9408d20                 df tstile
2519     1 3   1   9020004           d89250c0                 df tstile
3006     1 3   1   9020004           d8898ca0                 df tstile
17026    1 3   2   9020004           e4807800             bpbkar nfsrcv
5421     1 3   1   9020004           d89a07a0             bpbkar uvn_fp2
1        1 3   2   8020084           ce3bc840               init wait
0       73 3   0       204           e94080a0             ktrace ktrwait
              72 3   0       204           e9d5eae0             ktrace ktrwait
              68 3   1       204           d4744300              nfsio netio
              67 3   2       204           d4744580              nfsio nfsrcv
              66 3   1       204           d4744800              nfsio nfsrcv
              65 3   0       204           d4744a80              nfsio nfsrcv

        (why did it suddenly start indenting the ps listing at that
        point?!?)
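
        One rough way to summarize a listing like the one above is to
        count rows per wait channel.  The Python sketch below does
        that for a few rows copied from the listing.  The helper name
        wait_channels and the assumption that the wait channel is the
        last whitespace-separated field are illustrative only; they
        are not part of DDB.

```python
from collections import Counter

# A few rows copied from the "ps" listing above (not the full listing).
PS_ROWS = """\
20756    1 3   1         4           e80c0d40             reboot tstile
6695     1 3   2   9020004           e4807580             bpbkar tstile
3952     1 3   2   9020004           e46bd280             bpbkar tstile
3081     1 3   2   9020004           e9408d20                 df tstile
17026    1 3   2   9020004           e4807800             bpbkar nfsrcv
5421     1 3   1   9020004           d89a07a0             bpbkar uvn_fp2
"""


def wait_channels(listing):
    """Count rows per wait channel, assuming it is the last field."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[-1]] += 1
    return counts


if __name__ == "__main__":
    # Print the most common wait channels first.
    for wchan, n in wait_channels(PS_ROWS).most_common():
        print(f"{wchan:8s} {n}")
```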

db{0}: trace/t 0t5421
trace: pid 5421 lid 1 at 0xd89c43cc
sleepq_block(0,0,c0aaba51,c0b27c80,0,c150a9ac,9,c2580910,da4a13a0,0) at netbsd:sleepq_block+0xeb
mtsleep(c2580910,204,c0aaba51,0,da4a13a0,da4a13a0,10,6,0,0) at netbsd:mtsleep+0x12d
uvn_findpage(d89c45ac,0,d89c44ac,c05343fa,0,0,2,0,994000,d89c45cc) at netbsd:uvn_findpage+0x92
uvn_findpages(da4a13a0,24e60000,2,d89c45ec,d89c45ac,0,994000,20,2,0) at netbsd:uvn_findpages+0x73
genfs_getpages(d89c46b0,0,0,0,0,24ed0000,0,0,2,d89c465c) at netbsd:genfs_getpages+0x743
nfs_getpages(d89c46b0,4,24e62000,2,0,10000,24ee0000,c089d600,da4a13a0,24e60000) at netbsd:nfs_getpages+0xbb
VOP_GETPAGES(da4a13a0,24e60000,2,d89c4750,d89c47c8,0,1,0,1802,0) at netbsd:VOP_GETPAGES+0x65
uvn_get(da4a13a0,24e60000,2,d89c4750,d89c47c8,0,1,0,1802,d89a07a0) at netbsd:uvn_get+0x117
ubc_fault(d89c48e0,d3981000,d89c48a0,1,0,1,42,246,8,c0bc8d04) at netbsd:ubc_fault+0x170
uvm_fault_internal(c0bc21c0,d3981000,1,0,c262cfca,c0000,0,c05a6cfa,6,6) at netbsd:uvm_fault_internal+0x3a9
trap() at netbsd:trap+0x797
--- trap (number 6) ---
copyout(d87906c0,d3981000,8249438,2000,d87906c0,0,d3981000,24e60000,2,d3981000) at netbsd:copyout+0x33
uiomove(d3981000,2000,d89c4c8c,d89c4adc,0,101,deaddead,0,1829b58,0) at netbsd:uiomove+0x62
ubc_uiomove(da4a13a0,d89c4c8c,10000,0,101,7c356d21,d89c4b2c,c085d206,da4945c0,da4a1440) at netbsd:ubc_uiomove+0xeb
nfs_bioread(da4a13a0,d89c4c8c,0,ce3a6f00,0,da4a13a0,d89c4c2c,c053d6f4,d89c4c14,da4a13a0) at netbsd:nfs_bioread+0x312
nfs_read(d89c4c14,da4a13a0,c089d3c0,da4a13a0,1,20001,d89c4c2c,c0534d58,c089ce80,da4a13a0) at netbsd:nfs_read+0x43
VOP_READ(da4a13a0,d89c4c8c,0,ce3a6f00,d40a1040,0,9c4c6c,16,10000,8249438) at netbsd:VOP_READ+0x44
vn_read(d8c4d940,d8c4d940,d89c4c8c,ce3a6f00,1,0,0,0,d89a07a0,d89c4d48) at netbsd:vn_read+0x93
dofileread(9,d8c4d940,8249438,10000,d8c4d940,1,d89c4d28,d89c4d48,d89c4d48,d89a07a0) at netbsd:dofileread+0x75
sys_read(d89a07a0,d89c4d10,d89c4d28,9c4d20,96,10,c0b4a744,9,8249438,10000) at netbsd:sys_read+0x6f
linux_syscall(d89c4d48,2b,2b,2b,2b,610,8259338,bfbeec08,9,10000) at netbsd:linux_syscall+0x9b
db{0}: 

        Now, inspection shows that the 5th argument to mtsleep is the
        mutex it sleeps on, and that it's usable with "show lock" in
        DDB:

db{0}: show lock 0xda4a13a0
lock address : 0x00000000da4a13a0 type     :     sleep/adaptive
initialized  : 0x00000000c052b9c6
shared holds :                  0 exclusive:                  0
shares wanted:                  0 exclusive:                  0
current cpu  :                  0 last held:                  1
current lwp  : 0x00000000ce3a7c80 last held: 000000000000000000
last locked  : 0x00000000c03d3f4c unlocked : 0x00000000c03d403b
owner field  : 000000000000000000 wait/spin:                0/0

Turnstile chain at 0xc150ba80.
=> No active turnstile for this lock.
db{0}: 
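
        For reference, the address passed to "show lock" can be picked
        out of the trace mechanically.  The Python sketch below takes
        the mtsleep frame from the trace above, written on one line,
        and extracts its 5th argument.  The helper name frame_args is
        hypothetical, and the sketch assumes the values DDB prints in
        the parentheses correspond to the real call arguments.

```python
import re

# The mtsleep frame copied from the trace above, written on one line.
FRAME = ("mtsleep(c2580910,204,c0aaba51,0,da4a13a0,da4a13a0,10,6,0,0) "
         "at netbsd:mtsleep+0x12d")


def frame_args(frame):
    """Return (function name, list of hex argument strings) for one frame."""
    m = re.match(r"(\w+)\(([^)]*)\)", frame)
    if m is None:
        raise ValueError("not a trace frame: %r" % frame)
    args = m.group(2).split(",") if m.group(2) else []
    return m.group(1), args


name, args = frame_args(FRAME)
mutex = int(args[4], 16)  # 5th argument is at zero-based index 4
print(name, hex(mutex))   # prints "mtsleep 0xda4a13a0"
```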

        The "last locked" and "unlocked" values are:

db{0}: x/i 0x00000000c03d3f4c
netbsd:nfs_sync+0x7c:   cmpl    $0x3,0xc(%ebp)
db{0}: x/i 0x00000000c03d403b
netbsd:nfs_sync+0x16b:  jmp     netbsd:nfs_sync+0x44
db{0}:

        Now, the way I read the "show lock" output, this lock is
        currently not held, while the "bpbkar" process is still
        waiting on it.  That may be the reason that process is not
        making any progress.
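
        The reading above can be expressed as a small check.  The
        Python sketch below parses a few lines copied from the "show
        lock" output and treats an all-zero owner field as "not
        held".  That interpretation follows the reading in this
        report and is an assumption, as is the helper name
        lock_owner.

```python
import re

# A few lines copied from the "show lock" output above.
SHOW_LOCK = """\
lock address : 0x00000000da4a13a0 type     :     sleep/adaptive
shared holds :                  0 exclusive:                  0
owner field  : 000000000000000000 wait/spin:                0/0
"""


def lock_owner(output):
    """Return the owner field of a "show lock" dump as an integer."""
    m = re.search(r"^owner field\s*:\s*(\S+)", output, re.MULTILINE)
    if m is None:
        raise ValueError("no owner field found")
    return int(m.group(1), 16)


owner = lock_owner(SHOW_LOCK)
# Assumption: a zero owner field means no LWP currently owns the mutex.
print("held" if owner else "not held")
```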

        Now, as to the root cause of this problem, I have no idea, and
        would like further input to help narrow it down.


        
>How-To-Repeat:
        Try to use Linux-emulated Veritas NetBackup to back up
        NFS-mounted file systems, and watch it get stuck.
        
>Fix:
        Sorry, no idea -- request help for digging further.

>Unformatted: