NetBSD-Bugs archive
kern/42661: Linux-emulated Veritas NetBackup fails to work in 5.0
>Number: 42661
>Category: kern
>Synopsis: Linux-emulated Veritas NetBackup fails to work in 5.0
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Jan 22 16:25:00 +0000 2010
>Originator: Havard Eidnes
>Release: NetBSD 5.0.1_PATCH
>Organization:
NORDUnet AS
>Environment:
System:
Architecture: i386
Machine: i386
>Description:
Well, the basic problem is that Veritas NetBackup (which is
only available in binary form, and we use the Linux version)
fails to work in NetBSD 5.0. It works fine in 4.0.
Because we run a Linux binary, we need to take special steps
to ensure that the entire /usr gets backed up, such that the
backup of /usr/lib ends up with the NetBSD libraries and not
the Linux-emulation libraries in /emul/linux/usr/lib instead.
So... Since we want to have all the file systems we should
back up under a common root, we need to re-mount the relevant
file systems somewhere, using some method.
We have tried two methods:
1) null mounts
2) NFS mounts
With null mounts in 4.0, we encountered a problem where, after
a few days of run-time, all kernel memory was consumed, and if
my recollection is correct, the system would basically seize
up, so that manual intervention via DDB was required to bring
it back to life. We therefore looked at alternatives, and ended
up with NFS mounts.
We have re-tried the null mounts, but the unidentified memory
leak appears to still be present in 5.0, so that's not a
usable method.
The NFS mount method has worked well in 4.0, but is giving us
problems in 5.0. After some debugging, we have found that one
of the two "bpbkar" processes ends up in uvn_fp2 wait, most
probably while holding a lock, and fails to make any progress
beyond that point. New bpbkar processes (the backup server
initiates new ones on a schedule) then end up in "tstile"
state, as do "df" processes, whether native or
Linux-emulated.
Our most recent attempt at rebooting also got stuck in tstile
while unmounting one of the file systems, and here is some
selected output from the console log:
Jan 22 16:29:10 mail-server shutdown: reboot by he: new kernel
Jan 22 16:29:24 mail-server syslogd: Exiting on signal 15
syncing disks... 1 done
unmounting file systems...
unmounting /usr/pkg/emul/linux/netbackup/home (localhost:/home)...[halt sent]
fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip c05b2ecc cs 8 eflags 202 cr2 bb906538 ilevel 8
Stopped in pid 0.2 (system) at netbsd:breakpoint+0x4: popl %ebp
db{0}: ps
  PID LID S CPU   FLAGS STRUCT LWP * NAME   WAIT
20756   1 3   1       4     e80c0d40 reboot tstile
 6695   1 3   2 9020004     e4807580 bpbkar tstile
 3952   1 3   2 9020004     e46bd280 bpbkar tstile
 3081   1 3   2 9020004     e9408d20 df     tstile
 2519   1 3   1 9020004     d89250c0 df     tstile
 3006   1 3   1 9020004     d8898ca0 df     tstile
17026   1 3   2 9020004     e4807800 bpbkar nfsrcv
 5421   1 3   1 9020004     d89a07a0 bpbkar uvn_fp2
    1   1 3   2 8020084     ce3bc840 init   wait
    0  73 3   0     204     e94080a0 ktrace ktrwait
       72 3   0     204     e9d5eae0 ktrace ktrwait
       68 3   1     204     d4744300 nfsio  netio
       67 3   2     204     d4744580 nfsio  nfsrcv
       66 3   1     204     d4744800 nfsio  nfsrcv
       65 3   0     204     d4744a80 nfsio  nfsrcv
(why did it suddenly start indenting the ps listing at that
point?!?)
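To make the pile-up easier to see, here is a minimal sketch that
tallies the user-process rows of the listing above by wait channel.
The helper name wait_summary is made up for illustration; it only
assumes that the last column of each row is the WAIT field.

```python
from collections import Counter

# User-process rows copied from the DDB "ps" listing above
# (columns: PID LID S CPU FLAGS STRUCT-LWP NAME WAIT).
DDB_PS = """\
20756 1 3 1 4 e80c0d40 reboot tstile
6695 1 3 2 9020004 e4807580 bpbkar tstile
3952 1 3 2 9020004 e46bd280 bpbkar tstile
3081 1 3 2 9020004 e9408d20 df tstile
2519 1 3 1 9020004 d89250c0 df tstile
3006 1 3 1 9020004 d8898ca0 df tstile
17026 1 3 2 9020004 e4807800 bpbkar nfsrcv
5421 1 3 1 9020004 d89a07a0 bpbkar uvn_fp2
"""


def wait_summary(listing):
    """Count rows per wait channel (hypothetical helper, not a DDB feature)."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and the column header.
        if len(fields) < 2 or fields[0] == "PID":
            continue
        counts[fields[-1]] += 1
    return counts


if __name__ == "__main__":
    for wchan, count in wait_summary(DDB_PS).most_common():
        print(wchan, count)
```

Six of the eight rows sit in tstile, which fits the picture of
everything queueing up behind the single bpbkar in uvn_fp2.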
db{0}: trace/t 0t5421
trace: pid 5421 lid 1 at 0xd89c43cc
sleepq_block(0,0,c0aaba51,c0b27c80,0,c150a9ac,9,c2580910,da4a13a0,0) at
netbsd:sleepq_block+0xeb
mtsleep(c2580910,204,c0aaba51,0,da4a13a0,da4a13a0,10,6,0,0) at
netbsd:mtsleep+0x12d
uvn_findpage(d89c45ac,0,d89c44ac,c05343fa,0,0,2,0,994000,d89c45cc) at
netbsd:uvn_findpage+0x92
uvn_findpages(da4a13a0,24e60000,2,d89c45ec,d89c45ac,0,994000,20,2,0) at
netbsd:uvn_findpages+0x73
genfs_getpages(d89c46b0,0,0,0,0,24ed0000,0,0,2,d89c465c) at
netbsd:genfs_getpages+0x743
nfs_getpages(d89c46b0,4,24e62000,2,0,10000,24ee0000,c089d600,da4a13a0,24e60000)
at netbsd:nfs_getpages+0xbb
VOP_GETPAGES(da4a13a0,24e60000,2,d89c4750,d89c47c8,0,1,0,1802,0) at
netbsd:VOP_GETPAGES+0x65
uvn_get(da4a13a0,24e60000,2,d89c4750,d89c47c8,0,1,0,1802,d89a07a0) at
netbsd:uvn_get+0x117
ubc_fault(d89c48e0,d3981000,d89c48a0,1,0,1,42,246,8,c0bc8d04) at
netbsd:ubc_fault+0x170
uvm_fault_internal(c0bc21c0,d3981000,1,0,c262cfca,c0000,0,c05a6cfa,6,6) at
netbsd:uvm_fault_internal+0x3a9
trap() at netbsd:trap+0x797
--- trap (number 6) ---
copyout(d87906c0,d3981000,8249438,2000,d87906c0,0,d3981000,24e60000,2,d3981000)
at netbsd:copyout+0x33
uiomove(d3981000,2000,d89c4c8c,d89c4adc,0,101,deaddead,0,1829b58,0) at
netbsd:uiomove+0x62
ubc_uiomove(da4a13a0,d89c4c8c,10000,0,101,7c356d21,d89c4b2c,c085d206,da4945c0,da4a1440)
at netbsd:ubc_uiomove+0xeb
nfs_bioread(da4a13a0,d89c4c8c,0,ce3a6f00,0,da4a13a0,d89c4c2c,c053d6f4,d89c4c14,da4a13a0)
at netbsd:nfs_bioread+0x312
nfs_read(d89c4c14,da4a13a0,c089d3c0,da4a13a0,1,20001,d89c4c2c,c0534d58,c089ce80,da4a13a0)
at netbsd:nfs_read+0x43
VOP_READ(da4a13a0,d89c4c8c,0,ce3a6f00,d40a1040,0,9c4c6c,16,10000,8249438) at
netbsd:VOP_READ+0x44
vn_read(d8c4d940,d8c4d940,d89c4c8c,ce3a6f00,1,0,0,0,d89a07a0,d89c4d48) at
netbsd:vn_read+0x93
dofileread(9,d8c4d940,8249438,10000,d8c4d940,1,d89c4d28,d89c4d48,d89c4d48,d89a07a0)
at netbsd:dofileread+0x75
sys_read(d89a07a0,d89c4d10,d89c4d28,9c4d20,96,10,c0b4a744,9,8249438,10000) at
netbsd:sys_read+0x6f
linux_syscall(d89c4d48,2b,2b,2b,2b,610,8259338,bfbeec08,9,10000) at
netbsd:linux_syscall+0x9b
db{0}:
Now, inspection shows that the 5th argument to mtsleep is the
mutex it sleeps on, and that it's usable with "show lock" in
DDB:
db{0}: show lock 0xda4a13a0
lock address : 0x00000000da4a13a0 type : sleep/adaptive
initialized : 0x00000000c052b9c6
shared holds : 0 exclusive: 0
shares wanted: 0 exclusive: 0
current cpu : 0 last held: 1
current lwp : 0x00000000ce3a7c80 last held: 000000000000000000
last locked : 0x00000000c03d3f4c unlocked : 0x00000000c03d403b
owner field : 000000000000000000 wait/spin: 0/0
Turnstile chain at 0xc150ba80.
=> No active turnstile for this lock.
db{0}:
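For anyone repeating this on another trace, the address handed to
"show lock" can be lifted from the mtsleep frame mechanically. The
sketch below is only an illustration: frame_args is a made-up helper,
and it relies on the observation above that the 5th word is the mutex.
DDB prints ten stack words for every frame in this trace, so words
beyond a function's real parameters are not meaningful.

```python
import re

# The mtsleep frame exactly as it appears in the trace above.
FRAME = "mtsleep(c2580910,204,c0aaba51,0,da4a13a0,da4a13a0,10,6,0,0)"


def frame_args(frame):
    """Split one DDB trace frame into (function name, list of hex words)."""
    match = re.match(r"(\w+)\((.*)\)", frame)
    if match is None:
        raise ValueError("not a trace frame: %r" % frame)
    name, arg_text = match.groups()
    words = arg_text.split(",") if arg_text else []
    return name, words


name, args = frame_args(FRAME)
lock_addr = "0x" + args[4]  # 5th word, i.e. zero-based index 4
print(name, lock_addr)      # prints "mtsleep 0xda4a13a0"
```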
The "last locked" and "unlocked" values are:
db{0}: x/i 0x00000000c03d3f4c
netbsd:nfs_sync+0x7c: cmpl $0x3,0xc(%ebp)
db{0}: x/i 0x00000000c03d403b
netbsd:nfs_sync+0x16b: jmp netbsd:nfs_sync+0x44
db{0}:
Now, the way I read the "show lock" output, this lock is
currently not held, while the "bpbkar" process is still
waiting on it. That may be the reason that process is not
making any progress.
Now, as to the root cause of this problem, I have no idea, and
would like further input to help narrow it down.
>How-To-Repeat:
Try to use Linux-emulated Veritas NetBackup to back up
NFS-mounted file systems, and watch it get stuck.
>Fix:
Sorry, no idea -- request help for digging further.
>Unformatted: