NetBSD-Bugs archive


Re: kern/54541: kernel panic using "zfs diff"



Patrick Welche <prlw1%cam.ac.uk@localhost> writes:

> The following reply was made to PR kern/54541; it has been noted by GNATS.
>
> From: Patrick Welche <prlw1%cam.ac.uk@localhost>
> To: gnats-bugs%netbsd.org@localhost
> Cc: 
> Subject: Re: kern/54541: kernel panic using "zfs diff"
> Date: Fri, 11 Oct 2019 17:01:34 +0100
>
>  On Wed, Oct 09, 2019 at 05:05:01PM +0000, Christos Zoulas wrote:
>  >  Something seems to not understand that this is a hijacked fd which it seems
>  >  to be: 133 (128 + 5)...
>  
>  /dev/zfs is hijacked:
>  export RUMPHIJACK=blanket=/dev/zfs:/dk:/storage,sysctl=yes,modctl=yes
>  
>  rumpns_fd_getfile receives the 133 rather than 5 so complains.
>  So I seem to be seeing a rump issue rather than a zfs issue?
>  

I would say that there is a pretty good chance that rump does not quite
handle ZFS correctly.  I won't speculate as to why.
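
For reference, the descriptor arithmetic quoted above (133 = 128 + 5) can be sketched as follows.  This is only an illustration: the offset of 128 is taken from the remark in this thread, and the names FD_OFFSET, fd_is_hijacked, fd_rump_to_host and fd_host_to_rump are hypothetical, not the actual rumphijack interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: offset taken from the "133 (128 + 5)" remark in this thread. */
#define FD_OFFSET 128

/* True if the descriptor lies in the range used for hijacked fds. */
static bool
fd_is_hijacked(int fd)
{
	return fd >= FD_OFFSET;
}

/* Map a rump-kernel fd to the value the host process would see. */
static int
fd_rump_to_host(int rumpfd)
{
	assert(rumpfd >= 0);
	return rumpfd + FD_OFFSET;
}

/* Map a host-visible fd back to the rump-kernel fd; others pass through. */
static int
fd_host_to_rump(int fd)
{
	return fd_is_hijacked(fd) ? fd - FD_OFFSET : fd;
}
```

With this sketch, fd_host_to_rump(133) gives 5.  A lookup that is handed 133 without this translation would be looking for a descriptor that was never opened on the rump side, which would fit the complaint described above.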

Back to the original kernel panic: it is pretty simple for me to
reproduce as well.  This is a snippet of what I get:

[ 472375.2036056] panic: kernel diagnostic assertion "fdm != NULL" failed: file "/usr/src/sys/kern/vfs_trans.c", line 166 mount 0x0 invalid
.
.
.
[ 472375.2036056] vn_rdwr() at netbsd:vn_rdwr+0x136
[ 472375.2036056] write_record.part.1() at zfs:write_record.part.1+0x54
[ 472375.2036056] diff_cb() at zfs:diff_cb+0x236
[ 472375.2036056] traverse_visitbp() at zfs:traverse_visitbp+0x1b6
[ 472375.2036056] traverse_visitbp() at zfs:traverse_visitbp+0x52b
[ 472375.2036056] traverse_visitbp() at zfs:traverse_visitbp+0x52b
[ 472375.2036056] traverse_visitbp() at zfs:traverse_visitbp+0x52b
[ 472375.2036056] traverse_visitbp() at zfs:traverse_visitbp+0x52b
[ 472375.2036056] traverse_visitbp() at zfs:traverse_visitbp+0x52b
[ 472375.2036056] traverse_dnode() at zfs:traverse_dnode+0xda
[ 472375.2036056] traverse_visitbp() at zfs:traverse_visitbp+0x8ab
[ 472375.2036056] traverse_impl() at zfs:traverse_impl+0x16c
[ 472375.2036056] traverse_dataset_resume() at zfs:traverse_dataset_resume+0x44
[ 472375.2036056] dmu_diff() at zfs:dmu_diff+0x14c

The write_record call is in
src/external/cddl/osnet/dist/uts/common/fs/zfs/dmu_diff.c and it is
pretty small.  It might be interesting to know what the arguments to the
single vn_rdwr call are.

I won't have time right now to find this out for myself, however....

Recursion is involved in all of this; that is what the repeated
traverse_visitbp frames in the panic messages are about, and I wonder
if there is a missing or mishandled termination condition.  The panic
itself, in my case, is tripped by a DIAGNOSTIC assertion in a VOP
function.  It is a little confusing, but diff_cb is a callback (of some
sort) that appears to be set up by a call to traverse_dataset, which
shows up in the backtrace as traverse_dataset_resume (I think).
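
To make the shape of that concrete, here is a minimal sketch of a recursive walk that invokes a callback on every node and stops at an explicit termination condition.  It is not the ZFS code; struct node, visit and count_cb are hypothetical names used only to illustrate the pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree node; stands in for whatever is being traversed. */
struct node {
	struct node *child[2];
	int value;
};

/* Callback type: return 0 to continue, non-zero to abort the walk. */
typedef int (*visit_cb)(const struct node *, void *);

/* Visit n, then recurse into its children. */
static int
visit(const struct node *n, visit_cb cb, void *arg)
{
	int err;
	size_t i;

	assert(cb != NULL);

	/* Termination condition: without this check the recursion would
	 * run off the end of the tree. */
	if (n == NULL)
		return 0;

	if ((err = cb(n, arg)) != 0)
		return err;

	for (i = 0; i < 2; i++) {
		if ((err = visit(n->child[i], cb, arg)) != 0)
			return err;
	}
	return 0;
}

/* Example callback: count the nodes visited. */
static int
count_cb(const struct node *n, void *arg)
{
	(void)n;
	(*(int *)arg)++;
	return 0;
}
```

In a walk like this, each level of the tree adds one frame of the recursive function to the stack, and the callback appears at the top of the trace when it is the one that fails.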

I can only run this on a DOMU, so no kernel dumps, but I suspect that if
one could get a clean kernel dump somewhere else, it would become clear
what is going on.





-- 
Brad Spencer - brad%anduin.eldar.org@localhost - KC8VKS - http://anduin.eldar.org


