NetBSD-Bugs archive


Re: kern/46879: panic in reboot 5_1_STABLE in dkwedge - dead-lock detected



Hi again,

The attached patch for /usr/src/sys/dev/dkwedge/dk.c seems to fix the problem.
No panic occurs on reboot anymore.

But someone with detailed knowledge of the mutex order and the reference management of the wedge code should have a look at it prior to integration into the source tree.

best regards

W. Stukenbrock

Wolfgang Stukenbrock wrote:

The following reply was made to PR kern/46879; it has been noted by GNATS.

From: Wolfgang Stukenbrock <wolfgang.stukenbrock%nagler-company.com@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc:
Subject: Re: kern/46879: panic in reboot 5_1_STABLE in dkwedge - dead-lock detected
Date: Fri, 31 Aug 2012 12:36:21 +0200

 Hi again,

 I've had another look at it ... The problem seems to be in src/sys/dev/dkwedge/dk.c. I do not understand the full semantics of the locks (dk_openlock and dk_rawlock), but I've noticed the following: in dkclose() the dk_openlock of the wedge is entered, and if a close of the parent is needed, the dk_rawlock of the parent is taken; no dk_openlock is acquired on the parent.
 The same is done in dkwedge_del().

 In both functions vn_close() is called on dk_rawvp with the dk_openlock of the wedge held. Here the dkclose() on the mounted wedge will call raidclose() of the underlying raid device. If doing_shutdown is set, the raid device gets destroyed and dkwedge_delall() is called for the raid device. dkwedge_delall() now copies the information of the first wedge - if any - into a local buffer and calls dkwedge_del() in order to destroy it. This will enter the dk_openlock on that wedge again, but we still hold it from the earlier dkclose().

 This would mean that the panic is not related to a layered raid device, as I expected before - it will happen for every raid device with a wedge on it.

 I've tested this with:
 - a BSD-label on sd0 and sd1,
 - a raid device (stripe in this case) of sd0 and sd1,
 - a GPT-label with one wedge on this raid device.

 If the filesystem on the wedge is mounted when a reboot occurs, I get the following panic (and trace) in DDB:

 syncing disks... done
 unmounting file systems...Mutex error: mutex_vector_enter: locking against myself

 lock address : 0xffff80002f9ffb70
 current cpu  :                 10
 current lwp  : 0xffff800087b7b000
 owner field  : 0xffff800087b7b000 wait/spin:                0/0
panic: lock error
 fatal breakpoint trap in supervisor mode
trap type 1 code 0 rip ffffffff804b1045 cs 8 rflags 246 cr2 7f7ffd620030 cpl 0 rsp ffff800088062530
 Stopped in pid 594.1 (reboot) at        netbsd:breakpoint+0x5:  leave
 db{10}> trace
 breakpoint() at netbsd:breakpoint+0x5
 panic() at netbsd:panic+0x24d
 lockdebug_abort() at netbsd:lockdebug_abort+0x42
 mutex_vector_enter() at netbsd:mutex_vector_enter+0x208
 dkwedge_del() at netbsd:dkwedge_del+0x181
 dkwedge_delall() at netbsd:dkwedge_delall+0x65
 raidclose() at netbsd:raidclose+0x133
 bdev_close() at netbsd:bdev_close+0x89
 spec_close() at netbsd:spec_close+0x231
 VOP_CLOSE() at netbsd:VOP_CLOSE+0x62
 vn_close() at netbsd:vn_close+0x51
 dkclose() at netbsd:dkclose+0xcb
 bdev_close() at netbsd:bdev_close+0x89
 spec_close() at netbsd:spec_close+0x231
 VOP_CLOSE() at netbsd:VOP_CLOSE+0x62
 ffs_unmount() at netbsd:ffs_unmount+0x11e
 VFS_UNMOUNT() at netbsd:VFS_UNMOUNT+0x2e
 dounmount() at netbsd:dounmount+0xd5
 vfs_unmountall() at netbsd:vfs_unmountall+0x55
 cpu_reboot() at netbsd:cpu_reboot+0x100
 sys_reboot() at netbsd:sys_reboot+0x5f
 syscall() at netbsd:syscall+0xa0
 db{10}>
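
 The re-entry visible in the trace (dkclose() holds the lock, and dkwedge_del() further down the same call chain tries to take it again) can be reproduced in miniature in userland. The following is only a sketch under assumptions, not NetBSD code: all toy_* names and simulate_reentrant_close() are made up for illustration, and a POSIX error-checking mutex stands in for the kernel mutex, so the second lock attempt returns EDEADLK instead of panicking.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for a wedge; only the lock matters here. */
struct toy_wedge {
	pthread_mutex_t openlock;	/* plays the role of dk_openlock */
};

/* Plays the role of dkwedge_del(): it wants the wedge's open lock. */
static int
toy_wedge_del(struct toy_wedge *w)
{
	int error = pthread_mutex_lock(&w->openlock);

	if (error == 0)
		pthread_mutex_unlock(&w->openlock);
	return error;
}

/* Plays the role of vn_close() -> raidclose() -> dkwedge_delall(). */
static int
toy_parent_close(struct toy_wedge *w)
{
	return toy_wedge_del(w);
}

/* Plays the role of the unpatched dkclose(): close with the lock held. */
static int
toy_wedge_close_unpatched(struct toy_wedge *w)
{
	int error;

	pthread_mutex_lock(&w->openlock);
	error = toy_parent_close(w);	/* re-enters openlock further down */
	pthread_mutex_unlock(&w->openlock);
	return error;
}

/* Returns the result of the inner (second) lock attempt. */
int
simulate_reentrant_close(void)
{
	struct toy_wedge w;
	pthread_mutexattr_t attr;
	int error;

	error = pthread_mutexattr_init(&attr);
	assert(error == 0);
	/* Error-checking type: relocking by the owner fails with EDEADLK. */
	error = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	assert(error == 0);
	error = pthread_mutex_init(&w.openlock, &attr);
	assert(error == 0);
	pthread_mutexattr_destroy(&attr);

	error = toy_wedge_close_unpatched(&w);
	pthread_mutex_destroy(&w.openlock);
	return error;
}
```

 Built with something like cc -pthread, simulate_reentrant_close() should return EDEADLK, which is the userland counterpart of the "locking against myself" report above.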
 Yes - my analysis above seems to be correct. I would tend to raise the level from serious to critical, because a wedge on a RAIDframe device is not usable!

 I'm not sure whether releasing the dk_openlock mutex prior to calling vn_close() in both functions mentioned above would be the correct solution. I'm also not sure whether it would be possible to set dk_rawvp to NULL prior to calling vn_close() on it, with the old value stored in a temporary variable. If the answer to these questions is yes, then that would be the solution.

 Could someone with more knowledge of the mutex order and the semantics of dk_rawvp have a look at this topic?
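
 The idea raised in these questions (remember dk_rawvp in a temporary, set it to NULL under the locks, release the locks, and only then close) can be sketched in userland as follows. This is an illustration under assumptions, not kernel code: the toy_* types and functions and run_deferred_close_demo() are hypothetical, POSIX error-checking mutexes stand in for the kernel mutexes, and toy_vn_close() simply re-takes the open lock to mimic what dkwedge_del() would do during the close. Whether this ordering is safe with respect to the real reference management is exactly the open question.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct toy_vnode { int closed; };

struct toy_parent {
	pthread_mutex_t rawlock;	/* plays the role of dk_rawlock */
	int rawopens;			/* plays the role of dk_rawopens */
	struct toy_vnode *rawvp;	/* plays the role of dk_rawvp */
};

struct toy_wedge {
	pthread_mutex_t openlock;	/* plays the role of dk_openlock */
	struct toy_parent *parent;
	int reentry_error;		/* result of the lock attempt during close */
};

/* Initialise an error-checking mutex so self-relocking is reported. */
static void
toy_mutex_init(pthread_mutex_t *m)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(m, &attr);
	pthread_mutexattr_destroy(&attr);
}

/* Stand-in for vn_close(): re-enters openlock as dkwedge_del() would. */
static int
toy_vn_close(struct toy_wedge *w, struct toy_vnode *vp)
{
	w->reentry_error = pthread_mutex_lock(&w->openlock);
	if (w->reentry_error == 0)
		pthread_mutex_unlock(&w->openlock);
	vp->closed = 1;
	return w->reentry_error;
}

/* Detach the raw vnode under the locks, close it after releasing them. */
int
toy_wedge_close_deferred(struct toy_wedge *w)
{
	struct toy_vnode *tmp_vp = NULL;

	pthread_mutex_lock(&w->openlock);
	pthread_mutex_lock(&w->parent->rawlock);
	if (w->parent->rawopens-- == 1) {
		assert(w->parent->rawvp != NULL);
		tmp_vp = w->parent->rawvp;	/* remember it ... */
		w->parent->rawvp = NULL;	/* ... and detach it */
	}
	pthread_mutex_unlock(&w->parent->rawlock);
	pthread_mutex_unlock(&w->openlock);

	if (tmp_vp != NULL)
		return toy_vn_close(w, tmp_vp);	/* no lock held here */
	return 0;
}

/* Returns 0 if the last close succeeded without a self-deadlock. */
int
run_deferred_close_demo(void)
{
	struct toy_vnode vn = { 0 };
	struct toy_parent p = { .rawopens = 1, .rawvp = &vn };
	struct toy_wedge w = { .parent = &p, .reentry_error = 0 };
	int error, ok;

	toy_mutex_init(&p.rawlock);
	toy_mutex_init(&w.openlock);

	error = toy_wedge_close_deferred(&w);
	ok = (error == 0 && vn.closed == 1 && p.rawvp == NULL &&
	    p.rawopens == 0);

	pthread_mutex_destroy(&w.openlock);
	pthread_mutex_destroy(&p.rawlock);
	return ok ? 0 : -1;
}
```

 In this toy model the close no longer runs with the open lock held, so the re-entry from inside the close succeeds and run_deferred_close_demo() returns 0.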
 Thanks in advance
 gnats-admin%NetBSD.org@localhost wrote:
 > Thank you very much for your problem report.
 > It has the internal identification `kern/46879'.
 > The individual assigned to look at your
 > report is: kern-bug-people.
 >
 >>Category:       kern
 >>Responsible:    kern-bug-people
 >>Synopsis:       panic in reboot 5_1_STABLE in dkwedge - dead-lock detected
 >>Arrival-Date:   Thu Aug 30 16:10:00 +0000 2012



--


Dr. Nagler & Company GmbH
Hauptstraße 9
92253 Schnaittenbach

Tel. +49 9622/71 97-42
Fax +49 9622/71 97-50

Wolfgang.Stukenbrock%nagler-company.com@localhost
http://www.nagler-company.com


Hauptsitz: Schnaittenbach
Handelregister: Amberg HRB
Gerichtsstand: Amberg
Steuernummer: 201/118/51825
USt.-ID-Nummer: DE 273143997
Geschäftsführer: Dr. Martin Nagler, Dr. Dr. Karl-Kuno Kunze

--- dk.c        2012/08/31 10:57:31     1.1
+++ dk.c        2012/08/31 11:02:46
@@ -432,6 +432,7 @@
 dkwedge_del(struct dkwedge_info *dkw)
 {
        struct dkwedge_softc *sc = NULL;
+       struct vnode *tmp_vp = NULL;
        u_int unit;
        int bmaj, cmaj, s;
 
@@ -480,15 +481,15 @@
                mutex_enter(&sc->sc_parent->dk_rawlock);
                if (sc->sc_parent->dk_rawopens-- == 1) {
                        KASSERT(sc->sc_parent->dk_rawvp != NULL);
-                       mutex_exit(&sc->sc_parent->dk_rawlock);
-                       (void) vn_close(sc->sc_parent->dk_rawvp, FREAD | FWRITE,
-                           NOCRED);
+                       tmp_vp = sc->sc_parent->dk_rawvp;
                        sc->sc_parent->dk_rawvp = NULL;
-               } else
-                       mutex_exit(&sc->sc_parent->dk_rawlock);
+               }
+               mutex_exit(&sc->sc_parent->dk_rawlock);
                sc->sc_dk.dk_openmask = 0;
        }
        mutex_exit(&sc->sc_dk.dk_openlock);
+       if (tmp_vp != NULL)
+               (void) vn_close(tmp_vp, FREAD | FWRITE, NOCRED);
 
        /* Announce our departure. */
        aprint_normal("%s at %s (%s) deleted\n", device_xname(sc->sc_dev),
@@ -964,7 +965,7 @@
 dkclose(dev_t dev, int flags, int fmt, struct lwp *l)
 {
        struct dkwedge_softc *sc = dkwedge_lookup(dev);
-       int error = 0;
+       struct vnode *tmp_vp = NULL;
 
        KASSERT(sc->sc_dk.dk_openmask != 0);
 
@@ -981,17 +982,17 @@
                mutex_enter(&sc->sc_parent->dk_rawlock);
                if (sc->sc_parent->dk_rawopens-- == 1) {
                        KASSERT(sc->sc_parent->dk_rawvp != NULL);
-                       mutex_exit(&sc->sc_parent->dk_rawlock);
-                       error = vn_close(sc->sc_parent->dk_rawvp,
-                           FREAD | FWRITE, NOCRED);
+                       tmp_vp = sc->sc_parent->dk_rawvp;
                        sc->sc_parent->dk_rawvp = NULL;
-               } else
-                       mutex_exit(&sc->sc_parent->dk_rawlock);
+               }
+               mutex_exit(&sc->sc_parent->dk_rawlock);
        }
 
        mutex_exit(&sc->sc_dk.dk_openlock);
 
-       return (error);
+       if (tmp_vp != NULL)
+               return vn_close(tmp_vp, FREAD | FWRITE, NOCRED);
+       return 0;
 }
 
 /*

