Re: kern/46879: panic in reboot 5_1_STABLE in dkwedge - dead-lock detected

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,Wolfgang.Stukenbrock%nagler-company.com@localhost
Subject: Re: kern/46879: panic in reboot 5_1_STABLE in dkwedge - dead-lock detected
From: Wolfgang Stukenbrock <wolfgang.stukenbrock%nagler-company.com@localhost>
Date: Fri, 31 Aug 2012 13:05:06 +0000 (UTC)

The following reply was made to PR kern/46879; it has been noted by GNATS.

From: Wolfgang Stukenbrock <wolfgang.stukenbrock%nagler-company.com@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: kern-bug-people%NetBSD.org@localhost, gnats-admin%NetBSD.org@localhost, 
netbsd-bugs%NetBSD.org@localhost
Subject: Re: kern/46879: panic in reboot 5_1_STABLE in dkwedge - dead-lock 
detected
Date: Fri, 31 Aug 2012 15:04:16 +0200

 This is a multi-part message in MIME format.
 --------------060607080304090807080703
 Content-Type: text/plain; charset=ISO-8859-1; format=flowed
 Content-Transfer-Encoding: 8bit
 
 Hi again,
 
 The attached patch for /usr/src/sys/dev/dkwedge/dk.c seems to fix the 
 problem.
 No panic occurs on reboot anymore.
 
 But someone with detailed knowledge of the mutex order and the reference 
 management of the wedge code should have a look at it prior to 
 integration into the source tree.
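 
 For reviewers, here is a minimal user-space sketch of the locking pattern
 involved. It is not the kernel code: it uses POSIX error-checking mutexes
 to mimic the LOCKDEBUG "locking against myself" detection, and the names
 struct parent, struct wedge, fake_vn_close(), close_locked() and
 close_deferred() are all hypothetical stand-ins made up for illustration.
 fake_vn_close() simply re-takes the wedge's open lock, the way dkwedge_del()
 does when it is reached through raidclose() during shutdown.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the parent disk (dk_rawlock, dk_rawopens, dk_rawvp). */
struct parent {
	pthread_mutex_t rawlock;
	int rawopens;
	void *rawvp;
};

/* Hypothetical stand-in for the wedge (dk_openlock plus parent pointer). */
struct wedge {
	pthread_mutex_t openlock;
	struct parent *parent;
};

/* Set up one wedge on one parent with a single raw open outstanding. */
static void
wedge_init(struct wedge *w, struct parent *p)
{
	pthread_mutexattr_t a;

	pthread_mutexattr_init(&a);
	/* Error-checking mutexes report a relock by the owner as EDEADLK. */
	pthread_mutexattr_settype(&a, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&w->openlock, &a);
	pthread_mutex_init(&p->rawlock, &a);
	pthread_mutexattr_destroy(&a);
	p->rawopens = 1;
	p->rawvp = p;		/* any non-NULL token */
	w->parent = p;
}

/* Models the close path that ends up taking the wedge's open lock again. */
static int
fake_vn_close(struct wedge *w)
{
	int rc = pthread_mutex_lock(&w->openlock);

	if (rc == 0)
		pthread_mutex_unlock(&w->openlock);
	return rc;		/* EDEADLK if the caller still owns openlock */
}

/* Old ordering: the close runs while openlock is still held. */
static int
close_locked(struct wedge *w)
{
	int rc = 0;

	pthread_mutex_lock(&w->openlock);
	pthread_mutex_lock(&w->parent->rawlock);
	if (w->parent->rawopens-- == 1) {
		pthread_mutex_unlock(&w->parent->rawlock);
		rc = fake_vn_close(w);
		w->parent->rawvp = NULL;
	} else
		pthread_mutex_unlock(&w->parent->rawlock);
	pthread_mutex_unlock(&w->openlock);
	return rc;
}

/* Patched ordering: remember the vnode, drop both locks, then close. */
static int
close_deferred(struct wedge *w)
{
	void *tmp_vp = NULL;

	pthread_mutex_lock(&w->openlock);
	pthread_mutex_lock(&w->parent->rawlock);
	if (w->parent->rawopens-- == 1) {
		tmp_vp = w->parent->rawvp;
		w->parent->rawvp = NULL;
	}
	pthread_mutex_unlock(&w->parent->rawlock);
	pthread_mutex_unlock(&w->openlock);
	return tmp_vp != NULL ? fake_vn_close(w) : 0;
}
```

 On a freshly initialised pair, close_locked() reports the relock as
 EDEADLK, while close_deferred() completes and leaves rawvp cleared. Whether
 dropping dk_openlock before vn_close() is safe with respect to concurrent
 opens in the real driver is exactly the open question for review.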
 
 best regards
 
 W. Stukenbrock
 
 Wolfgang Stukenbrock wrote:
 
 > The following reply was made to PR kern/46879; it has been noted by GNATS.
 > 
 > From: Wolfgang Stukenbrock 
 > <wolfgang.stukenbrock%nagler-company.com@localhost>
 > To: gnats-bugs%NetBSD.org@localhost
 > Cc: 
 > Subject: Re: kern/46879: panic in reboot 5_1_STABLE in dkwedge - dead-lock 
 > detected
 > Date: Fri, 31 Aug 2012 12:36:21 +0200
 > 
 >  Hi again,
 >  
 >  
 >  I've had an additional look at it ...
 >  
 >  The problem seems to be in src/sys/dev/dkwedge/dk.c
 >  
 >  I do not understand the whole semantics of the locks (dk_openlock and 
 >  dk_rawlock), but I've noticed the following:
 >  
 >  In dkclose() the dk_openlock of the wedge is entered,
 >  and if a close of the parent is needed, the dk_rawlock of the parent 
 >  is taken - no dk_openlock is acquired on the parent.
 >  The same is done in dkwedge_del().
 >  In both functions vn_close() is called on dk_rawvp with the 
 >  dk_openlock of the wedge still held.
 >  
 >  Here the dkclose() on the mounted wedge will call raidclose() of the 
 >  underlying raid device.
 >  If doing_shutdown is set, the raid device gets destroyed and 
 >  dkwedge_delall() is called for the raid device.
 >  dkwedge_delall() now copies the information of the first wedge - if any - 
 >  into a local buffer and calls dkwedge_del() in order to destroy it.
 >  This will enter the dk_openlock on that wedge again, but we still 
 >  hold it from the earlier dkclose().
 >  
 >  This would mean that the panic is not related to a layered raid device, 
 >  as I expected before - it will happen for every raid device with a wedge 
 >  on it.
 >  
 >  I've tested this with a BSD label on sd0 and sd1,
 >  a raid device (a stripe in this case) of sd0 and sd1,
 >  and a GPT label with one wedge on this raid device.
 >  If the filesystem on the wedge is mounted when a reboot occurs, I get 
 >  the following panic (and trace) in DDB:
 >  
 >  syncing disks... done
 >  unmounting file systems...Mutex error: mutex_vector_enter: locking 
 >  against myself
 >  
 >  lock address : 0xffff80002f9ffb70
 >  current cpu  :                 10
 >  current lwp  : 0xffff800087b7b000
 >  owner field  : 0xffff800087b7b000 wait/spin:                0/0
 >  
 >  panic: lock error
 >  fatal breakpoint trap in supervisor mode
 >  trap type 1 code 0 rip ffffffff804b1045 cs 8 rflags 246 cr2 
 >  7f7ffd620030 cpl 0 rsp ffff800088062530
 >  Stopped in pid 594.1 (reboot) at        netbsd:breakpoint+0x5:  leave
 >  db{10}> trace
 >  breakpoint() at netbsd:breakpoint+0x5
 >  panic() at netbsd:panic+0x24d
 >  lockdebug_abort() at netbsd:lockdebug_abort+0x42
 >  mutex_vector_enter() at netbsd:mutex_vector_enter+0x208
 >  dkwedge_del() at netbsd:dkwedge_del+0x181
 >  dkwedge_delall() at netbsd:dkwedge_delall+0x65
 >  raidclose() at netbsd:raidclose+0x133
 >  bdev_close() at netbsd:bdev_close+0x89
 >  spec_close() at netbsd:spec_close+0x231
 >  VOP_CLOSE() at netbsd:VOP_CLOSE+0x62
 >  vn_close() at netbsd:vn_close+0x51
 >  dkclose() at netbsd:dkclose+0xcb
 >  bdev_close() at netbsd:bdev_close+0x89
 >  spec_close() at netbsd:spec_close+0x231
 >  VOP_CLOSE() at netbsd:VOP_CLOSE+0x62
 >  ffs_unmount() at netbsd:ffs_unmount+0x11e
 >  VFS_UNMOUNT() at netbsd:VFS_UNMOUNT+0x2e
 >  dounmount() at netbsd:dounmount+0xd5
 >  vfs_unmountall() at netbsd:vfs_unmountall+0x55
 >  cpu_reboot() at netbsd:cpu_reboot+0x100
 >  sys_reboot() at netbsd:sys_reboot+0x5f
 >  syscall() at netbsd:syscall+0xa0
 >  db{10}>
 >  
 >  Yes - my analysis above seems to be correct.
 >  I would tend to raise the level from serious to critical, because a 
 >  wedge on a raidframe device is not usable!
 >  
 >  I'm not sure if it would be the correct solution to release the 
 >  dk_openlock mutex prior to calling vn_close() in both functions mentioned 
 >  above. I'm also not sure if it would be possible to set dk_rawvp to NULL 
 >  prior to calling vn_close() on it - the vnode could be kept in a 
 >  temporary variable.
 >  
 >  If the answer to these questions is yes, then that would be the solution.
 >  Can someone with more knowledge of the mutex order and the semantics of 
 >  dk_rawvp have a look at this topic?
 >  Thanks in advance
 >  
 >  gnats-admin%NetBSD.org@localhost wrote:
 >  
 >  > Thank you very much for your problem report.
 >  > It has the internal identification `kern/46879'.
 >  > The individual assigned to look at your
 >  > report is: kern-bug-people. 
 >  > 
 >  > 
 >  >>Category:       kern
 >  >>Responsible:    kern-bug-people
 >  >>Synopsis:       panic in reboot 5_1_STABLE in dkwedge - dead-lock detected
 >  >>Arrival-Date:   Thu Aug 30 16:10:00 +0000 2012
 >  >>
 >  > 
 >  > 
 >  
 >  
 >  
 > 
 > 
 
 
 -- 
 
 
 Dr. Nagler & Company GmbH
 Hauptstraße 9
 92253 Schnaittenbach
 
 Tel. +49 9622/71 97-42
 Fax +49 9622/71 97-50
 
 Wolfgang.Stukenbrock%nagler-company.com@localhost
 http://www.nagler-company.com
 
 
 Hauptsitz: Schnaittenbach
 Handelregister: Amberg HRB
 Gerichtsstand: Amberg
 Steuernummer: 201/118/51825
 USt.-ID-Nummer: DE 273143997
 Geschäftsführer: Dr. Martin Nagler, Dr. Dr. Karl-Kuno Kunze
 
 
 --------------060607080304090807080703
 Content-Type: text/plain;
  name="dk.c-patch"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: inline;
  filename="dk.c-patch"
 
 --- dk.c       2012/08/31 10:57:31     1.1
 +++ dk.c       2012/08/31 11:02:46
 @@ -432,6 +432,7 @@
  dkwedge_del(struct dkwedge_info *dkw)
  {
        struct dkwedge_softc *sc = NULL;
 +      struct vnode *tmp_vp = NULL;
        u_int unit;
        int bmaj, cmaj, s;
  
 @@ -480,15 +481,15 @@
                mutex_enter(&sc->sc_parent->dk_rawlock);
                if (sc->sc_parent->dk_rawopens-- == 1) {
                        KASSERT(sc->sc_parent->dk_rawvp != NULL);
 -                      mutex_exit(&sc->sc_parent->dk_rawlock);
 -                      (void) vn_close(sc->sc_parent->dk_rawvp, FREAD | FWRITE,
 -                          NOCRED);
 +                      tmp_vp = sc->sc_parent->dk_rawvp;
                        sc->sc_parent->dk_rawvp = NULL;
 -              } else
 -                      mutex_exit(&sc->sc_parent->dk_rawlock);
 +              }
 +              mutex_exit(&sc->sc_parent->dk_rawlock);
                sc->sc_dk.dk_openmask = 0;
        }
        mutex_exit(&sc->sc_dk.dk_openlock);
 +      if (tmp_vp != NULL)
 +              (void) vn_close(tmp_vp, FREAD | FWRITE, NOCRED);
  
        /* Announce our departure. */
        aprint_normal("%s at %s (%s) deleted\n", device_xname(sc->sc_dev),
 @@ -964,7 +965,7 @@
  dkclose(dev_t dev, int flags, int fmt, struct lwp *l)
  {
        struct dkwedge_softc *sc = dkwedge_lookup(dev);
 -      int error = 0;
 +      struct vnode *tmp_vp = NULL;
  
        KASSERT(sc->sc_dk.dk_openmask != 0);
  
 @@ -981,17 +982,17 @@
                mutex_enter(&sc->sc_parent->dk_rawlock);
                if (sc->sc_parent->dk_rawopens-- == 1) {
                        KASSERT(sc->sc_parent->dk_rawvp != NULL);
 -                      mutex_exit(&sc->sc_parent->dk_rawlock);
 -                      error = vn_close(sc->sc_parent->dk_rawvp,
 -                          FREAD | FWRITE, NOCRED);
 +                      tmp_vp = sc->sc_parent->dk_rawvp;
                        sc->sc_parent->dk_rawvp = NULL;
 -              } else
 -                      mutex_exit(&sc->sc_parent->dk_rawlock);
 +              }
 +              mutex_exit(&sc->sc_parent->dk_rawlock);
        }
  
        mutex_exit(&sc->sc_dk.dk_openlock);
  
 -      return (error);
 +      if (tmp_vp != NULL)
 +              return vn_close(tmp_vp, FREAD | FWRITE, NOCRED);
 +      return 0;
  }
  
  /*
 
 --------------060607080304090807080703--

Follow-Ups:
- Re: kern/46879: panic in reboot 5_1_STABLE in dkwedge - dead-lock detected
  - From: Wolfgang Stukenbrock
