NetBSD-Bugs archive


Re: kern/53441: nouveau panic in 8.0_RC2 amd64



The following reply was made to PR kern/53441; it has been noted by GNATS.

From: Greg Oster <oster%netbsd.org@localhost>
To: kern-bug-people%netbsd.org@localhost, gnats-admin%netbsd.org@localhost,
 netbsd-bugs%netbsd.org@localhost, oster%netbsd.org@localhost
Cc: gnats-bugs%NetBSD.org@localhost
Subject: Re: kern/53441: nouveau panic in 8.0_RC2 amd64
Date: Fri, 3 Aug 2018 20:53:16 -0600

 On Fri,  3 Aug 2018 23:45:01 +0000 (UTC)
 Greg Oster <oster%netbsd.org@localhost> wrote:
 
 > The following reply was made to PR kern/53441; it has been noted by
 > GNATS.
 > 
 > From: Greg Oster <oster%netbsd.org@localhost>
 > To: gnats-bugs%NetBSD.org@localhost
 > Cc: 
 > Subject: Re: kern/53441: nouveau panic in 8.0_RC2 amd64
 > Date: Fri, 3 Aug 2018 17:40:41 -0600
 > 
 >  On Tue, 10 Jul 2018 16:15:00 +0000 (UTC)
 >  oster%netbsd.org@localhost wrote:
 >  
 >  > >Number:         53441
 >  > >Category:       kern
 >  > >Synopsis:       nouveau panic in 8.0_RC2 amd64
 >  > >Confidential:   no
 >  > >Severity:       critical
 >  > >Priority:       high
 >  > >Responsible:    kern-bug-people
 >  > >State:          open
 >  > >Class:          sw-bug
 >  > >Submitter-Id:   net
 >  > >Arrival-Date:   Tue Jul 10 16:15:00 +0000 2018
 >  > >Originator:     Greg Oster
 >  > >Release:        NetBSD 8.0_RC2
 >  > >Organization:
 >  > >Environment:    
 >  > System: NetBSD thog 8.0_RC2 NetBSD 8.0_RC2 (THOG.gdb) #0: Fri Jun 29 15:10:23 CST 2018 oster@thog:/u1/builds/build183/src/obj/amd64/u1/builds/build183/src/sys/arch/amd64/compile/THOG.gdb amd64
 >  > Architecture: x86_64
 >  > Machine: amd64
 >  > >Description:    
 >  > 
 >  > The nouveau driver occasionally panics for no good reason.  It can
 >  > panic when X11 is being used, and it can panic when no-one is on
 >  > the console.
 >  > 
 >  > Panic looks like:
 >  > 
 >  > uvm_fault(0xffffffff819b7d80, 0x0, 1) -> e
 >  > fatal page fault in supervisor mode
 >  > trap type 6 code 0 rip 0xffffffff8114d302 cs 0x8 rflags 0x10282 cr2 0x70 ilevel 0x8 rsp 0xffff80013ce5bdd0
 >  > curlwp 0xfffffe843b5a0080 pid 0.16 lowest kstack 0xffff80013ce592c0
 >  > panic: trap
 >  > cpu2: Begin traceback...
 >  > vpanic() at netbsd:vpanic+0x219
 >  > vpanic() at netbsd:vpanic
 >  > trap() at netbsd:trap+0x2b9
 >  > --- trap (number 6) ---
 >  > nouveau_fence_update() at netbsd:nouveau_fence_update+0x10
 >  > nouveau_fence_done() at netbsd:nouveau_fence_done+0x29
 >  > nouveau_bo_fence_signalled() at netbsd:nouveau_bo_fence_signalled+0x18
 >  > ttm_bo_wait() at netbsd:ttm_bo_wait+0x90
 >  > ttm_bo_cleanup_refs_and_unlock() at netbsd:ttm_bo_cleanup_refs_and_unlock+0x66
 >  > ttm_bo_delayed_delete() at netbsd:ttm_bo_delayed_delete+0x175
 >  > ttm_bo_delayed_workqueue() at netbsd:ttm_bo_delayed_workqueue+0x2b
 >  > linux_worker() at netbsd:linux_worker+0xf9
 >  > workqueue_runlist() at netbsd:workqueue_runlist+0x59
 >  > workqueue_worker() at netbsd:workqueue_worker+0xb1
 >  > cpu2: End traceback...
 >  > uvm_fault(0xfffffe842f5fd5c0, 0x0, 2) -> e
 >  > 
 >  > fatal page fault in supervisor mode
 >  > dumping to dev 0,1 (offset=8425399, size=4189705):
 >  > trap type 6 code 0x2 rip 0xffffffff80cb5d7b cs 0x8 rflags 0x10296 cr2 0x84 ilevel 0x8 rsp 0xffff800d1u4m2p4 b2b90
 >  > curlwp 0xfffffe8403f36120 pid 885.2 lowest kstack 0xffff8001424b02c0
 >  > coretemp0: workqueue busy: updates stopped
 >  > coretemp1: workqueue busy: updates stopped
 >  > coretemp2: workqueue busy: updates stopped
 >  > coretemp3: workqueue busy: updates stopped
 >  > 
 >  > 
 >  >   
 >  > >How-To-Repeat:    
 >  > 
 >  > Run the nouveau driver on NetBSD-8.0_RC2/amd64 using an NVIDIA
 >  > GeForce GT 420:
 >  > ...
 >  > pci1 at ppb0 bus 1
 >  > pci1: i/o space, memory space enabled, rd/line, wr/inv ok
 >  > nouveau0 at pci1 dev 0 function 0: vendor 10de product 0de2 (rev. 0xa1)
 >  > drm kern info: nouveau  [  DEVICE][nouveau0] BOOT0  : 0x0c1100a1
 >  > drm kern info: nouveau  [  DEVICE][nouveau0] Chipset: GF108 (NVC1)
 >  > drm kern info: nouveau  [  DEVICE][nouveau0] Family : NVC0
 >  > drm kern info: nouveau  [   VBIOS][nouveau0] checking PRAMIN for image...
 >  > drm kern info: nouveau  [   VBIOS][nouveau0] ... appears to be valid
 >  > drm kern info: nouveau  [   VBIOS][nouveau0] using image from PRAMIN
 >  > drm kern info: nouveau  [   VBIOS][nouveau0] BIT signature found
 >  > drm kern info: nouveau  [   VBIOS][nouveau0] version 70.08.1f.00.0c
 >  > nouveau0: interrupting at ioapic0 pin 16 (nouveau)
 >  > drm kern warning: nouveau W[     PFB][nouveau0][0x00000000][0xfffffe811d51b808] reclocking of this ram type unsupported
 >  > drm kern info: nouveau  [     PFB][nouveau0] RAM type: DDR3
 >  > drm kern info: nouveau  [     PFB][nouveau0] RAM size: 512 MiB
 >  > drm kern info: nouveau  [     PFB][nouveau0]    ZCOMP: 0 tags
 >  > drm kern info: nouveau  [    VOLT][nouveau0] GPU voltage: 900000uv
 >  > drm kern info: nouveau  [  PTHERM][nouveau0] FAN control: PWM
 >  > drm kern info: nouveau  [  PTHERM][nouveau0] fan management: automatic
 >  > drm kern info: nouveau  [  PTHERM][nouveau0] internal sensor: yes
 >  > drm kern info: nouveau  [     CLK][nouveau0] 03: core 50 MHz memory 135 MHz
 >  > drm kern info: nouveau  [     CLK][nouveau0] 07: core 405 MHz memory 324 MHz
 >  > drm kern info: nouveau  [     CLK][nouveau0] 0f: core 700 MHz memory 800 MHz
 >  > drm kern info: nouveau  [     CLK][nouveau0] --: core 405 MHz memory 324 MHz
 >  > Zone  kernel: Available graphics memory: 5504634 kiB
 >  > Zone   dma32: Available graphics memory: 2097152 kiB
 >  > drm kern info: nouveau  [     DRM] VRAM: 512 MiB
 >  > drm kern info: nouveau  [     DRM] GART: 1048576 MiB
 >  > drm kern info: nouveau  [     DRM] TMDS table version 2.0
 >  > drm kern info: nouveau  [     DRM] DCB version 4.0
 >  > drm kern info: nouveau  [     DRM] DCB outp 00: 01800302 00020030
 >  > drm kern info: nouveau  [     DRM] DCB outp 01: 02000300 00000000
 >  > drm kern info: nouveau  [     DRM] DCB outp 02: 08811392 00020020
 >  > drm kern info: nouveau  [     DRM] DCB outp 03: 04822310 00000000
 >  > drm kern info: nouveau  [     DRM] DCB conn 00: 00001030
 >  > drm kern info: nouveau  [     DRM] DCB conn 01: 00002161
 >  > drm kern info: nouveau  [     DRM] DCB conn 02: 00000200
 >  > drm: Supports vblank timestamp caching Rev 2 (21.10.2013).
 >  > drm: Driver supports precise vblank timestamp query.
 >  > drm kern info: nouveau  [     DRM] MM: using COPY0 for buffer copies
 >  > nouveaufb0 at nouveau0
 >  > nouveau0: info: registered panic notifier
 >  > nouveaufb0: framebuffer at 0xffff8001400b4000, size 1920x1200, depth 32, stride 7680
 >  > ...
 >  > 
 >  > 
 >  > and then wait for the boom.  The panic may happen in hours or days.
 >  > 
 >  >   
 >  > >Fix:    
 >  >   Please.  I have a kernel with full debug symbols and a couple of
 >  > crash dumps related to this if someone wants additional information
 >  > from them.  
 >  
 >  Traceback from gdb kernel:
 >  
 >  (gdb) bt
 >  #0  cpu_reboot (howto=260, bootstr=0x0)
 >      at /u1/builds/build185/src/sys/arch/amd64/amd64/machdep.c:710
 >  #1  0xffffffff80ceece2 in vpanic (fmt=0xffffffff81207070 "trap", 
 >      ap=0xffff80013ce5bbb8)
 >      at /u1/builds/build185/src/sys/kern/subr_prf.c:342
 >  #2  0xffffffff80ceeaba in panic (fmt=0xffffffff81207070 "trap")
 >      at /u1/builds/build185/src/sys/kern/subr_prf.c:258
 >  #3  0xffffffff80228bfd in trap (frame=0xffff80013ce5bce0)
 >      at /u1/builds/build185/src/sys/arch/amd64/amd64/trap.c:336
 >  #4  0xffffffff8021f61f in alltraps ()
 >  #5  0xffffffff8114d577 in nouveau_fence_update (chan=0x0)
 >      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_fence.c:132
 >  #6  0xffffffff8114d72d in nouveau_fence_done (fence=0xfffffe834add5c48)
 >      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_fence.c:171
 >  #7  0xffffffff811419f5 in nouveau_bo_fence_signalled (sync_obj=0xfffffe834add5c48)
 >      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_bo.c:1566
 >  #8  0xffffffff8119841a in ttm_bo_wait (bo=0xfffffe82f9fc0408,
 >      lazy=false, interruptible=false, no_wait=true)
 >      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:1671
 >  #9  0xffffffff81195d15 in ttm_bo_cleanup_refs_and_unlock (bo=0xfffffe82f9fc0408,
 >      interruptible=false, no_wait_gpu=true)
 >      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:516
 >  #10 0xffffffff81196108 in ttm_bo_delayed_delete (bdev=0xfffffe811d500160,
 >      remove_all=false)
 >      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:621
 >  #11 0xffffffff811961da in ttm_bo_delayed_workqueue (work=0xfffffe811d500520)
 >      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:650
 >  #12 0xffffffff80abf6a9 in linux_worker (wk=0xfffffe811d500520,
 >      arg=0xfffffe843e620f80)
 >      at /u1/builds/build185/src/sys/external/bsd/common/linux/linux_work.c:505
 >  #13 0xffffffff80cf85ef in workqueue_runlist (wq=0xfffffe843b5b7d00,
 >      list=0xfffffe843b5b7d70)
 >      at /u1/builds/build185/src/sys/kern/subr_workqueue.c:106
 >  #14 0xffffffff80cf86b2 in workqueue_worker (cookie=0xfffffe843b5b7d00)
 >      at /u1/builds/build185/src/sys/kern/subr_workqueue.c:133
 >  #15 0xffffffff80208747 in lwp_trampoline ()
 >  #16 0x0000000000000000 in ?? ()
 >  (gdb) ...
 >  (gdb) list
 >  166
 >  167     bool
 >  168     nouveau_fence_done(struct nouveau_fence *fence)
 >  169     {
 >  170             if (fence->channel)
 >  171                     nouveau_fence_update(fence->channel);
 >  172             return !fence->channel;
 >  173     }
 >  174
 >  175     static int
 >  (gdb) down
 >  #5  0xffffffff8114d577 in nouveau_fence_update (chan=0x0)
 >      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_fence.c:132
 >  132             struct nouveau_fence_chan *fctx = chan->fence;
 >  (gdb) list
 >  127     }
 >  128
 >  129     static void
 >  130     nouveau_fence_update(struct nouveau_channel *chan)
 >  131     {
 >  132             struct nouveau_fence_chan *fctx = chan->fence;
 >  133             struct nouveau_fence *fence, *fnext;
 >  134
 >  135             spin_lock(&fctx->lock);
 >  136             list_for_each_entry_safe(fence, fnext, &fctx->pending, head) {
 >  (gdb) print chan
 >  $11 = (struct nouveau_channel *) 0x0
 >  (gdb) 
 >  
 >  "huh?"
 >  
 >  We just checked fence->channel for non-zero before the call to
 >  nouveau_fence_update(), and now it's suddenly zero?  Methinks there 
 >  are some locking issues happening here if the rug is getting pulled
 >  out that fast!  Also: are there other uses of fence->channel where it
 >  could suddenly change from something to 0 and cause issues?
 >  
 >  (the machine worked fine for 8 days before this panic...)
 >  
 >  Later...
 >  
 >  Greg Oster
 >  
 
 Just fell over again.. so twice now today.  Seems there are (at least)
 two different failure modes - one where I can get a kernel trace, and
 one where it's a fast trip to reboot.... 
 
 uvm_fault(0xffffffff819b7d80, 0x0, 1) -> e
 fatal page fault in supervisor mode
 trap type 6 code 0 rip 0xffffffff8114d577 cs 0x8 rflags 0x10282 cr2 0x70 ilevel 0x8 rsp 0xffff80013ce5bdd0
 curlwp 0xfffffe843b5a0080 pid 0.16 lowest kstack 0xffff80013ce592c0
 panic: trap
 cpu1: Begin traceback...
 vpanic() at netbsd:vpanic+0x219
 vpanic() at netbsd:vpanic
 trap() at netbsd:trap+0x2b9
 --- trap (number 6) ---
 nouveau_fence_update() at netbsd:nouveau_fence_update+0x10
 nouveau_fence_done() at netbsd:nouveau_fence_done+0x29
 nouveau_bo_fence_signalled() at netbsd:nouveau_bo_fence_signalled+0x18
 ttm_bo_wait() at netbsd:ttm_bo_wait+0x90
 ttm_bo_cleanup_refs_and_unlock() at netbsd:ttm_bo_cleanup_refs_and_unlock+0x66
 ttm_bo_delayed_delete() at netbsd:ttm_bo_delayed_delete+0x175
 ttm_bo_delayed_workqueue() at netbsd:ttm_bo_delayed_workqueue+0x2b
 linux_worker() at netbsd:linux_worker+0xf9
 workqueue_runlist() at netbsd:workqueue_runlist+0x59
 workqueue_worker() at netbsd:workqueue_worker+0xb1
 cpu1: End traceback...
 
 
 Later...
 
 Greg Oster
 

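At the source level, the nouveau_fence_done() listing quoted in the message reads fence->channel twice: once for the test and once as the call argument. If something clears the pointer between those two reads, nouveau_fence_update() would be handed NULL, which would be consistent with the chan=0x0 that gdb reports in frame #5. What follows is a minimal userland sketch of one possible mitigation, not the driver's code and not a confirmed fix: it reads the pointer once into a local and uses only that copy. The types (struct channel, struct fence) and the names fence_update and fence_done_snapshot are hypothetical stand-ins. The snapshot only avoids the NULL dereference; it does nothing to keep the channel alive, so real driver code would presumably still need a lock or a reference protecting the channel's lifetime.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct nouveau_channel; not the real layout. */
struct channel {
    int updates;                        /* counts calls to fence_update() */
};

/* Hypothetical stand-in for struct nouveau_fence. */
struct fence {
    _Atomic(struct channel *) channel;  /* cleared to NULL once signalled */
};

/* Stand-in for nouveau_fence_update(); it dereferences its argument. */
static void
fence_update(struct channel *chan)
{
    assert(chan != NULL);               /* the invariant the panic violated */
    chan->updates++;
}

/*
 * Snapshot variant of the check-then-use pattern: the pointer is loaded
 * exactly once, so the value tested is the value passed on.  A concurrent
 * clear of f->channel can no longer turn the argument into NULL.
 */
static bool
fence_done_snapshot(struct fence *f)
{
    struct channel *chan = atomic_load(&f->channel);

    if (chan != NULL)
        fence_update(chan);
    /* Re-read for the result, as the original re-reads fence->channel. */
    return atomic_load(&f->channel) == NULL;
}
```

With a fence whose channel pointer is set, fence_done_snapshot() calls fence_update() once and returns false; after the pointer is cleared it skips the update and returns true.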
