NetBSD-Bugs archive


Re: kern/53441: nouveau panic in 8.0_RC2 amd64



On Fri,  3 Aug 2018 23:45:01 +0000 (UTC)
Greg Oster <oster%netbsd.org@localhost> wrote:

> The following reply was made to PR kern/53441; it has been noted by
> GNATS.
> 
> From: Greg Oster <oster%netbsd.org@localhost>
> To: gnats-bugs%NetBSD.org@localhost
> Cc: 
> Subject: Re: kern/53441: nouveau panic in 8.0_RC2 amd64
> Date: Fri, 3 Aug 2018 17:40:41 -0600
> 
>  On Tue, 10 Jul 2018 16:15:00 +0000 (UTC)
>  oster%netbsd.org@localhost wrote:
>  
>  > >Number:         53441
>  > >Category:       kern
>  > >Synopsis:       nouveau panic in 8.0_RC2 amd64
>  > >Confidential:   no
>  > >Severity:       critical
>  > >Priority:       high
>  > >Responsible:    kern-bug-people
>  > >State:          open
>  > >Class:          sw-bug
>  > >Submitter-Id:   net
>  > >Arrival-Date:   Tue Jul 10 16:15:00 +0000 2018
>  > >Originator:     Greg Oster
>  > >Release:        NetBSD 8.0_RC2
>  > >Organization:
>  > >Environment:    
>  > System: NetBSD thog 8.0_RC2 NetBSD 8.0_RC2 (THOG.gdb) #0: Fri Jun 29 15:10:23 CST 2018 oster@thog:/u1/builds/build183/src/obj/amd64/u1/builds/build183/src/sys/arch/amd64/compile/THOG.gdb amd64
>  > Architecture: x86_64
>  > Machine: amd64
>  > >Description:    
>  > 
>  > The nouveau driver occasionally panics for no good reason.  It can
>  > panic when X11 is being used, and it can panic when no-one is on
>  > the console.
>  > 
>  > Panic looks like:
>  > 
>  > uvm_fault(0xffffffff819b7d80, 0x0, 1) -> e
>  > fatal page fault in supervisor mode
>  > trap type 6 code 0 rip 0xffffffff8114d302 cs 0x8 rflags 0x10282 cr2 0x70 ilevel 0x8 rsp 0xffff80013ce5bdd0
>  > curlwp 0xfffffe843b5a0080 pid 0.16 lowest kstack 0xffff80013ce592c0
>  > panic: trap
>  > cpu2: Begin traceback...
>  > vpanic() at netbsd:vpanic+0x219
>  > vpanic() at netbsd:vpanic
>  > trap() at netbsd:trap+0x2b9
>  > --- trap (number 6) ---
>  > nouveau_fence_update() at netbsd:nouveau_fence_update+0x10
>  > nouveau_fence_done() at netbsd:nouveau_fence_done+0x29
>  > nouveau_bo_fence_signalled() at netbsd:nouveau_bo_fence_signalled+0x18
>  > ttm_bo_wait() at netbsd:ttm_bo_wait+0x90
>  > ttm_bo_cleanup_refs_and_unlock() at netbsd:ttm_bo_cleanup_refs_and_unlock+0x66
>  > ttm_bo_delayed_delete() at netbsd:ttm_bo_delayed_delete+0x175
>  > ttm_bo_delayed_workqueue() at netbsd:ttm_bo_delayed_workqueue+0x2b
>  > linux_worker() at netbsd:linux_worker+0xf9
>  > workqueue_runlist() at netbsd:workqueue_runlist+0x59
>  > workqueue_worker() at netbsd:workqueue_worker+0xb1
>  > cpu2: End traceback...
>  > uvm_fault(0xfffffe842f5fd5c0, 0x0, 2) -> e
>  > 
>  > fatal page fault in supervisor mode
>  > dumping to dev 0,1 (offset=8425399, size=4189705):
>  > trap type 6 code 0x2 rip 0xffffffff80cb5d7b cs 0x8 rflags 0x10296
>  > cr2 0x84 ilevel 0x8 rsp 0xffff800d1u4m2p4 b2b90 curlwp
>  > 0xfffffe8403f36120 pid 885.2 lowest kstack 0xffff8001424b02c0
>  > coretemp0: workqueue busy: updates stopped
>  > coretemp1: workqueue busy: updates stopped
>  > coretemp2: workqueue busy: updates stopped
>  > coretemp3: workqueue busy: updates stopped
>  > 
>  > 
>  >   
>  > >How-To-Repeat:    
>  > 
>  > Run the nouveau driver on NetBSD-8.0_RC2/amd64 using an NVIDIA
>  > GeForce GT 420:
>  > ...
>  > pci1 at ppb0 bus 1
>  > pci1: i/o space, memory space enabled, rd/line, wr/inv ok
>  > nouveau0 at pci1 dev 0 function 0: vendor 10de product 0de2 (rev. 0xa1)
>  > drm kern info: nouveau  [  DEVICE][nouveau0] BOOT0  : 0x0c1100a1
>  > drm kern info: nouveau  [  DEVICE][nouveau0] Chipset: GF108 (NVC1)
>  > drm kern info: nouveau  [  DEVICE][nouveau0] Family : NVC0
>  > drm kern info: nouveau  [   VBIOS][nouveau0] checking PRAMIN for image...
>  > drm kern info: nouveau  [   VBIOS][nouveau0] ... appears to be valid
>  > drm kern info: nouveau  [   VBIOS][nouveau0] using image from PRAMIN
>  > drm kern info: nouveau  [   VBIOS][nouveau0] BIT signature found
>  > drm kern info: nouveau  [   VBIOS][nouveau0] version 70.08.1f.00.0c
>  > nouveau0: interrupting at ioapic0 pin 16 (nouveau)
>  > drm kern warning: nouveau W[     PFB][nouveau0][0x00000000][0xfffffe811d51b808] reclocking of this ram type unsupported
>  > drm kern info: nouveau  [     PFB][nouveau0] RAM type: DDR3
>  > drm kern info: nouveau  [     PFB][nouveau0] RAM size: 512 MiB
>  > drm kern info: nouveau  [     PFB][nouveau0]    ZCOMP: 0 tags
>  > drm kern info: nouveau  [    VOLT][nouveau0] GPU voltage: 900000uv
>  > drm kern info: nouveau  [  PTHERM][nouveau0] FAN control: PWM
>  > drm kern info: nouveau  [  PTHERM][nouveau0] fan management: automatic
>  > drm kern info: nouveau  [  PTHERM][nouveau0] internal sensor: yes
>  > drm kern info: nouveau  [     CLK][nouveau0] 03: core 50 MHz memory 135 MHz
>  > drm kern info: nouveau  [     CLK][nouveau0] 07: core 405 MHz memory 324 MHz
>  > drm kern info: nouveau  [     CLK][nouveau0] 0f: core 700 MHz memory 800 MHz
>  > drm kern info: nouveau  [     CLK][nouveau0] --: core 405 MHz memory 324 MHz
>  > Zone  kernel: Available graphics memory: 5504634 kiB
>  > Zone   dma32: Available graphics memory: 2097152 kiB
>  > drm kern info: nouveau  [     DRM] VRAM: 512 MiB
>  > drm kern info: nouveau  [     DRM] GART: 1048576 MiB
>  > drm kern info: nouveau  [     DRM] TMDS table version 2.0
>  > drm kern info: nouveau  [     DRM] DCB version 4.0
>  > drm kern info: nouveau  [     DRM] DCB outp 00: 01800302 00020030
>  > drm kern info: nouveau  [     DRM] DCB outp 01: 02000300 00000000
>  > drm kern info: nouveau  [     DRM] DCB outp 02: 08811392 00020020
>  > drm kern info: nouveau  [     DRM] DCB outp 03: 04822310 00000000
>  > drm kern info: nouveau  [     DRM] DCB conn 00: 00001030
>  > drm kern info: nouveau  [     DRM] DCB conn 01: 00002161
>  > drm kern info: nouveau  [     DRM] DCB conn 02: 00000200
>  > drm: Supports vblank timestamp caching Rev 2 (21.10.2013).
>  > drm: Driver supports precise vblank timestamp query.
>  > drm kern info: nouveau  [     DRM] MM: using COPY0 for buffer copies
>  > nouveaufb0 at nouveau0
>  > nouveau0: info: registered panic notifier
>  > nouveaufb0: framebuffer at 0xffff8001400b4000, size 1920x1200, depth 32, stride 7680
>  > ...
>  > 
>  > 
>  > and then wait for the boom.  The panic may happen in hours or days.
>  > 
>  >   
>  > >Fix:    
>  >   Please.  I have a kernel with full debug symbols and a couple of
>  > crash dumps related to this if someone wants additional information
>  > from them.  
>  
>  Traceback from gdb kernel:
>  
>  (gdb) bt
>  #0  cpu_reboot (howto=260, bootstr=0x0)
>      at /u1/builds/build185/src/sys/arch/amd64/amd64/machdep.c:710
>  #1  0xffffffff80ceece2 in vpanic (fmt=0xffffffff81207070 "trap", ap=0xffff80013ce5bbb8)
>      at /u1/builds/build185/src/sys/kern/subr_prf.c:342
>  #2  0xffffffff80ceeaba in panic (fmt=0xffffffff81207070 "trap")
>      at /u1/builds/build185/src/sys/kern/subr_prf.c:258
>  #3  0xffffffff80228bfd in trap (frame=0xffff80013ce5bce0)
>      at /u1/builds/build185/src/sys/arch/amd64/amd64/trap.c:336
>  #4  0xffffffff8021f61f in alltraps ()
>  #5  0xffffffff8114d577 in nouveau_fence_update (chan=0x0)
>      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_fence.c:132
>  #6  0xffffffff8114d72d in nouveau_fence_done (fence=0xfffffe834add5c48)
>      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_fence.c:171
>  #7  0xffffffff811419f5 in nouveau_bo_fence_signalled (sync_obj=0xfffffe834add5c48)
>      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_bo.c:1566
>  #8  0xffffffff8119841a in ttm_bo_wait (bo=0xfffffe82f9fc0408, lazy=false, interruptible=false, no_wait=true)
>      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:1671
>  #9  0xffffffff81195d15 in ttm_bo_cleanup_refs_and_unlock (bo=0xfffffe82f9fc0408, interruptible=false, no_wait_gpu=true)
>      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:516
>  #10 0xffffffff81196108 in ttm_bo_delayed_delete (bdev=0xfffffe811d500160, remove_all=false)
>      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:621
>  #11 0xffffffff811961da in ttm_bo_delayed_workqueue (work=0xfffffe811d500520)
>      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:650
>  #12 0xffffffff80abf6a9 in linux_worker (wk=0xfffffe811d500520, arg=0xfffffe843e620f80)
>      at /u1/builds/build185/src/sys/external/bsd/common/linux/linux_work.c:505
>  #13 0xffffffff80cf85ef in workqueue_runlist (wq=0xfffffe843b5b7d00, list=0xfffffe843b5b7d70)
>      at /u1/builds/build185/src/sys/kern/subr_workqueue.c:106
>  #14 0xffffffff80cf86b2 in workqueue_worker (cookie=0xfffffe843b5b7d00)
>      at /u1/builds/build185/src/sys/kern/subr_workqueue.c:133
>  #15 0xffffffff80208747 in lwp_trampoline ()
>  #16 0x0000000000000000 in ?? ()
>  (gdb) ...
>  (gdb) list
>  166
>  167     bool
>  168     nouveau_fence_done(struct nouveau_fence *fence)
>  169     {
>  170             if (fence->channel)
>  171                     nouveau_fence_update(fence->channel);
>  172             return !fence->channel;
>  173     }
>  174
>  175     static int
>  (gdb) down
>  #5  0xffffffff8114d577 in nouveau_fence_update (chan=0x0)
>      at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_fence.c:132
>  132             struct nouveau_fence_chan *fctx = chan->fence;
>  (gdb) list
>  127     }
>  128
>  129     static void
>  130     nouveau_fence_update(struct nouveau_channel *chan)
>  131     {
>  132             struct nouveau_fence_chan *fctx = chan->fence;
>  133             struct nouveau_fence *fence, *fnext;
>  134
>  135             spin_lock(&fctx->lock);
>  136             list_for_each_entry_safe(fence, fnext, &fctx->pending, head) {
>  (gdb) print chan
>  $11 = (struct nouveau_channel *) 0x0
>  (gdb) 
>  
>  "huh?"
>  
>  We just checked fence->channel for non-zero before the call to
>  nouveau_fence_update(), and now it's suddenly zero?  Methinks there 
>  are some locking issues happening here if the rug is getting pulled
>  out that fast!  Also: are there other uses of fence->channel where it
>  could suddenly change from something to 0 and cause issues?
>  
>  (the machine worked fine for 8 days before this panic...)
>  
>  Later...
>  
>  Greg Oster
>  
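
The check-then-use concern raised in the quoted analysis can be sketched in standalone C. This is only an illustration under stated assumptions: `struct channel`, `struct fence`, `fence_update()`, `fence_done_racy()` and `fence_done_snapshot()` are hypothetical stand-ins, not the driver's real types and not a proposed fix. The racy variant reads the pointer once for the test and again for the call, as in the gdb listing; the snapshot variant reads it once and only uses the local copy.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real driver structures are much larger. */
struct channel {
	int updates;
};

struct fence {
	_Atomic(struct channel *) channel;	/* may be cleared by another thread */
};

/* Stand-in for the update routine: it dereferences its argument. */
void
fence_update(struct channel *chan)
{
	assert(chan != NULL);	/* the kernel code has no such check and faults */
	chan->updates++;
}

/*
 * Shape of the code in the listing: the pointer is loaded for the test
 * and loaded again for the call, so a concurrent clear between the two
 * loads hands NULL to fence_update().
 */
bool
fence_done_racy(struct fence *fence)
{
	if (atomic_load(&fence->channel) != NULL)
		fence_update(atomic_load(&fence->channel));
	return atomic_load(&fence->channel) == NULL;
}

/*
 * Snapshot variant: load once, test and use the same local value.
 * Assumption: the channel object itself stays valid for the duration
 * of the call.
 */
bool
fence_done_snapshot(struct fence *fence)
{
	struct channel *chan = atomic_load(&fence->channel);

	if (chan != NULL)
		fence_update(chan);
	return atomic_load(&fence->channel) == NULL;
}
```

Reading the pointer once only removes the second load. If the channel can also be torn down or freed concurrently, the snapshot would still need a lock or a held reference around the call, which is the locking question the analysis raises.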

Just fell over again, so twice now today.  There seem to be (at least)
two different failure modes: one where I can get a kernel trace, and
one where it's a fast trip to reboot.

uvm_fault(0xffffffff819b7d80, 0x0, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip 0xffffffff8114d577 cs 0x8 rflags 0x10282 cr2 0x70 ilevel 0x8 rsp 0xffff80013ce5bdd0
curlwp 0xfffffe843b5a0080 pid 0.16 lowest kstack 0xffff80013ce592c0
panic: trap
cpu1: Begin traceback...
vpanic() at netbsd:vpanic+0x219
vpanic() at netbsd:vpanic
trap() at netbsd:trap+0x2b9
--- trap (number 6) ---
nouveau_fence_update() at netbsd:nouveau_fence_update+0x10
nouveau_fence_done() at netbsd:nouveau_fence_done+0x29
nouveau_bo_fence_signalled() at netbsd:nouveau_bo_fence_signalled+0x18
ttm_bo_wait() at netbsd:ttm_bo_wait+0x90
ttm_bo_cleanup_refs_and_unlock() at netbsd:ttm_bo_cleanup_refs_and_unlock+0x66
ttm_bo_delayed_delete() at netbsd:ttm_bo_delayed_delete+0x175
ttm_bo_delayed_workqueue() at netbsd:ttm_bo_delayed_workqueue+0x2b
linux_worker() at netbsd:linux_worker+0xf9
workqueue_runlist() at netbsd:workqueue_runlist+0x59
workqueue_worker() at netbsd:workqueue_worker+0xb1
cpu1: End traceback...


Later...

Greg Oster


