NetBSD-Bugs archive


Re: kern/53441: nouveau panic in 8.0_RC2 amd64



The following reply was made to PR kern/53441; it has been noted by GNATS.

From: Greg Oster <oster%netbsd.org@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: 
Subject: Re: kern/53441: nouveau panic in 8.0_RC2 amd64
Date: Fri, 3 Aug 2018 17:40:41 -0600

 On Tue, 10 Jul 2018 16:15:00 +0000 (UTC)
 oster%netbsd.org@localhost wrote:
 
 > >Number:         53441
 > >Category:       kern
 > >Synopsis:       nouveau panic in 8.0_RC2 amd64
 > >Confidential:   no
 > >Severity:       critical
 > >Priority:       high
 > >Responsible:    kern-bug-people
 > >State:          open
 > >Class:          sw-bug
 > >Submitter-Id:   net
 > >Arrival-Date:   Tue Jul 10 16:15:00 +0000 2018
 > >Originator:     Greg Oster
 > >Release:        NetBSD 8.0_RC2
 > >Organization:
 > >Environment:  
 > System: NetBSD thog 8.0_RC2 NetBSD 8.0_RC2 (THOG.gdb) #0: Fri Jun 29
 > 15:10:23 CST 2018
 > oster@thog:/u1/builds/build183/src/obj/amd64/u1/builds/build183/src/sys/arch/amd64/compile/THOG.gdb
 > amd64 Architecture: x86_64 Machine: amd64
 > >Description:  
 > 
 > The nouveau driver occasionally panics for no good reason.  It can
 > panic when X11 is being used, and it can panic when no-one is on the
 > console.
 > 
 > Panic looks like:
 > 
 > uvm_fault(0xffffffff819b7d80, 0x0, 1) -> e
 > fatal page fault in supervisor mode
 > trap type 6 code 0 rip 0xffffffff8114d302 cs 0x8 rflags 0x10282 cr2 0x70 ilevel 0x8 rsp 0xffff80013ce5bdd0
 > curlwp 0xfffffe843b5a0080 pid 0.16 lowest kstack 0xffff80013ce592c0
 > panic: trap
 > cpu2: Begin traceback...
 > vpanic() at netbsd:vpanic+0x219
 > vpanic() at netbsd:vpanic
 > trap() at netbsd:trap+0x2b9
 > --- trap (number 6) ---
 > nouveau_fence_update() at netbsd:nouveau_fence_update+0x10
 > nouveau_fence_done() at netbsd:nouveau_fence_done+0x29
 > nouveau_bo_fence_signalled() at netbsd:nouveau_bo_fence_signalled+0x18
 > ttm_bo_wait() at netbsd:ttm_bo_wait+0x90
 > ttm_bo_cleanup_refs_and_unlock() at netbsd:ttm_bo_cleanup_refs_and_unlock+0x66
 > ttm_bo_delayed_delete() at netbsd:ttm_bo_delayed_delete+0x175
 > ttm_bo_delayed_workqueue() at netbsd:ttm_bo_delayed_workqueue+0x2b
 > linux_worker() at netbsd:linux_worker+0xf9
 > workqueue_runlist() at netbsd:workqueue_runlist+0x59
 > workqueue_worker() at netbsd:workqueue_worker+0xb1
 > cpu2: End traceback...
 > uvm_fault(0xfffffe842f5fd5c0, 0x0, 2) -> e
 > 
 > fatal page fault in supervisor mode
 > dumping to dev 0,1 (offset=8425399, size=4189705):
 > trap type 6 code 0x2 rip 0xffffffff80cb5d7b cs 0x8 rflags 0x10296 cr2 0x84 ilevel 0x8 rsp 0xffff800d1u4m2p4 b2b90
 > curlwp 0xfffffe8403f36120 pid 885.2 lowest kstack 0xffff8001424b02c0
 > coretemp0: workqueue busy: updates stopped
 > coretemp1: workqueue busy: updates stopped
 > coretemp2: workqueue busy: updates stopped
 > coretemp3: workqueue busy: updates stopped
 > 
 > 
 > 
 > >How-To-Repeat:  
 > 
 > Run the nouveau driver on NetBSD-8.0_RC2/amd64 using an NVIDIA GeForce
 > GT 420:
 > ...
 > pci1 at ppb0 bus 1
 > pci1: i/o space, memory space enabled, rd/line, wr/inv ok
 > nouveau0 at pci1 dev 0 function 0: vendor 10de product 0de2 (rev. 0xa1)
 > drm kern info: nouveau  [  DEVICE][nouveau0] BOOT0  : 0x0c1100a1
 > drm kern info: nouveau  [  DEVICE][nouveau0] Chipset: GF108 (NVC1)
 > drm kern info: nouveau  [  DEVICE][nouveau0] Family : NVC0
 > drm kern info: nouveau  [   VBIOS][nouveau0] checking PRAMIN for image...
 > drm kern info: nouveau  [   VBIOS][nouveau0] ... appears to be valid
 > drm kern info: nouveau  [   VBIOS][nouveau0] using image from PRAMIN
 > drm kern info: nouveau  [   VBIOS][nouveau0] BIT signature found
 > drm kern info: nouveau  [   VBIOS][nouveau0] version 70.08.1f.00.0c
 > nouveau0: interrupting at ioapic0 pin 16 (nouveau)
 > drm kern warning: nouveau W[     PFB][nouveau0][0x00000000][0xfffffe811d51b808] reclocking of this ram type unsupported
 > drm kern info: nouveau  [     PFB][nouveau0] RAM type: DDR3
 > drm kern info: nouveau  [     PFB][nouveau0] RAM size: 512 MiB
 > drm kern info: nouveau  [     PFB][nouveau0]    ZCOMP: 0 tags
 > drm kern info: nouveau  [    VOLT][nouveau0] GPU voltage: 900000uv
 > drm kern info: nouveau  [  PTHERM][nouveau0] FAN control: PWM
 > drm kern info: nouveau  [  PTHERM][nouveau0] fan management: automatic
 > drm kern info: nouveau  [  PTHERM][nouveau0] internal sensor: yes
 > drm kern info: nouveau  [     CLK][nouveau0] 03: core 50 MHz memory 135 MHz
 > drm kern info: nouveau  [     CLK][nouveau0] 07: core 405 MHz memory 324 MHz
 > drm kern info: nouveau  [     CLK][nouveau0] 0f: core 700 MHz memory 800 MHz
 > drm kern info: nouveau  [     CLK][nouveau0] --: core 405 MHz memory 324 MHz
 > Zone  kernel: Available graphics memory: 5504634 kiB
 > Zone   dma32: Available graphics memory: 2097152 kiB
 > drm kern info: nouveau  [     DRM] VRAM: 512 MiB
 > drm kern info: nouveau  [     DRM] GART: 1048576 MiB
 > drm kern info: nouveau  [     DRM] TMDS table version 2.0
 > drm kern info: nouveau  [     DRM] DCB version 4.0
 > drm kern info: nouveau  [     DRM] DCB outp 00: 01800302 00020030
 > drm kern info: nouveau  [     DRM] DCB outp 01: 02000300 00000000
 > drm kern info: nouveau  [     DRM] DCB outp 02: 08811392 00020020
 > drm kern info: nouveau  [     DRM] DCB outp 03: 04822310 00000000
 > drm kern info: nouveau  [     DRM] DCB conn 00: 00001030
 > drm kern info: nouveau  [     DRM] DCB conn 01: 00002161
 > drm kern info: nouveau  [     DRM] DCB conn 02: 00000200
 > drm: Supports vblank timestamp caching Rev 2 (21.10.2013).
 > drm: Driver supports precise vblank timestamp query.
 > drm kern info: nouveau  [     DRM] MM: using COPY0 for buffer copies
 > nouveaufb0 at nouveau0
 > nouveau0: info: registered panic notifier
 > nouveaufb0: framebuffer at 0xffff8001400b4000, size 1920x1200, depth 32, stride 7680
 > ...
 > 
 > 
 > and then wait for the boom.  The panic may happen in hours or days.
 > 
 > 
 > >Fix:  
 >   Please.  I have a kernel with full debug symbols and a couple of
 > crash dumps related to this if someone wants additional information
 > from them.
 
 Traceback from gdb kernel:
 
 (gdb) bt
 #0  cpu_reboot (howto=260, bootstr=0x0)
     at /u1/builds/build185/src/sys/arch/amd64/amd64/machdep.c:710
 #1  0xffffffff80ceece2 in vpanic (fmt=0xffffffff81207070 "trap", ap=0xffff80013ce5bbb8)
     at /u1/builds/build185/src/sys/kern/subr_prf.c:342
 #2  0xffffffff80ceeaba in panic (fmt=0xffffffff81207070 "trap")
     at /u1/builds/build185/src/sys/kern/subr_prf.c:258
 #3  0xffffffff80228bfd in trap (frame=0xffff80013ce5bce0)
     at /u1/builds/build185/src/sys/arch/amd64/amd64/trap.c:336
 #4  0xffffffff8021f61f in alltraps ()
 #5  0xffffffff8114d577 in nouveau_fence_update (chan=0x0)
     at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_fence.c:132
 #6  0xffffffff8114d72d in nouveau_fence_done (fence=0xfffffe834add5c48)
     at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_fence.c:171
 #7  0xffffffff811419f5 in nouveau_bo_fence_signalled (sync_obj=0xfffffe834add5c48)
     at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_bo.c:1566
 #8  0xffffffff8119841a in ttm_bo_wait (bo=0xfffffe82f9fc0408, lazy=false, interruptible=false, no_wait=true)
     at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:1671
 #9  0xffffffff81195d15 in ttm_bo_cleanup_refs_and_unlock (bo=0xfffffe82f9fc0408, interruptible=false, no_wait_gpu=true)
     at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:516
 #10 0xffffffff81196108 in ttm_bo_delayed_delete (bdev=0xfffffe811d500160, remove_all=false)
     at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:621
 #11 0xffffffff811961da in ttm_bo_delayed_workqueue (work=0xfffffe811d500520)
     at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:650
 #12 0xffffffff80abf6a9 in linux_worker (wk=0xfffffe811d500520, arg=0xfffffe843e620f80)
     at /u1/builds/build185/src/sys/external/bsd/common/linux/linux_work.c:505
 #13 0xffffffff80cf85ef in workqueue_runlist (wq=0xfffffe843b5b7d00, list=0xfffffe843b5b7d70)
     at /u1/builds/build185/src/sys/kern/subr_workqueue.c:106
 #14 0xffffffff80cf86b2 in workqueue_worker (cookie=0xfffffe843b5b7d00)
     at /u1/builds/build185/src/sys/kern/subr_workqueue.c:133
 #15 0xffffffff80208747 in lwp_trampoline ()
 #16 0x0000000000000000 in ?? ()
 (gdb)
 ...
 (gdb) list
 166
 167     bool
 168     nouveau_fence_done(struct nouveau_fence *fence)
 169     {
 170             if (fence->channel)
 171                     nouveau_fence_update(fence->channel);
 172             return !fence->channel;
 173     }
 174
 175     static int
 (gdb) down
 #5  0xffffffff8114d577 in nouveau_fence_update (chan=0x0)
     at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_fence.c:132
 132             struct nouveau_fence_chan *fctx = chan->fence;
 (gdb) list
 127     }
 128
 129     static void
 130     nouveau_fence_update(struct nouveau_channel *chan)
 131     {
 132             struct nouveau_fence_chan *fctx = chan->fence;
 133             struct nouveau_fence *fence, *fnext;
 134
 135             spin_lock(&fctx->lock);
 136             list_for_each_entry_safe(fence, fnext, &fctx->pending, head) {
 (gdb) print chan
 $11 = (struct nouveau_channel *) 0x0
 (gdb) 
 
 "huh?"
 
 We just checked fence->channel for non-zero before the call to
 nouveau_fence_update(), and now it's suddenly zero?  Methinks there 
 are some locking issues happening here if the rug is getting pulled
 out that fast!  Also: are there other uses of fence->channel where it
 could suddenly change from something to 0 and cause issues?
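
The pattern in nouveau_fence_done() reads fence->channel twice: once for the check and once for the call argument. If another context clears the field between those two reads, the callee gets NULL even though the check passed. A minimal sketch of that hazard and one common mitigation (read the pointer once into a local and use only the local) follows. Every demo_* name here is hypothetical and stands in for the real nouveau structures; this is an illustration of the idea, not a proposed fix for this PR. Note that a snapshot only keeps the check and the use consistent. It does not stop the channel itself from being freed underneath us, which would still need a lock or a reference count.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real nouveau structures. */
struct demo_channel {
	int updates;		/* counts calls to demo_update() */
};

struct demo_fence {
	_Atomic(struct demo_channel *) channel;	/* cleared when signalled */
};

/* Plays the role of the callee that crashed on a NULL argument. */
static void
demo_update(struct demo_channel *chan)
{
	assert(chan != NULL);	/* the precondition the panic violated */
	chan->updates++;
}

/* Plays the role of whatever other context clears fence->channel. */
static void
demo_signal(struct demo_fence *fence)
{
	atomic_store(&fence->channel, NULL);
}

/*
 * Snapshot variant: load the pointer exactly once, then check and use
 * the local copy, so the check and the call cannot disagree.
 */
static bool
demo_fence_done(struct demo_fence *fence)
{
	struct demo_channel *chan = atomic_load(&fence->channel);

	if (chan != NULL)
		demo_update(chan);
	return atomic_load(&fence->channel) == NULL;
}
```

With this shape, a concurrent demo_signal() can still make the final read return "done", but it can no longer turn a successful check into a NULL argument.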
 
 (the machine worked fine for 8 days before this panic...)
 
 Later...
 
 Greg Oster
 

