Subject: Re: 1.6.2 panic
To: Jukka Marin <jmarin@embedtronics.fi>
From: Matthew Orgass <darkstar@city-net.com>
List: port-i386
Date: 03/12/2004 02:53:41
On 2004-03-11 jmarin@embedtronics.fi wrote:

> While typing in jed, my system paniced:
>
> uvm_fault(0xc04505e0, 0xfffff000, 0, 2) -> e
> fatal page fault in supervisor mode
> trap type 6 code 2 eip c01fa6a9 cs 8 eflags 10286 cr2 ffffffca cpl e000ffe7
> panic: trap
> syncing disks... 1 done
>
> dumping to dev 0,1 offset 525687
>
>
> % nm /netbsd | sort | more
> ...
> c01fa584 T callout_setsize
> c01fa5bc T callout_startup
> c01fa600 T callout_init
> c01fa614 T callout_reset                <---
> c01fa6f0 t callout_stop_locked
> c01fa768 T callout_stop
> ...
>
> gdb says:
>
> (gdb) target kcore  netbsd.7.core
> panic: trap
> #0  0x1 in ?? ()
> (gdb) bt
> #0  0x1 in ?? ()
> #1  0xc02d0a5b in cpu_reboot ()
> #2  0xc020ee8e in panic ()
> #3  0xc02da7c2 in trap ()
> #4  0xc0100bf7 in calltrap ()
> #5  0xc021213c in sys_select ()
> #6  0xc0337a8d in linux_select1 ()
> #7  0xc033795d in linux_sys_select ()
> #8  0xc033f6cb in linux_syscall_plain ()
> #9  0xc0100c78 in syscall1 ()
> can not access 0xbfbfcd00, invalid translation (invalid PDE)
> can not access 0xbfbfcd00, invalid translation (invalid PDE)
> Cannot access memory at address 0xbfbfcd00
>
> According to ps, the running process was acroread:
>
> USER     PID %CPU %MEM    VSZ RSS TT  STAT STARTED    TIME COMMAND
> jmarin 29191  0.0  0.0   5476   0 pk  R    Tue02PM 0:01.00 (acroread)
>
> I'm running a custom kernel, built from 1.6.2 sources.  I have never had
> a crash before installing this kernel (I was running 1.6.1 and 1.6.2_ALPHA
> before)..
>
> Any ideas?  Other than going back to the previous kernel version..

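  My guess is that the fault is in tsleep/callout_reset (the eip lies
between callout_reset and callout_stop_locked in your nm output) and that
the stack got squished a bit somehow by the trap, which would explain why
those frames are missing from the backtrace.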

  It looks like this is trying to dereference a bad tqh_last in
TAILQ_INSERT_TAIL in callout_reset.  I don't see how this might happen
other than memory corruption.
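
  To show why a bad tqh_last faults there, here is a small user-space
sketch using the TAILQ macros from <sys/queue.h>.  It is not the kernel
code: struct item and append_and_report_slot() are made-up names for
illustration only.  The point is just that TAILQ_INSERT_TAIL stores
through the pointer held in tqh_last, so a corrupted tqh_last becomes a
write to a bogus address.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a callout entry; not the kernel's struct. */
struct item {
	int id;
	TAILQ_ENTRY(item) link;
};
TAILQ_HEAD(item_head, item);

/*
 * Append `it` and return the slot that TAILQ_INSERT_TAIL wrote through,
 * i.e. the value tqh_last held before the insert.
 */
static struct item **
append_and_report_slot(struct item_head *head, struct item *it)
{
	struct item **slot = head->tqh_last;

	/* In the kernel nothing checks this; a garbage value simply faults. */
	assert(slot != NULL);

	/* Among other things this does: *head->tqh_last = it; */
	TAILQ_INSERT_TAIL(head, it, link);
	return slot;
}
```

  On an empty queue the slot is &head->tqh_first; afterwards it is the
tqe_next field of the previous tail.  If memory corruption replaces
tqh_last, the store goes wherever the garbage points.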


  While looking into this I noticed a softclock bug.  softclock could
fault or permanently skip items on the queue if the next item on the queue
is removed and overwritten in any of these cases: by the callout that is
being run, by an interrupt taken while the callout is being run, or while
interrupts are enabled at MAX_SOFTCLOCK_STEPS.  At least some SCSI drivers
have ISRs that can call callout_stop and pool_put, which could cause the
fault with DEBUG.  I don't know if it could be triggered without DEBUG.
The only real fix I can see would be to backport the new callout code,
although changing the "!= softclock_ticks" test to "> softclock_ticks"
would get any missed calls after the next round of the hash if the bucket
hint has been lowered.
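
  The general hazard can be sketched in user space as below.  This is
not the 1.6.2 softclock code, and struct job, run_job() and the two
walkers are made-up names.  It only shows what goes wrong when a queue
walker caches the next element and the handler it runs then unlinks that
element.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical queue entry; `victim` is unlinked when this job runs. */
struct job {
	int runs;
	int queued;
	struct job *victim;
	TAILQ_ENTRY(job) link;
};
TAILQ_HEAD(job_head, job);

static void
enqueue(struct job_head *q, struct job *j)
{
	TAILQ_INSERT_TAIL(q, j, link);
	j->queued = 1;
}

/* Stands in for a callout handler (or an ISR) that stops another entry. */
static void
run_job(struct job_head *q, struct job *j)
{
	j->runs++;
	if (j->victim != NULL && j->victim->queued) {
		TAILQ_REMOVE(q, j->victim, link);
		j->victim->queued = 0;
	}
}

/*
 * Hazardous walk: `next` is saved before the handler runs.  Returns 1 if
 * the saved pointer went stale; a real walker would go on to follow it.
 */
static int
walk_cached_next(struct job_head *q)
{
	struct job *j, *next;

	for (j = TAILQ_FIRST(q); j != NULL; j = next) {
		next = TAILQ_NEXT(j, link);
		run_job(q, j);
		if (next != NULL && !next->queued)
			return 1;	/* next was unlinked behind our back */
	}
	return 0;
}

/* Safer walk for this sketch: re-read the successor after the handler. */
static void
walk_reread_next(struct job_head *q)
{
	struct job *j;

	for (j = TAILQ_FIRST(q); j != NULL; j = TAILQ_NEXT(j, link))
		run_job(q, j);
}
```

  Re-reading the successor only helps while the handler leaves the
current element on the queue, which is an assumption of this sketch.  If
the stale element's memory has been reused in the meantime (pool_put
followed by a new allocation), following the cached pointer can fault or
jump to an unrelated chain and skip entries.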


Matthew Orgass
darkstar@city-net.com