Subject: Re: 1.6.2 panic
To: Jukka Marin <jmarin@embedtronics.fi>
From: Matthew Orgass <darkstar@city-net.com>
List: port-i386
Date: 03/12/2004 02:53:41
On 2004-03-11 jmarin@embedtronics.fi wrote:
> While typing in jed, my system paniced:
>
> uvm_fault(0xc04505e0, 0xfffff000, 0, 2) -> e
> fatal page fault in supervisor mode
> trap type 6 code 2 eip c01fa6a9 cs 8 eflags 10286 cr2 ffffffca cpl e000ffe7
> panic: trap
> syncing disks... 1 done
>
> dumping to dev 0,1 offset 525687
>
>
> % nm /netbsd | sort | more
> ...
> c01fa584 T callout_setsize
> c01fa5bc T callout_startup
> c01fa600 T callout_init
> c01fa614 T callout_reset <---
> c01fa6f0 t callout_stop_locked
> c01fa768 T callout_stop
> ...
>
> gdb says:
>
> (gdb) target kcore netbsd.7.core
> panic: trap
> #0 0x1 in ?? ()
> (gdb) bt
> #0 0x1 in ?? ()
> #1 0xc02d0a5b in cpu_reboot ()
> #2 0xc020ee8e in panic ()
> #3 0xc02da7c2 in trap ()
> #4 0xc0100bf7 in calltrap ()
> #5 0xc021213c in sys_select ()
> #6 0xc0337a8d in linux_select1 ()
> #7 0xc033795d in linux_sys_select ()
> #8 0xc033f6cb in linux_syscall_plain ()
> #9 0xc0100c78 in syscall1 ()
> can not access 0xbfbfcd00, invalid translation (invalid PDE)
> can not access 0xbfbfcd00, invalid translation (invalid PDE)
> Cannot access memory at address 0xbfbfcd00
>
> According to ps, the running process was acroread:
>
> USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
> jmarin 29191 0.0 0.0 5476 0 pk R Tue02PM 0:01.00 (acroread)
>
> I'm running a custom kernel, built from 1.6.2 sources. I have never had
> a crash before installing this kernel (I was running 1.6.1 and 1.6.2_ALPHA
> before)..
>
> Any ideas? Other than going back to the previous kernel version..
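I guess this is in tsleep/callout_reset (the eip falls inside callout_reset
in your nm output) and the stack got squished a bit somehow by the trap, so
those frames are missing from the backtrace?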
It looks like this is trying to dereference a bad tqh_last in
TAILQ_INSERT_TAIL in callout_reset. I don't see how this might happen
other than memory corruption.
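To show why a bad tqh_last is fatal here, this is a minimal sketch of the tail insert. The struct and function names are my own stand-ins, not the kernel's; only the four assignments follow what TAILQ_INSERT_TAIL in <sys/queue.h> does. The third assignment stores through tqh_last, so a corrupted value there faults on the write.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the <sys/queue.h> TAILQ types (illustrative names). */
struct item {
    int value;
    struct item *tqe_next;   /* next element, NULL at the tail */
    struct item **tqe_prev;  /* address of the previous element's next pointer */
};

struct item_head {
    struct item *tqh_first;  /* first element, NULL when the queue is empty */
    struct item **tqh_last;  /* address of the last next pointer in the queue */
};

/* An empty queue has tqh_last pointing at its own tqh_first. */
static void head_init(struct item_head *h)
{
    h->tqh_first = NULL;
    h->tqh_last = &h->tqh_first;
}

/* The same four steps as TAILQ_INSERT_TAIL. */
static void insert_tail(struct item_head *h, struct item *e)
{
    e->tqe_next = NULL;
    e->tqe_prev = h->tqh_last;
    *h->tqh_last = e;            /* write through tqh_last: faults if it is garbage */
    h->tqh_last = &e->tqe_next;
}
```

Calling head_init() once and then insert_tail() for each element builds the queue in order. If anything overwrites tqh_last in between, the next insert writes to whatever address it now holds.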
While looking into this I noticed a softclock bug. If the next item on
the queue in softclock is removed and overwritten, either while a callout
is being run, by an interrupt taken while the callout is being run, or when
interrupts are enabled at MAX_SOFTCLOCK_STEPS, then softclock could fault
or skip items on the queue permanently. At least some SCSI drivers have
ISRs that can call callout_stop and pool_put, which could cause the fault
with DEBUG. I don't know if it could be triggered without DEBUG. The only
real fix I can see would be to backport the new callout code, although
changing the != softclock_ticks test to > softclock_ticks would pick up any
missed calls after the next round of the hash if the bucket hint has been
lowered.
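The general hazard can be sketched in user space. This is not the 1.6.2 softclock code; every name below is hypothetical, and a "dead" flag stands in for memory that has been freed and overwritten. The first walker keeps a private copy of the next pointer across the handler call, so it lands on a removed node and loses the rest of the list. The second keeps the next pointer in a shared cursor that the remove routine updates, which is one common way to avoid the problem.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
    int dead;             /* set on removal; stands in for freed/overwritten memory */
    int ran;              /* set once this node's handler has been called */
    struct node *victim;  /* node that this node's handler removes, or NULL */
};

struct list {
    struct node *first;
    struct node *cursor;  /* next node the shared-cursor walker will visit */
};

/* Unlink n; if the walker was about to visit n, move the cursor past it. */
static void list_remove(struct list *l, struct node *n)
{
    struct node **pp = &l->first;

    while (*pp != NULL && *pp != n)
        pp = &(*pp)->next;
    if (*pp == NULL)
        return;                   /* n is not on the list */
    if (l->cursor == n)
        l->cursor = n->next;
    *pp = n->next;
    n->next = NULL;
    n->dead = 1;
}

/* A handler may remove another node, as an ISR or callout might. */
static void run_handler(struct list *l, struct node *n)
{
    n->ran = 1;
    if (n->victim != NULL)
        list_remove(l, n->victim);
}

/* Unsafe: caches next privately. Returns how many removed nodes it visited. */
static int walk_cached_next(struct list *l)
{
    struct node *n, *next;
    int stale = 0;

    for (n = l->first; n != NULL; n = next) {
        next = n->next;
        if (n->dead) {
            stale++;              /* a real kernel would be reading garbage here */
            continue;
        }
        run_handler(l, n);
    }
    return stale;
}

/* Safer: the next pointer lives in l->cursor, which list_remove() maintains. */
static void walk_shared_cursor(struct list *l)
{
    struct node *n;

    for (n = l->first; n != NULL; n = l->cursor) {
        l->cursor = n->next;
        run_handler(l, n);
    }
    l->cursor = NULL;
}
```

With a list a -> b -> c where a's handler removes b, walk_cached_next() visits the removed b and never runs c, while walk_shared_cursor() skips b and still runs c.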
Matthew Orgass
darkstar@city-net.com