tech-net archive


Re: net interface queue corruption (was Re: checking m->m_pkthdr.csum_flags in ip_output())



On Mon, May 19, 2008 at 12:36:18PM -0500, David Young wrote:
> On Sat, May 17, 2008 at 01:37:58PM +0100, Patrick Welche wrote:
> > On Fri, May 16, 2008 at 05:14:10PM -0500, David Young wrote:
> > > On Fri, May 16, 2008 at 03:05:03PM +0100, Patrick Welche wrote:
> > > > Don't know whether relevant, but a 4.99.60/i386 box with bge gave:
> > > > 
> > > > uvm_fault(0xcdfae574, 0, 1) -> 0xe
> > > > kernel: supervisor trap page fault, code=0
> > > > Stopped in pid 22172.1 (dhcpd) at       0xc03a6f25:     movl    0x14(%eax),%eax
> > > > db{1}> bt/l
> > > > m_length(0,0,cd985abc,c0377c4f,5) at 0xc03a6f25
> > > > bpf_mtap(c2d822c0,0,cd985aec,c03a8f5d,cd985a05) at netbsd:bpf_mtap+0x17
> > > > bge_start(c2da7004,178,9000003,3,0) at netbsd:bge_start+0x10c
> > > > ifq_enqueue(c2da7004,c3111300,c2da7004,2,cdfae574) at netbsd:ifq_enqueue+0x13f
> > > > ether_output(c2da7004,c3111300,c06077a0,0,c06077a0) at netbsd:ether_output+0x71e
> > > > bpf_write(cdc82300,cdc82300,cd985c60,d5bf99c0,1) at netbsd:bpf_write+0x126
> > > > do_filewritev(7,bfbfc668,3,cdc82300,1) at netbsd:do_filewritev+0x270
> > > > sys_writev(cdfac900,cd985d04,cd985cfc,cd985d10,c03d0d79) at netbsd:sys_writev+0x3f
> > > > syscall(cd985d48,b3,ab,bfbf001f,bfbf001f) at netbsd:syscall+0x141
> > > > 
> > > > yesterday...
> > > 
> > > It looks like IFQ_POLL()/IFQ_DEQUEUE() did not honor their contract.  In
> > > order to reach the bpf_mtap() statement, IFQ_POLL() had to return m_head
> > > != NULL.  According to altq(9), "It is guaranteed that IFQ_DEQUEUE()
> > > immediately after IFQ_POLL() returns the same packet."
> > > 
> > > Are you using ALTQ?
> > 
> > No altq - also this was a kernel from 21st April - I hope I didn't
> > hijack Takahiro's thread - just noticed that they were both with
> > bge.
> 
> Is this an SMP box?  I don't know how this could happen unless a second
> thread or an interrupt handler ran bge_start() simultaneously with the
> thread where the fault occurred.  Looking at the bge(4) code, I don't
> see how that could happen.

Without looking at the code, it seems that the bpf fileops need to take
kernel_lock.
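
To make the suspected interleaving concrete, here is a minimal userland
sketch. It is not taken from the bge(4) or bpf(4) sources. The types and
functions (struct pkt, struct queue, q_poll, q_dequeue, q_trylock,
race_unlocked, race_locked) are hypothetical stand-ins for the mbuf, the
interface queue, IFQ_POLL()/IFQ_DEQUEUE() and kernel_lock, and the
"second caller" is simulated inline rather than run on a real second CPU.
It assumes that the m_length(0, ...) frame in the backtrace means a NULL
mbuf reached bpf_mtap().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the NetBSD mbuf or ifqueue types. */
struct pkt { struct pkt *next; int len; };
struct queue { struct pkt *head; int locked; };

/* Like IFQ_POLL(): look at the head without removing it. */
static struct pkt *q_poll(struct queue *q) { return q->head; }

/* Like IFQ_DEQUEUE(): remove and return the head, or NULL if empty. */
static struct pkt *q_dequeue(struct queue *q)
{
    struct pkt *p = q->head;
    if (p != NULL) {
        q->head = p->next;
        p->next = NULL;
    }
    return p;
}

/* Toy non-blocking lock standing in for kernel_lock. */
static int q_trylock(struct queue *q)
{
    if (q->locked)
        return 0;
    q->locked = 1;
    return 1;
}

static void q_unlock(struct queue *q) { q->locked = 0; }

/*
 * No lock: caller A polls, a second caller B dequeues in between,
 * then A dequeues. Returns what A's dequeue yields.
 */
struct pkt *race_unlocked(struct queue *q)
{
    struct pkt *a_seen = q_poll(q);   /* A: sees a packet */
    if (a_seen == NULL)
        return NULL;
    (void)q_dequeue(q);               /* B: takes the packet A just saw */
    return q_dequeue(q);              /* A: NULL if only one was queued */
}

/*
 * With the lock held across poll + dequeue, B cannot get in between,
 * so the poll/dequeue contract holds for A.
 */
struct pkt *race_locked(struct queue *q)
{
    struct pkt *a_seen, *a_got;

    if (!q_trylock(q))
        return NULL;
    a_seen = q_poll(q);
    if (a_seen == NULL) {
        q_unlock(q);
        return NULL;
    }
    if (q_trylock(q)) {               /* B: fails, A holds the lock */
        (void)q_dequeue(q);
        q_unlock(q);
    }
    a_got = q_dequeue(q);
    assert(a_got == a_seen);          /* same packet as the poll saw */
    q_unlock(q);
    return a_got;
}
```

With a single packet queued, race_unlocked() hands caller A a NULL
pointer even though its poll succeeded, which is the shape of failure
that the m_length(0, ...) frame suggests. race_locked() returns the
polled packet. Whether the real bpf_write() path is missing exactly this
serialization is still a guess until someone reads the code.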

Andrew


