Subject: Re: Data corruption with dump (mmap related??)
To: Wayne Knowles <w.knowles@niwa.cri.nz>
From: Chris G. Demetriou <cgd@sibyte.com>
List: port-mips
Date: 08/27/2000 01:10:01

Wayne Knowles <w.knowles@niwa.cri.nz> writes:
> What I did the other day tracking down the cause was overflush the cache.
> I added 32 to the size flushed and it didn't change the fault.  At one
> stage I flushed the entire cache to be sure.

(and those didn't help, i'm assuming. 8-)


> > if the only values that work are < your cache line size, this is
> > probably a flushing bug.  (if not, it's something in fault or TLB
> > handling.)
> 
> Cache line size is 4 bytes on a R3k... simple and easy - no L2 cache.  It
> is also a physically indexed cache so unless I'm mistaken it doesn't need
> any flushing to be coherent between context switches (section 6.2.1 of
> Schimmel).   
> We should have coherency between kernel and userland when the TLB entries
> get updated.

yah.


> > it seems to me that if it's not a cache problem, then it's gotta be
> > something weird happening when that next page is being mapped into
> > your address space, and for that i'd look at the 'pagefault' code in
> > trap.c...
> 
> Thanks... will take a closer look in that area.

more thinking aloud:

* you might check to see if some other chunk of space is being
clobbered with the data you want (some suspects might be the area
right _before_ the buffer read address, for instance).  (if it really
were this, i wouldn't expect it to get better by retrying though.)
(a more complex check would, upon detecting the error, troll /dev/mem
to look for the desired pattern probably in the first bytes of some
page, and see how that page and your process's physical page at that
VA relate.)

* does your DMA controller for your SCSI chip have any 'weird'
alignment requirements?  in particular, looking through your asc.c
code i note you have code to prime the DMA fifo if the start address
isn't block aligned, is there anything similar necessary for the
tail-end of the buffer, and in particular does a partial fifo's worth
of data at the end of an xfer get dumped into memory?  in
asc_dma_intr, i notice resid has a max value of 15, and you seem to
throw away any data in the FIFO in the !DMA_PULLUP case... if the
residual data doesn't get dumped into memory, you'd lose it here.
might add a printf, unless you know this isn't the cause.  (again,
this is one of those where if it really were the problem, i wouldn't
think that retrying the read would help.)  (I'd be surprised, but not
shocked, to know of bugs in two different drivers that happened to
behave the same way. 8-)  (BTW, in the priming code, why don't you use
read/write_multi_2? 8-)
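
a rough sketch of the "look for the desired pattern in the first bytes
of some page" check from the first item above.  everything here is
hypothetical and just for illustration: find_page_prefix() and its
parameters are made-up names, and it assumes you've already gotten a
snapshot of physical memory into a buffer somehow (e.g. by reading
/dev/mem).  it only finds candidate pages; relating a hit back to the
process's physical page at that VA is still up to you.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: scan a memory snapshot page by page and return
 * the offset of the first page, at or after `start`, whose leading
 * bytes match `pat`.  Returns -1 if no page starts with the pattern.
 * Only page-aligned offsets are compared, so a copy of the pattern in
 * the middle of a page is deliberately ignored.
 */
long
find_page_prefix(const unsigned char *mem, size_t len, size_t page_size,
    const unsigned char *pat, size_t pat_len, size_t start)
{
	size_t off;

	if (page_size == 0 || pat_len == 0 || pat_len > page_size)
		return -1;

	/* Round the starting offset up to the next page boundary. */
	off = (start + page_size - 1) / page_size * page_size;

	for (; off + pat_len <= len; off += page_size) {
		if (memcmp(mem + off, pat, pat_len) == 0)
			return (long)off;
	}
	return -1;
}
```

to walk every hit, call it again with start set to the previous result
plus page_size, and compare each offset against the physical address
you expected the data to land at.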



> It would also be nice to know if the problem exists on a R4000 machine...
> anyone else out there???

I can't easily help with that, sorry.



chris