Subject: port-mips/29395: r5k (specific?) cache problem / data corruption
To: None <firstname.lastname@example.org, email@example.com>
From: Markus W Kilbinger <firstname.lastname@example.org>
Date: 02/15/2005 22:18:00
>Synopsis: r5k (specific?) cache problem / data corruption
>Arrival-Date: Tue Feb 15 22:18:00 +0000 2005
>Release: NetBSD 2.99.15, netbsd-2... branch also
System: NetBSD qube 2.99.15 NetBSD 2.99.15 (QUBE) #3: Tue Feb 15 17:04:08 MET 2005 kilbi@qie:/usr/src/sys/arch/cobalt/compile/QUBE cobalt
For several months now (as long as I have owned a qube2) I have
observed some kind of data corruption on my qube2 when handling
'larger' amounts of data / disk access. I first noticed this
while installing new userland *.tgz sets, which ended up with a
non-working system (libc.so... was corrupted).
In the following, more systematic approach, disk access (viaide,
wd0) was always involved when a data corruption occurred. I did
not notice these data corruptions in pure network traffic (I
use the qube2 as a router) or RAM access
(pkgsrc/sysutils/memtester showed no errors).
The corruptions always occur at 32-byte boundaries and in
32-byte sizes (see below), which matches my qube2's CPU cache
line size very well:
cpu0 at mainbus0: QED RM5200 CPU (0x28a0) Rev. 10.0 with built-in FPU Rev. 10.0
cpu0: 32KB/32B 2-way set-associative L1 Instruction cache, 48 TLB entries
cpu0: 32KB/32B 2-way set-associative write-back L1 Data cache
The data corruption seems to occur both while reading from and
writing to disk (see below).
As Izumi Tsutsui <email@example.com> noted, this problem
seems to affect other platforms with the same CPU type, too (an
R5000 O2 sgimips in his case):
The problem can be diminished (not avoided!) with:
- Putting some additional CPU load onto my qube2: e.g. for
  installing new *.tgz sets I run 'nice pax -zvrpe ...' over a
  ssh connection, so that pax's '-v' verbose output produces
  some additional load, which prevents most file corruptions.
- Compiling the kernel with higher optimization (-O3 -mtune=r5000).
First I copied a 100 MB file multiple times onto the machine's
hard disk and compared each copy ('cmp') with the original
file. This revealed the above-mentioned, quite randomly spread
mismatches at 32-byte boundaries and of 32-byte size.
Repeating just the 'cmp's (with no new file copying) revealed
varying mismatches between the files from time to time.
On the advice of Chuck Silvers <firstname.lastname@example.org> I wrote a small
pattern generator (a C program), which generates/writes large
files containing consecutive int (4-byte) numbers, to better
distinguish whether the corruption is write and/or read related.
Within this specific scenario all data corruptions occurred
while writing the data to disk. Reading a single file back and
comparing it against the consecutive numbering has shown no
data corruption so far, in contrast to 'cmp' (two files open
simultaneously), which did show data corruptions.
The following is an example of all data corruptions that
occurred after writing a 100 MB file with my pattern generator.
The first column shows the expected/generated consecutive
number (4-byte int, starting with 00000000), the second the
read/corrupted value from the test file. Each line consequently
represents 4 bytes of data.
If I understood Chuck correctly, he suspects some kind of
interaction problem between bus_dma and the r5k cache handling
(missing cache (line) invalidation?).