Subject: Data corruption with dump (mmap related??)
To: None <port-mips@netbsd.org>
From: Wayne Knowles <w.knowles@niwa.cri.nz>
List: port-mips
Date: 08/25/2000 23:06:21
I have recently uncovered a serious problem with dump corrupting data
that appears to be specific to the mips ports.

At random 8k intervals (sometimes 2k or 4k), 4 bytes of the file are
getting corrupted.  Generally the replacement data is all 0's.

This problem was first witnessed on NetBSD/mipsco 1.5E but the problem can 
also be reproduced at will under NetBSD/pmax running 1.5B

The bug cannot be reproduced on NetBSD/alpha or NetBSD/sparc 1.4.2, which
eliminates dump itself as a possible cause.

This is the script used to reproduce the problem.  You will have to set
OUTDIR to a different disk partition from root to avoid problems.

------CUT HERE-----
#! /bin/sh

OUTDIR=/home/tmp

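# Create the target if needed; stop rather than restore into the wrong place
mkdir -p "$OUTDIR"
cd "$OUTDIR" || exit 1
dump -0f - / | restore -rf -

# Compare every file in /sbin against its restored copy
for F in /sbin/*
do
    echo "file: $F"
    cmp -l "$F" "$OUTDIR/$F"
done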
------CUT HERE-----

When the script runs on a mipsco or pmax system it produces errors like
the following:

file: /sbin/atactl
  8189 214   0
  8190 306   0
file: /sbin/badsect
file: /sbin/ccdconfig
  2045 217   0
  2046 231   0
  2047 200   0
  2048 130   0
 24573  24   0
 24574 100   0
 24575 377   0
 24576 272   0
 43005 257   0
 43006 264   0
 43008 340   0
 65533 217   0
 65534 231   0
 65535 202   0
 65536 224   0
   ......
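
In the listing above, every differing byte falls in the 4-byte word that
ends exactly on a 2k, 4k or 8k multiple (cmp -l offsets are 1-based, so
2045-2048 is the last word of the first 2k).  A minimal sketch for
checking this on longer listings follows; summarize_cmp is a made-up
helper name, not part of any NetBSD tool, and it assumes the usual
"offset byte1 byte2" layout of cmp -l output.

```shell
#! /bin/sh
# summarize_cmp: read `cmp -l` output on stdin and print, once per
# aligned 4-byte word that contains a differing byte, the 1-based offset
# of the last byte of that word and the largest of 8k/4k/2k that this
# offset is a multiple of.  Non-numeric lines ("file: ...") are skipped.
summarize_cmp() {
    awk '
    $1 ~ /^[0-9]+$/ {
        end = int(($1 + 3) / 4) * 4     # last byte of the enclosing word
        if (end in seen) next           # report each word only once
        seen[end] = 1
        align = "none"
        if (end % 2048 == 0) align = "2k"
        if (end % 4096 == 0) align = "4k"
        if (end % 8192 == 0) align = "8k"
        printf "word ending at %d: %s boundary\n", end, align
    }'
}
```

Usage would be along the lines of
    cmp -l /sbin/ccdconfig $OUTDIR/sbin/ccdconfig | summarize_cmp
For the ccdconfig lines shown above it reports words ending at 2048 (2k),
24576 (8k), 43008 (2k) and 65536 (8k).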


I'm pretty confident it is mmap related, as the following patch to dump,
which fills the mmap'ed region with 255 (0xff), also changes the 0 to 255
in the corrupted region:

Index: rcache.c
===================================================================
RCS file: /cvsroot/basesrc/sbin/dump/rcache.c,v
retrieving revision 1.4
diff -u -r1.4 rcache.c
--- rcache.c    1999/10/01 04:35:23     1.4
+++ rcache.c    2000/08/24 22:18:52
@@ -139,6 +139,7 @@
                    sizeof(struct cdesc) * cachebufs;
 
                memset(shareBuffer, '\0', sharedSize);
+               memset(cdata, (char) 0xff, nblksread *cachebufs*dev_bsize);
        }
 }

/*-----------------------------------------------------------------------*/

If the machine is performing other tasks (e.g. large compiles) there is a
higher chance of data corruption.   Also, 'dump -k 16' does not corrupt data
whereas the default (-k 32) does.  If your test passes the first time around
you might want to try -r 512 to allocate 512k in the mmap memory segment.
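
For anyone wanting to compare settings, here is a rough sketch that
wraps the reproduction script so it can be run with different dump
options.  count_bad and try_opts are hypothetical helper names; the only
dump flags assumed are the ones mentioned in this mail (-k and -r), and
each run should be given an empty output directory on a partition other
than root.

```shell
#! /bin/sh
# Sketch only: count_bad and try_opts are made-up helper names.

# count_bad SRC DST: print how many regular files in SRC differ from
# (or are missing in) DST, comparing files of the same name.
count_bad() {
    bad=0
    for F in "$1"/*
    do
        [ -f "$F" ] || continue
        cmp -s "$F" "$2/${F##*/}" || bad=$((bad + 1))
    done
    echo "$bad"
}

# try_opts OUTDIR [dump options...]: dump / into OUTDIR with the given
# extra options, then report how many files in /sbin came back different.
try_opts() {
    out=$1
    shift
    mkdir -p "$out" || return 1
    ( cd "$out" && dump -0f - "$@" / | restore -rf - ) || return 1
    echo "dump $*: $(count_bad /sbin "$out/sbin") corrupted file(s)"
}
```

For example:
    try_opts /home/tmp/k16 -k 16
    try_opts /home/tmp/k32 -k 32
    try_opts /home/tmp/r512 -r 512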

I would be interested in hearing reports from other MIPS machines,
in particular those that are R4000 based.  If we can cover all of the
ports a better picture might start to emerge as to the cause.

This is not the kind of bug we want lurking on a production system!!!!

Any feedback will be appreciated.

Wayne
-- 
  _____	   	Wayne Knowles,  Systems Manager
 / o   \/   	National Institute of Water & Atmospheric Research Ltd
 \/  v /\   	P.O. Box 14-901 Kilbirnie, Wellington, NEW ZEALAND
  `---'     	Email:   w.knowles@niwa.cri.nz