Subject: 2.0_RC4 and -current instability, data corruption and system hang ups
To: None <port-cobalt@netbsd.org, port-mips@netbsd.org>
From: Markus W Kilbinger <kilbi@rad.rwth-aachen.de>
List: port-cobalt
Date: 11/04/2004 17:06:05
Hi!

After playing with my 'new' qube2 for about 2 weeks, I couldn't get
rid of the following problems with -current (2.99.10 as of the last
days) and 2.0_RC4 kernels/userlands (cross compiled on NetBSD/i386):

- Corrupted files/data streams, mostly at 32-byte boundaries (and 32
  bytes in length), quite randomly spread, both when writing and when
  reading files.

  This only seems to happen under heavy load, especially when
  different i/o media are combined (e.g. ata and network). A simple
  example I am seeing at the moment comes from comparing 'identical'
  copies of the cobalt 2.0_RC4 iso image (about 109 MB): I copied my
  original netbsd-cobalt.iso file multiple times under different names
  into the same filesystem on my qube2's harddisk and compared the
  copies (cmp -l [version1] [version2]). I see both constant diffs
  (which happened during writing) and randomly appearing diffs (which
  happen during reading), mostly 32 bytes long!?

  I hope it's not an issue with my hardware; pkgsrc/sysutils/memtester
  at least passed multiple times w/o any error... (How can ata i/o be
  tested selectively?)

  Does anybody else see this problem?

  Are there known issues with mips/cobalt pmap/uvm/ubc stuff?

- Sudden hangups under heavier load, mostly under continuous i/o
  traffic (ata and/or network, beyond 100 MB of data volume). There is
  no panic message and no way to get into ddb (-> manual reset
  necessary).

  Maybe it's related to the same problem as before, but in a fatal
  variant...
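
The cmp -l comparison above can be summarized by grouping the
differing bytes into contiguous runs and checking how many of them
start on a 32-byte boundary and are exactly 32 bytes long. This is
only a minimal sketch: it assumes Python is available on some machine
that can read both copies, the names diff_runs and summarize are made
up for illustration, and LINE = 32 is simply the granularity observed
above, not a confirmed cache-line size.

```python
import sys

LINE = 32  # assumed block size, taken from the 32-byte pattern described above


def diff_runs(a, b):
    """Return (offset, length) for each contiguous run of differing bytes."""
    runs = []
    start = None
    n = min(len(a), len(b))  # only the common prefix is compared
    for i in range(n):
        if a[i] != b[i]:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, n - start))
    return runs


def summarize(runs, line=LINE):
    """Count runs that start on a block boundary and span exactly one block."""
    aligned = sum(1 for off, length in runs
                  if off % line == 0 and length == line)
    return {"runs": len(runs), "aligned_full_line": aligned}


def main(argv):
    # Reads both files fully into memory; a simplification for a sketch.
    with open(argv[1], "rb") as f1, open(argv[2], "rb") as f2:
        runs = diff_runs(f1.read(), f2.read())
    for off, length in runs:
        print("offset 0x%08x length %d" % (off, length))
    print(summarize(runs))


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

A corrupted 32-byte block can contain bytes that happen to match the
original, which splits it into several shorter runs, so the aligned
count is a lower bound rather than an exact figure.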


BTW: i/o performance of my qube2 is quite moderate (ftp or scp are
limited to 300-400 kB/sec for writing onto the qube2's disk; reading
is a bit faster (about 1 MB/sec), but still far below pure i/o
limitations). I don't know if this is an issue, too, or just some kind
of hardware limitation...


I've tried many kernels and userlands (-current and 2.0_RC4) w/ and
w/o different opts (-mips{2,3} -mtune=r5000, w/ and w/o MIPS3_PLUS);
it's always the same instability. Things like SOFTDEP and
NEW_BUFQ_STRATEGY worsen the situation in that the hangup occurs
earlier.

Under low load the machine seems to run fine, though ...

Is there anything else I can test to narrow down the problem?
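
One possible test for the disk path alone, as a rough sketch rather
than a proven method: write a file in which every 32-byte block
encodes its own block index, then read it back and list the blocks
that no longer match. A bad block then also shows which index its
content carries, which might hint at where stale data came from. The
function names and the block layout are made up for illustration, and
Python being available is an assumption. The read-back may be served
from the buffer cache instead of the disk, so a file larger than RAM
(or a remount in between) would probably be needed to exercise ata
reads.

```python
import os
import struct

BLOCK = 32  # block size matching the observed corruption granularity


def make_block(index):
    """One 32-byte block: the block index repeated as 8 little-endian uint32s."""
    return struct.pack("<8I", *([index] * 8))


def make_pattern(n_blocks):
    """Concatenate n_blocks self-describing blocks."""
    return b"".join(make_block(i) for i in range(n_blocks))


def bad_blocks(data):
    """Return the indices of blocks whose content does not match their index."""
    bad = []
    for i in range(len(data) // BLOCK):
        if data[i * BLOCK:(i + 1) * BLOCK] != make_block(i):
            bad.append(i)
    return bad


def write_and_verify(path, n_blocks):
    """Write the pattern to path, force it out with fsync, then read and check it."""
    with open(path, "wb") as f:
        f.write(make_pattern(n_blocks))
        f.flush()
        os.fsync(f.fileno())
    with open(path, "rb") as f:
        return bad_blocks(f.read())
```

Calling write_and_verify("/some/path/pattern.bin", 1 << 20) would
write a 32 MB test file (the path is a placeholder); an empty result
means every block read back intact.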

Any hint/help appreciated,

Markus.