Mixed -current MP results

To: port-sparc%NetBSD.org@localhost
Subject: Mixed -current MP results
From: Hauke Fath <hauke%Espresso.Rhein-Neckar.DE@localhost>
Date: Fri, 6 May 2011 10:19:12 +0200

All,

so I took the plunge and upgraded my netbsd-4 SPARCstation 20 (2x
SM71) to -current two weeks ago. Mixed results, quite mixed.

On the up side, I don't see any "random" userland crashes. And when the
machine crashes, it doesn't lock up, as netbsd-4 used to, but reboots. And
it reboots quickly, thanks to "-o log", instead of spending fscking 20
minutes on the 70 GB disk. Most of the installed netbsd-4 pkgsrc userland
is fine, with the notable exception of sendmail dumping core, squid working
on 5_99_49, but silently failing on 5_99_51, and XEmacs being dodgy.

Kudos to those who pulled NetBSD/sparc kicking and screaming back to a
usable state.

On the down side... if the machine doesn't keel over running the daily /
security script, it will certainly die during the following Amanda backup
run. So far, I have had one (1) successful Amanda run in the last two weeks.
The (fairly reproducible) panic is

<snip>
Mutex error: mutex_vector_enter: locking against myself

lock address : 0x00000000f4ce8170 type     :     sleep/adaptive
initialized  : 0x00000000f02258f4
shared holds :                  0 exclusive:                  0
shares wanted:                  0 exclusive:                  2
current cpu  :                  0 last held:                  0
current lwp  : 0x00000000f358d580 last held: 000000000000000000
last locked  : 0x00000000f0214bec unlocked*: 0x00000000f0214c48
owner field  : 0x00000000f358d580 wait/spin:                1/0

Turnstile chain at 0xf02e572c.
=> Turnstile at 0xf358e9d8 (wrq=0xf358e9e8, rdq=0xf358e9f0).
=> 0 waiting readers:
=> 1 waiting writers: 0xf359a8e0

panic: LOCKDEBUG
cpu0: Begin traceback...
0x0(0xf4ccc240, 0x0, 0xf02782d0, 0xf0296050, 0x1, 0xf02e3800) at netbsd:mutex_enter+0x364
mutex_enter(0xf4ce8170, 0xf358d580, 0xf4ce8170, 0xf02e45d0, 0xf02db218, 0x1) at netbsd:biodone2+0x8
biodone2(0xf12cbc88, 0x0, 0x0, 0x0, 0x51f06dbf, 0xa0000020) at netbsd:biointr+0x44
biointr(0x0, 0x0, 0xf00e99c4, 0x0, 0xf358cd40, 0xf00027e0) at netbsd:softint_thread+0x74
softint_thread(0xf3630008, 0xf358d580, 0xf02d13b0, 0xf02cb7c0, 0xf3582974, 0xf02e0d2e) at netbsd:lwp_setfunc_trampoline
cpu0: End traceback...
Frame pointer is at 0xf3635c00
Call traceback:
  pc = 0xf00fe2b0  args = (0xf02dd224, 0x0, 0xf02dd224, 0xf02d0400, 0x75, 0xffffffff, 0xf3635c68) fp = 0xf3635c68
  pc = 0xf01ad1e0  args = (0x104, 0x0, 0xefffffff, 0xf3635f20, 0xf01ac4dc, 0x1, 0xf3635cd8) fp = 0xf3635cd8
  pc = 0xf01a55b4  args = (0xf02a63e0, 0xf01ac594, 0xf02782d0, 0xf02c8800, 0xf02cdc00, 0x104, 0xf3635d48) fp = 0xf3635d48
  pc = 0xf00d96c0  args = (0xf4ccc240, 0x0, 0xf02782d0, 0xf0296050, 0x1, 0xf02e3800, 0xf3635db0) fp = 0xf3635db0
  pc = 0xf0214bec  args = (0xf4ce8170, 0xf358d580, 0xf4ce8170, 0xf02e45d0, 0xf02db218, 0x1, 0xf3635e18) fp = 0xf3635e18
  pc = 0xf0214cf8  args = (0xf12cbc88, 0x0, 0x0, 0x0, 0x51f06dbf, 0xa0000020, 0xf3635e80) fp = 0xf3635e80
  pc = 0xf00e9898  args = (0x0, 0x0, 0xf00e99c4, 0x0, 0xf358cd40, 0xf00027e0, 0xf3635ee8) fp = 0xf3635ee8
  pc = 0xf0007d28  args = (0xf3630008, 0xf358d580, 0xf02d13b0, 0xf02cb7c0, 0xf3582974, 0xf02e0d2e, 0xf3635f50) fp = 0xf3635f50
  pc = 0x0  args = (0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0) fp = 0x0

dump to dev 7,1 not possible
sd0: cache synchronization failed
rebooting
</snip>
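
For anyone not used to reading LOCKDEBUG output: the owner field and the current lwp in the dump above are the same address (0xf358d580), which is what "locking against myself" reports, namely an LWP trying to acquire a mutex it already holds. The following is only a toy Python model of that kind of owner check, not the NetBSD implementation; the class and method names are made up for illustration.

```python
import threading


class LockingAgainstMyself(RuntimeError):
    """Raised when the current holder tries to take the mutex again."""


class OwnerCheckedMutex:
    """Toy non-recursive mutex that remembers its owner (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None  # thread ident of the current holder, or None

    def enter(self):
        me = threading.get_ident()
        # Only the holder itself can have stored its own ident here, so
        # this unlocked read is enough to detect self-deadlock.
        if self._owner == me:
            raise LockingAgainstMyself("locking against myself")
        self._lock.acquire()
        self._owner = me

    def exit(self):
        if self._owner != threading.get_ident():
            raise RuntimeError("exit by non-owner")
        self._owner = None
        self._lock.release()


if __name__ == "__main__":
    m = OwnerCheckedMutex()
    m.enter()
    try:
        m.enter()  # second acquisition by the same thread
    except LockingAgainstMyself as err:
        print("caught:", err)
    m.exit()
```

Without such a check the second enter() would simply block forever; with LOCKDEBUG the kernel panics instead.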

After reboot, I get a mildly disquieting

<snip>
WARNING: negative runtime; monotonic clock has gone backwards
</snip>

Every now and then, I see dig(1) and nsupdate(8) busy-looping at 100% cpu,
and have to "kill -9" them. named(8) generally seems to be dodgy, and
should probably be built without threads on sparc.

Last, but not least: -current is slow. The machine runs a custom !DEBUG,
!DIAGNOSTIC, LOCKDEBUG kernel. The /etc/daily cron job used to take about
an hour, now it's about four hours. Typical Amanda backup times:

<snip>
amanda run on -current

HOSTNAME     DISK        L  ORIG-KB   OUT-KB COMP%  MMM:SS   KB/s  MMM:SS   KB/s
-------------------------- -------------------------------------- --------------
pizza        ccd0b       1     5940     5940   --     0:45  132.3    N/A    N/A
pizza        ccd0d       1    10977      955   8.7    1:57    8.2    N/A    N/A
pizza        ccd0e       0 12103632  2679703  22.1  403:21  110.7    N/A    N/A


amanda run on netbsd-4

HOSTNAME     DISK        L  ORIG-KB   OUT-KB COMP%  MMM:SS   KB/s  MMM:SS   KB/s
-------------------------- -------------------------------------- --------------
pizza        ccd0b       1     5940     5940   --     0:08  719.5    N/A    N/A
pizza        ccd0d       1    10967      953   8.7    0:35   26.9    N/A    N/A
pizza        ccd0e       0 12105495  2681853  22.2  290:03  154.1    N/A    N/A
</snip>
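
To put a number on the difference, here is a small Python sketch that turns the dump times and rates from the two reports above into ratios. The figures are copied from the tables; the dictionary layout and function names are just my own choices for illustration.

```python
# Dump times (MMM:SS) and dumper rates (KB/s) copied from the two
# Amanda reports above; nothing here is measured anew.
RUNS = {
    "ccd0b": {"current": ("0:45", 132.3), "netbsd-4": ("0:08", 719.5)},
    "ccd0d": {"current": ("1:57", 8.2), "netbsd-4": ("0:35", 26.9)},
    "ccd0e": {"current": ("403:21", 110.7), "netbsd-4": ("290:03", 154.1)},
}


def to_seconds(mmm_ss):
    """Convert an Amanda MMM:SS field to seconds."""
    minutes, seconds = mmm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


def slowdown(disk):
    """Return (time ratio, rate ratio) of -current relative to netbsd-4."""
    cur_time, cur_rate = RUNS[disk]["current"]
    old_time, old_rate = RUNS[disk]["netbsd-4"]
    return to_seconds(cur_time) / to_seconds(old_time), old_rate / cur_rate


if __name__ == "__main__":
    for disk in RUNS:
        time_ratio, rate_ratio = slowdown(disk)
        print(f"{disk}: {time_ratio:.2f}x longer, "
              f"{rate_ratio:.2f}x lower throughput")
```

By this arithmetic the two small level-1 dumps take roughly 5.6x and 3.3x as long on -current, and the large level-0 dump of ccd0e roughly 1.4x.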

Generally, I see a much higher percentage of system time than I've been
used to, even under moderate load, and - a bit disturbing - a much higher
percentage of interrupt time, especially during disk activity (which seems
to be slower than on netbsd-4, see the Amanda numbers), with spikes up to 30%.
I don't know, though, if the latter is a property of the -current MP kernel
changes, or a quirk of MD sparc code.

Given the loss of speed, I am seriously thinking about going back to
netbsd-4. I'll sorely miss "-o log", though...

Comments?

        hauke




--
"It's never straight up and down"     (DEVO)
