Re: Severe deadlock issues with 5.0/MP

To: Michael <macallan%NetBSD.org@localhost>
Subject: Re: Severe deadlock issues with 5.0/MP
From: Anders Lindgren <ali%df.lth.se@localhost>
Date: Wed, 25 Feb 2009 10:19:04 +0100 (CET)

On Thu, 5 Feb 2009, Michael wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello,

On Feb 5, 2009, at 3:48 AM, Anders Lindgren wrote:
- Variants of build.sh [...] tools kernel=GENERIC.MP distribution:

a) build.sh -j 8 and output to console: hang within minutes.
b) build.sh      and output to console: hangs after a few hours
c) build.sh -j 8 > mk.log 2>&1 without tail: same as a)
d) build.sh      > mk.log 2>&1 without tail: see below
The first run of (d) stopped after a few hours with a zombie process named "(sparc64--netbsd)" (truncated name, but the logfile suggests the command was a sparc64--netbsd-install of some html documentation).
Do you have /anything/ involved with the build on an nfs filesystem?
I was actually able to ^C the build and restart it with build.sh -u, which seemed to crawl along -- but subjectively very slow; how long is a "distribution" supposed to take on a 400MHz USII, on the ballpark scale?
My dual 450MHz U60 usually does a full build in less than 8 hours.
What also helps is -pipe in CFLAGS.
Is there any chance that this is raidframe && MP-related? If it could be relevant, I'll install a fresh 5.0_RC1 on a single disk and try again. But I suppose I should try a LOCKDEBUG+DIAGNOSTIC kernel first...
My U60 has everything except a small partition for kernels on a raidframe stripeset. No problems with that.

So I got around to playing a little bit more; I booted _RC1 GENERIC.MP again, downloaded _RC2 from netbsd.org and successfully updated the system, so it certainly survives light, non-concurrent loads -- at least for a while.

For plain comparison, I tried to build a GENERIC.MP kernel with DIAGNOSTIC and LOCKDEBUG enabled, using build.sh -j 8 ... tools kernel=GENERIC.MP.

It deadlocked after 9 minutes.

Transcript from the ensuing monologue on IRC :-)

---8<----8<----
23:19 < ali-> hrrm interesting.. now I got latest rc2 to lockup my sparc64
              machine again.. all processes are locked up, but I get blank
              lines if I press return both on the serial console and the ssh
              session, so not all is dead.
23:19 < ali-> maybe just everything doing io...
23:20 < ali-> responds to ping, too.. but serial console breaks just yields a
              "^@" on the console...
---8<----8<----

Interestingly enough, I could ^C the foreground program running in my shell and got a prompt back. Trying to run any new program results in uninterruptible deadlock though -- but I still get blank lines if I press CR. Looks like everything deadlocks on disk io. The box still responds to ping. Trying to ssh again into the box results in successful (TCP-wise) connection establishment, but no ssh handshake is ever received from the server.
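
The "TCP connects, but no ssh handshake" symptom can be checked from another machine without involving an ssh client at all, since an SSH server normally sends its identification line ("SSH-2.0-...") right after the connection is established. The following is only a minimal sketch of such a probe; the function name, the status strings and the default timeout are made up for illustration and are not part of any existing tool.

```python
import socket


def probe_ssh_banner(host, port=22, timeout=5.0):
    """Distinguish 'TCP works' from 'sshd in userland actually answers'.

    Returns one of (hypothetical status names):
      ("no_connect", None)  TCP connection could not be established
      ("no_banner", None)   connected, but nothing arrived before the timeout
      ("closed", None)      connected, peer closed or reset without sending
      ("banner", text)      connected and the server sent its first line
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return ("no_connect", None)

    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(256)
        except socket.timeout:
            # The kernel completed the TCP handshake, but the server
            # process never wrote anything to the socket.
            return ("no_banner", None)
        except OSError:
            return ("closed", None)

    if not data:
        return ("closed", None)
    first_line = data.split(b"\n", 1)[0].strip()
    return ("banner", first_line.decode("ascii", errors="replace"))
```

A result of "no_banner" would be consistent with the kernel still accepting connections while the sshd process is stuck, which is what the behaviour described above suggests; it does not by itself say where the process is blocked.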

I'm going to replace the kernel with a DIAGNOSTIC+LOCKDEBUG one (should I add "DEBUG" and/or any of the other options, or will that just produce more noise?) when I can get one that boots (see below) and see what gives...


I have a few quick questions though:

- The kernels I cross-build on my amd64 won't boot. The bootloader says "read short header" on the first attempt to load any such kernel, and subsequently just won't load any other kernel built on the same box (but without issuing the "short header" message again). Does that make sense to anyone? [0] The kernels look fine to objdump -h and file(1).

- Serial console break doesn't work when NetBSD has assumed control. It always used to, and still works in OFW to e.g. interrupt boot. But somehow NetBSD GENERIC.MP won't break to ddb (but see above about those funny "^@":s). Very annoying, since I do have a serial console attached, but can't easily run down and power-cycle the box.

- If I type too fast or paste things into the serial console window (which I am accessing over ssh), lots of characters simply vanish, or are distorted. Anyone seen anything similar? I'm not sure whether I should suspect the kernel, or the unusual ssh -> lab firewall -> ssh -> console server -> cu -> serial line -> E3000 arrangement, but it's sure dropping and corrupting characters like nothing else. Doesn't seem to happen over ssh, only serial console.
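
One way to narrow down the dropped-character question in the last item is to send the same text one byte at a time with a pause in between: if slow input arrives intact and fast input does not, that points towards overrun or missing flow control somewhere in the chain rather than towards outright corruption. This is a rough sketch only; the function name and the 20 ms default delay are arbitrary choices for illustration, and how the stream gets connected to the console session is left open.

```python
import time


def write_throttled(stream, data, delay=0.02, sleep=time.sleep):
    """Write `data` (bytes) to `stream` one byte at a time.

    `stream` is any binary file-like object with write(); `delay` is the
    pause in seconds after each byte. `sleep` is injectable so the pacing
    can be replaced in tests. Returns the number of bytes written.
    """
    for i in range(len(data)):
        stream.write(data[i:i + 1])
        flush = getattr(stream, "flush", None)
        if flush is not None:
            flush()  # push each byte out immediately instead of buffering
        sleep(delay)
    return len(data)
```

Repeating the experiment with a few different delays would show roughly at which input rate characters start to disappear.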

[0] Though I have recently noticed single-byte file corruption on two occasions on that box. It's very stable and in use 24/7, so it doesn't quite sound like corrupt RAM, but double-checking SMART status on the drives reported positive reliability for all. Going to give it a day of memtest86+ for good measure...
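
For the cross-built kernels, the first few bytes of the file can also be inspected directly, independently of objdump and file(1). The sketch below reads the fixed start of the ELF header and checks for a 64-bit, big-endian, SPARC V9 image; the field offsets and the EM_SPARCV9 value come from the ELF specification, while the function names are invented here. What exactly makes the bootloader print "read short header" is not known from this thread, so the check only rules out a truncated or wrong-architecture file.

```python
import struct

EM_SPARCV9 = 43  # e_machine value for 64-bit SPARC in the ELF specification


def describe_elf_header(data):
    """Decode the start of an ELF header from `data` (bytes).

    Only the first 20 bytes are needed: e_ident (16 bytes), then e_type
    and e_machine (2 bytes each), stored in the file's own byte order.
    """
    if len(data) < 20:
        raise ValueError("file too short to contain an ELF header")
    if data[:4] != b"\x7fELF":
        raise ValueError("missing ELF magic")
    ei_class, ei_data = data[4], data[5]
    if ei_data not in (1, 2):
        raise ValueError("unknown ELF data encoding")
    endian = "<" if ei_data == 1 else ">"
    e_type, e_machine = struct.unpack_from(endian + "HH", data, 16)
    return {
        "bits": {1: 32, 2: 64}.get(ei_class),
        "byte_order": "little" if ei_data == 1 else "big",
        "type": e_type,
        "machine": e_machine,
    }


def looks_like_sparc64_image(data):
    """True if the header describes a 64-bit big-endian SPARC V9 file."""
    hdr = describe_elf_header(data)
    return (hdr["bits"] == 64
            and hdr["byte_order"] == "big"
            and hdr["machine"] == EM_SPARCV9)
```

Reading the first 20 bytes of a kernel file in binary mode and passing them to looks_like_sparc64_image() would be enough; given the single-byte corruption mentioned in [0], comparing checksums of the file on the build host and on the target would be a sensible companion check.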


Best regards,
ali:)

Follow-Ups:
- Re: Severe deadlock issues with 5.0/MP
  - From: Jochen Kunz

References:
- re: Severe deadlock issues with 5.0/MP
  - From: matthew green
- re: Severe deadlock issues with 5.0/MP
  - From: Anders Lindgren
- Re: Severe deadlock issues with 5.0/MP
  - From: Michael
