Port-sparc64 archive


Re: Severe deadlock issues with 5.0/MP



On Thu, 5 Feb 2009, Michael wrote:


Hello,

On Feb 5, 2009, at 3:48 AM, Anders Lindgren wrote:

- Variants of build.sh [...] tools kernel=GENERIC.MP distribution:

a) build.sh -j 8 and output to console: hang within minutes.
b) build.sh      and output to console: hangs after a few hours
c) build.sh -j 8 > mk.log 2>&1 without tail: same as a)
d) build.sh      > mk.log 2>&1 without tail: see below

The first run of (d) stopped after a few hours with a zombie process named "(sparc64--netbsd)" (truncated name, but the logfile suggests the command was a sparc64--netbsd-install of some html documentation).

Do you have /anything/ involved with the build on an nfs filesystem?

I was actually able to ^C the build and restart it with build.sh -u, which seemed to crawl along -- but subjectively very slowly. Roughly how long is a "distribution" supposed to take on a 400MHz USII?

My dual 450MHz U60 usually does a full build in less than 8 hours.
What also helps is -pipe in CFLAGS.

Is there any chance that this is raidframe && MP-related? If it could be relevant, I'll install a fresh 5.0_RC1 on a single disk and try again. But I suppose I should try a LOCKDEBUG+DIAGNOSTIC kernel first...

My U60 has everything except a small partition for kernels on a raidframe stripeset. No problems with that.

So I got around to playing a little bit more; I booted _RC1 GENERIC.MP again, downloaded _RC2 from netbsd.org and successfully updated the system, so it certainly survives light, non-concurrent loads -- at least for a while.

For plain comparison, I tried to build a GENERIC.MP kernel with DIAGNOSTIC and LOCKDEBUG enabled, using build.sh -j 8 ... tools kernel=GENERIC.MP.
It deadlocked after 9 minutes.
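
For reference, a minimal sketch of how such a kernel configuration could be set up, written as a shell snippet. The config name GENERIC.MP.DEBUG and the CONF_DIR variable are made up for illustration; the include and options lines follow the usual config(5) syntax. This is an assumption about the setup, not necessarily the exact configuration used here.

```shell
#!/bin/sh
# Hypothetical config name; not taken from the original post.
CONF_NAME="GENERIC.MP.DEBUG"
# Where to write the file. In a real source tree this would be
# sys/arch/sparc64/conf; a scratch directory is used by default.
CONF_DIR="${CONF_DIR:-$(mktemp -d /tmp/kconf.XXXXXX)}"

cat > "$CONF_DIR/$CONF_NAME" <<'EOF'
# Start from the stock MP kernel and add the debugging options.
include "arch/sparc64/conf/GENERIC.MP"

options 	DIAGNOSTIC	# extra internal consistency checks
options 	LOCKDEBUG	# expensive locking checks
EOF

echo "wrote $CONF_DIR/$CONF_NAME"
```

With the file placed in the source tree, the build target would become kernel=GENERIC.MP.DEBUG instead of kernel=GENERIC.MP.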

Transcript from the ensuing monologue on IRC :-)

---8<----8<----
23:19 < ali-> hrrm interesting.. now I got latest rc2 to lockup my sparc64
              machine again.. all processes are locked up, but I get blank
              lines if I press return both on the serial console and the ssh
              session, so not all is dead.
23:19 < ali-> maybe just everything doing io...
23:20 < ali-> responds to ping, too.. but serial console breaks just yields a
              "^@" on the console...
---8<----8<----

Interestingly enough, I could ^C the foreground program running in my shell and got a prompt back. Trying to run any new program results in an uninterruptible deadlock though -- but I still get blank lines if I press CR. Looks like everything deadlocks on disk I/O. The box still responds to ping. Trying to ssh into the box again results in successful (TCP-wise) connection establishment, but no ssh handshake is ever received from the server.

I'm going to replace the kernel with a DIAGNOSTIC+LOCKDEBUG one (should I add "DEBUG" and/or any of the others, or will that just produce more noise?) once I can get one that boots (see below) and see what gives...

I have a few quick questions though:

- The kernels I cross-build on my amd64 won't boot. The bootloader says "read short header" on the first attempt to load any such kernel, and subsequently just won't load any other kernel built on the same box (but without issuing the "short header" message again). Does that make sense to anyone? [0] The kernels look fine to objdump -h and file(1).

- Serial console break doesn't work when NetBSD has assumed control. It always used to, and still works in OFW to e.g. interrupt boot. But somehow NetBSD GENERIC.MP won't break to ddb (but see above about those funny "^@":s). Very annoying, since I do have a serial console attached, but can't easily run down and power-cycle the box.

- If I type too fast or paste things into the serial console window (which I am accessing over ssh), lots of characters simply vanish, or are distorted. Anyone seen anything similar? I'm not sure whether I should suspect the kernel, or the unusual ssh -> lab firewall -> ssh -> console server -> cu -> serial line -> E3000 arrangement, but it's sure dropping and corrupting characters like nothing else. Doesn't seem to happen over ssh, only serial console.
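
Regarding the cross-built kernels in the first question: one cheap extra check, independent of objdump and file(1), is to look at the ELF identification bytes directly. The sketch below is only an illustration (the function name elf_ident is made up), and it would not by itself explain the "read short header" message. It just reports whether the file starts with the ELF magic, and which class and byte order the header claims.

```shell
#!/bin/sh
# elf_ident FILE: print "<class> <byte order>" from the ELF ident bytes,
# or "not-elf" (and return 1) if the ELF magic is missing.
elf_ident() {
    # First 6 bytes as hex words: 7f 45 4c 46 <class> <data>
    set -- $(od -An -tx1 -N6 "$1")
    if [ "$1 $2 $3 $4" != "7f 45 4c 46" ]; then
        echo "not-elf"
        return 1
    fi
    case "$5" in
        01) class=ELF32 ;;
        02) class=ELF64 ;;
        *)  class=unknown ;;
    esac
    case "$6" in
        01) order=LSB ;;
        02) order=MSB ;;
        *)  order=unknown ;;
    esac
    echo "$class $order"
}
```

Running elf_ident on a cross-built kernel and on a known-good one from the release sets should give the same answer; if it does not, the toolchain or the transfer is suspect rather than the bootloader.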

[0] Though I have recently noticed single-byte file corruption on two occasions on that box. It's very stable and in use 24/7, so it doesn't quite sound like corrupt RAM, and double-checking SMART status on the drives reported positive reliability for all. Going to give it a day of memtest86+ for good measure...
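
Given the corruption mentioned in [0], a simple way to rule out a damaged kernel image is to compare checksums of two copies, for example the one in the build directory and one copied back from the target. A minimal sketch using POSIX cksum; the function name same_file_sum is made up for illustration.

```shell
#!/bin/sh
# same_file_sum A B: succeed if both files have the same cksum CRC and length.
# Returns 2 if either file cannot be read.
same_file_sum() {
    # Read via stdin so the file name is not part of cksum's output.
    sum_a=$(cksum < "$1") || return 2
    sum_b=$(cksum < "$2") || return 2
    [ "$sum_a" = "$sum_b" ]
}
```

When the two copies live on different machines, running cksum on each side and comparing the numbers by eye does the same job.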

Best regards,
ali:)

