Subject: Re: strange coredump during telnet compile
To: Mykal Funk <netbsd-help@NetBSD.org>
From: Timothy A. Musson <timothy.musson@zin-tech.com>
List: netbsd-help
Date: 01/27/2004 16:42:44
At 03:59 PM 1/27/04 , Mykal Funk wrote:
>I am unsure if this is the correct list for this question. If i erred
>in this regard, please inform me as to the proper list for my query.
>
>I run a custom NetBSD 1.6.1 kernel on a Cyrix 486DCL machine. I 
>successfully compiled the kernel with 8 MB of core and 518 MB of disc
>space. A couple of days ago, i salvaged an old hulk that someone 
>dumped at the local recycle center. After inspecting the hardware and
>determining what was usable and what was paperweights, i removed the
>two 4 MB core chips and installed a single 16 MB chip. I also grafted 
>an additional 1.6 GB harddisc onto the system. 
>
>Since i don't have an extra monitor for the box, i access it via 
>telnet from my MegaKludge95 box. [I won't bore you with the long drawn
>out reasons it takes forever to unlock yourself from the clutches of 
>the Evil Empire once you have become a Convert of the Unix Way. ;)] I 
>find that this set up allows me to transfer my data from proprietary 
>binary data formats to Unix by using the "cut and paste" method of data
>conversion via the telnet window.

Since you brought this up: Are these M$ Office documents? If so, try
OpenOffice.org.

>This long explanation is needed to explain the background of my problem.

Not really. Many people on this list remote-admin machines or have headless
machines at home. :)

>When i compile code from pkgsrc, about midway through the process, the

What code?

>machine dumps core and dies. I have to snag a monitor cable from a 

And you say this why? Did you see the core dump and related info before
you lost your telnet connection, or did you just get disconnected? Did it
not respond to another telnet attempt? If you saw a core dump, you should
capture it and post it.
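
One way to capture it, as a rough sketch only: run the build through a small
wrapper that copies every line of output into a log file and syncs the file
to disk as it goes, so the last lines before a hang or crash have a chance of
surviving. This assumes Python is installed on the box, which nothing in this
thread confirms; run_and_log and the log path are names made up for the
example, and a badly trashed partition could still take the log with it.

```python
import os
import subprocess
import sys


def run_and_log(cmd, log_path):
    """Run cmd, echo its output, and append every line to log_path.

    Hypothetical helper: each line is flushed and fsync'ed so the log is
    as complete as possible if the machine dies in the middle of the run.
    """
    with open(log_path, "ab") as log:
        with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT) as proc:
            for line in proc.stdout:
                # Show the line on the terminal as usual.
                sys.stdout.write(line.decode(errors="replace"))
                sys.stdout.flush()
                # Write the same line to the log and force it to disk.
                log.write(line)
                log.flush()
                os.fsync(log.fileno())
            # Return the command's exit status (negative if killed by a signal).
            return proc.wait()
```

Something like run_and_log(["make"], "/var/tmp/build.log") from the package
directory would then leave a log you can post. Syncing after every line is
slow, but on a machine that may fall over at any moment it seems like the
right trade-off.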

>nearby computer and reboot. Inevitably all the partitions are trashed 
>and the one i was working on is particularly smashed. Mercifully once 
>all the fs checks are complete and i 'fsck_ffs /dev/rwd1a' the really
>hosed disc i can reboot and it's life as usual.
>
>I originally had four suspects for this odd behavior.
>
> Suspect A: M$ telnet client

I wouldn't think that the telnet client should cause such a thing, but you
could download a 3rd-party telnet client and see if you get the same
behavior (try tucows.com and/or freshmeat.net).

> Suspect B: the recently added core
> Suspect C: the recently added harddisc

I'd lump these under "Hardware problem", and add to it "Possibly abused
motherboard"; it was probably tossed rather roughly by whoever junked it.
Old memory and HD are always suspect.

> Suspect D: a subtle bug somewhere in the kernel related to my setup

I couldn't speculate here.

>However, when i used the commandeered monitor and typed keystroke for 
>keystroke directly into the box, the same thing happened. I got a lot 
>farther while physically on the box. The screen reported 'Memory Fault 
>(core dumped)' and gave me a new prompt. No crash. No flames. No smoke.
>Just 'Next command please...'

Now I'm confused. If "the same thing happened", then the box core dumped,
died, and you had to reboot it. Also, if "the same thing happened", then it
died at exactly the same keystroke/compile/whatever that it did the first
time. What does "a lot farther" really mean? The compile went further? It
completed and you were able to move on to another set of operations? Also,
was there any other info besides "Memory Fault (core dumped)"?

>Does anyone have any suggestions on how i can track down the source of 

There's a program called memtest86 (www.memtest86.com) that stress-tests
memory. I'm sure others can tell of similar HD or general hardware tests.

>this odd behavior? Can anyone explain why the same exact procedure via 
>telnet crashes the box when working directly on the box has no such effect?

From your account, I'm not convinced the box crashed the first time (you
weren't specific enough with your report; i.e. no screen dump / exact error
messages). It sounds like what might have happened is the problem hosed your
telnet session (maybe telnetd, too) and you then powered the box off. Also,
you did get an error while on the box directly (during a compile?). Did you
redo the steps a number of times (at least 3) to see if you got a similar
error, or if the box would crash (rather than just give a Memory Fault
error) on a re-attempt?
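
If it helps, the repeat-and-compare idea can be sketched roughly like this.
It again assumes Python is available on the box; describe and repeat are
hypothetical names for the example. "Memory Fault" is most likely how the
shell reports a segmentation fault (SIGSEGV), and on POSIX systems a negative
return code from subprocess means the command was killed by that signal.
Note that if you run make, you will normally only see make's own exit status,
not the signal that killed the compiler underneath it.

```python
import signal
import subprocess


def describe(returncode):
    """Turn a subprocess return code into a short readable string."""
    if returncode < 0:
        # On POSIX, -N means the child was terminated by signal N.
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            return "killed by signal %d" % -returncode
    return "exit status %d" % returncode


def repeat(cmd, times=3):
    """Run cmd `times` times and return one outcome string per run."""
    return [describe(subprocess.run(cmd).returncode) for _ in range(times)]
```

If the three outcomes differ from run to run, or the failure moves to a
different file each time, that would point more toward flaky hardware than
toward a bug in the code being compiled.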


Note that I probably won't be able to offer much direct advice on how to
fix or even troubleshoot things; I'm just trying to help you get help ;)

Good luck.

-Tim