Subject: Re: strange coredump during telnet compile (long)
To: Timothy A. Musson <timothy.musson@zin-tech.com>
From: Mykal Funk <mykal@sccoast.net>
List: netbsd-help
Date: 01/27/2004 18:56:19
On Tuesday, January 27, 2004 16:42 EST, "Timothy A. Musson"
<timothy.musson@zin-tech.com> wrote:

> At 03:59 PM 1/27/04 , Mykal Funk wrote:

> >Since i don't have an extra monitor for the box, i access it via
> >telnet from my MegaKludge95 box. [I won't bore you with the long drawn
> >out reasons it takes forever to unlock yourself from the clutches of
> >the Evil Empire once you have become a Convert of the Unix Way. ;)] I
> >find that this set up allows me to transfer my data from proprietary
> >binary data formats to Unix by using the "cut and paste" method of data
> >conversion via the telnet window.
>
> Since you brought this up: Are these M$ Office documents? If so, try
> OpenOffice.org.

Yes and No. Some of the data is in M$ Office, some in other oddball formats.
As for X... my tastes run to older hardware and command line interfaces.

> >This long explanation is needed to explain the background of my problem.
>
> Not really. Many people on this list remote-admin machines or have headless
> machines at home. :)

Being new to the Unix Community, i am a little unfamiliar with such common
things as 'headless machines.' Let's put it metaphorically. I'm an ex-Muggle
who just found out he is really a Wizard. (think Harry Potter) :) I'm still
learning what 'normal' is in the brave new world.

> >When i compile code from pkgsrc, about midway through the process, the
>
> What code?

anything... i was working with pine and fetchmail, in particular

> >machine dumps core and dies. I have to snag a monitor cable from a
>
> And you say this why? Did you see the core dump and relating info before
> you lost your telnet connection, or did you just get disconnected? Did it
> not respond to another telnet attempt? If you saw a core dump, you should
> capture it and post it.
>
> >nearby computer and reboot. Inevitable all the partitions are trashed
> >and the one i was working on is particularly smashed. Mercifully once
> >all the fs checks are complete and i 'fsck_ffs /dev/rwd1a' the really
> >hosed disc i can reboot and its life as usual.

If i'm telnetting, the session simply stops responding. No error messages.
The machine refuses to respond to a request for another session. When i
go to the box and physically look at what it is printing to the screen,
i find that it has rebooted itself. The machine has already dropped into
single user mode with warning messages scattered through the boot messages.
I run the requested command, reboot back into multiuser mode and everything
is fine. I admit that i am unsure how to capture what has already been
printed to the screen after the dmesg output but before the login prompt.

> >I originally had four suspects for this odd behavior.
> >
> > Suspect A: M$ telnet client
>
> I wouldn't think that the telnet client should cause such a thing, but you
> could download a 3rd-party telnet client and see if you get the same
> behavior (try tucows.com and/or freshmeat.net).

I'll download one and see if i can replicate the crash.

> > Suspect B: the recently added core
> > Suspect C: the recently added harddisc
>
> I'd lump these under "Hardware problem", and add to it "Possibly abused
> motherboard"; it was probably tossed rather roughly by whoever junked it.
> Old memory and HD are always suspect.
>
> > Suspect D: a subtle bug somewhere in the kernel related to my setup
>
> I couldn't speculate here.

Suspect B+C: the motherboard from the junkyard system would not power on.
It was a Cyrix x686 anyhow. The core and harddisc were salvaged from that
system and added to a preexisting box as an 'upgrade'. And it appeared
to have been set down rather gently.

> >However, when i used the commandeered monitor and typed keystroke for
> >keystroke directly into the box, the same thing happened. I got alot
> >farther while physically on the box. The screen reported 'Memory Fault
> >(core dumped)' and gave me a new prompt. No crash. No flames. No smoke.
> >Just 'Next command please..."
>
> Now I'm confused. If "the same thing happened", then the box core dumped,
> died, and you had to reboot it. Also, if "the same thing happened", then it
> died at exactly the same keystroke/compile/whatever that it did the first
> time. What does "alot further" really mean? The compile went further? It
> completed and you were able to move on to another set of operations? Also,
> was there any other info besides "Memory Fault (core dumped)"?

After crashing the box via telnet three or four times, i entered the same
commands that caused the telnet crash directly into the system. The
compile of fetchmail proceeded further than when telnet stopped responding.
Then the system printed 'Memory Fault (core dumped)'. Nothing else. I
looked around the system for a file named *.core, as i had found on a
previous occasion. Perhaps i am just too inexperienced to properly capture
a core dump.
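
A minimal sketch of such a search, assuming the core file follows the
NetBSD default naming of <program>.core and was written somewhere under
the directory the build ran in. The function name find_core_files and
the starting directory are hypothetical, chosen only for illustration.

```python
import fnmatch
import os


def find_core_files(root, pattern="*.core"):
    """Return a sorted list of paths under root whose file name matches pattern."""
    matches = []
    # onerror swallows permission errors so one unreadable directory
    # does not abort the whole walk.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    import sys

    # Start from the directory given on the command line, else the current one.
    start = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in find_core_files(start):
        print(path)
```

Run from the top of the pkgsrc tree (or with that path as the argument),
it prints every matching file; an empty result suggests the core was
written elsewhere or not at all.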

> >Does anyone have an suggestions on how i can track down the source of
>
> There's a program called memtest86 (www.memtest86.com) that stress-tests
> memory. I'm sure others can tell of similar HD or general hardware tests.
>
> >this odd behavior? Can anyone explain why the same exact procedure via
> >telnet crashes the box when working directly on the box has no such effect?

I've downloaded the software, but i have to wait until i've sent this
message so i can build it.

> From your account, I'm not convinced the box crashed the first time (you
> weren't specific enough with your report; i.e. no screen dump / exact error
> messages). It sounds like what might of happened is the problem hosed your
> telnet session (maybe telnetd, too) and you then powered the box off. Also,
> you did get an error while on the box directly (during a compile?). Did you
> redo the steps a number of times (at least 3) to see if you got a similar
> error, or if the box would crash (rather than just give a Memory Fault
> error) on a re-attempt?

ooooh noooo... As soon as i had confirmed that telnet 95 wasn't responding,
i went to see what the box was doing. I put a head on the machine.

> Note that I probably won't be able to offer much direct advice on how to
> fix or even troubleshoot things; I'm just trying to help you get help ;)
>
> Good luck.
>
> -Tim

Tim, you've been a great help. As i have written this reply, i have been
able to more clearly define the problem. The result is that i have learned
several very good troubleshooting techniques and discovered the problem.
The telnet client that shipped with bill95 is buggy. Somehow it causes
telnetd to choke while under the heavy load of compiling software. I
think the best and easiest solution is to chuck that telnet client and
use another one until i am ready to chuck this OS.

Thanks,
mykalFunk