Port-xen archive


Re: Crashes with large SuperMicro based server.



Greg,

gdt%ir.bbn.com@localhost:
> Good, that is indeed what you should have verified first before xen.

I did. Before Xen. :-)

> Presumably you are using amd64.

I am indeed.

> It seems there is a bug in the kernel someplace, probably in a driver
> (because it works fine for most other people), that is somehow tickled
> with xen and your hardware.

Tobias Nygren (in a private mail) seems to have nailed it down. He
hinted that there might be a problem with the Areca RAID controller not
being recognized properly, and since I don't use it, I unplugged it from
the PCI bus, and the machine came up - WITH Xen. Tobias also proposed a
patch, which I haven't had time to install yet, but I'll try to build a
kernel with the patch later in the week.

> Suggestions more or less in order of increasing difficulty:

Thanks for all of these!

> 1) Look up the program counter above (rip) in the kernel binary.  One
> way is to run gdb and then "disass 0xffffffff80540792" to find the
> function it's in.

Ack.
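
For a kernel without debug info at hand, the lookup in suggestion 1 can
also be approximated from a sorted symbol table. The following is a
minimal sketch, assuming the symbol list comes from `nm -n` run on the
kernel binary; the symbol names in SAMPLE are hypothetical placeholders,
not real NetBSD functions, and only the address is taken from the thread.

```python
import bisect

def parse_nm(text):
    """Parse `nm -n`-style lines ("<hex addr> <type> <name>") into
    a sorted list of (address, name) pairs for text symbols only."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip undefined symbols and malformed lines
        addr, kind, name = parts
        if kind.lower() != "t":
            continue  # keep only code (text section) symbols
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols

def lookup(symbols, pc):
    """Return (name, offset) of the last symbol starting at or below pc,
    or None if pc lies below every known symbol."""
    addrs = [addr for addr, _ in symbols]
    i = bisect.bisect_right(addrs, pc) - 1
    if i < 0:
        return None
    addr, name = symbols[i]
    return name, pc - addr

# Hypothetical symbol table; the names are made up for illustration.
SAMPLE = """\
ffffffff80540700 T hypothetical_probe
ffffffff80540800 T hypothetical_attach
ffffffff80541000 D hypothetical_data
"""

if __name__ == "__main__":
    hit = lookup(parse_nm(SAMPLE), 0xffffffff80540792)
    if hit is not None:
        print("%s+0x%x" % hit)
```

The nearest-symbol rule only says which function starts at or below the
address; it cannot tell whether the address is past the end of that
function, so the result should be cross-checked in gdb (for example with
"info symbol" or the "disass" command quoted above).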

> 2) photograph/video the boot screen to find the netbsd kernel messages
> preceding the hang.  The key point is to know where in the boot sequence
> it is.  Both the driver that printed the last line that came out and the
> driver after that are suspect (compare to non-xen boot).

(I actually tried that, but my handy camera proved broken, and my phone
wasn't good enough to catch the rapid scrolling on the screen.)

> 3) Use a serial console and capture the output from xen and netbsd
> before the crash.

That would have been my next step, but it would require a lot of
fiddling ...

> 4) Figure out how to do remote gdb on the netbsd kernel.  I am not sure
> how to do this in xen.

A good challenge! :-)

>> Are there any limitations I should know about (# of cores, max mem)?

> Not that I know of (that you're close to; if you had 256 cores and 1024G
> of RAM I would not be sure).

Only in my dreams ... :-)

>> Are there any BIOS settings that I need to check? (CPU flags?)

> I would try disabling SMP in the bios, so that you boot with one core.
> Probably that's not it, but it's easy to try.

I'll save that one for the future.

>> Are there any combos of hypervisor and kernel that are less or more
>> likely to work?

> Hard to say, but xen41 and xen45 are good versions to try.  I would
> suggest trying to boot a netbsd-6 DOM0 kernel also.  I don't think it's
> likely to work better, but it's an easy test.

I'll save that one too.

>> Are there BIOS devices that can get in the way and should be
>> removed/disabled? (USB, COM, IPMI ...)?

> Not really, but you could try to turn off everything that isn't
> necessary.  IPMI I would leave on.

Ack.

>> Should I look at the PCI bus? (RAID board)?

> It's unlikely that an unrecognized PCI device would cause trouble.
> (Note that I am saying "unlikely"; there are more or less no
> certainties.)

It does indeed seem to be the case, though. But I fully agree. It's
a surprise to me too!

> I am unfamiliar with bootscrub; try without.

(Disabling it saves time on large-mem machines when you boot. I've used
it successfully on Xen/Debian.)

> I don't see anything scary in your non-xen dmesg.

Again, thanks for all good hints!

				Cheers,
				  /Liman

