Port-xen archive

Re: Crashes with large SuperMicro based server.



Lars-Johan Liman <liman%cafax.se@localhost> writes:

> I'm trying to run a NetBSD based Xen server on a 3-year-old server with
> a SuperMicro X8DTU/X8DTU-F motherboard with 2x Intel(R) Xeon(R) CPU
> E5620 @ 2.40GHz (in total 16 cores), and 24 GB of RAM. It also sports an
> Areca RAID board, which isn't recognized, but that's a different
> story. It boots both 6.1.5 and 7.0_BETA (GENERIC) nicely (for some value
> thereof ...), as long as there's no Xen involved, and I use the on-board
> (non-RAID) disk controller.

Good; checking that it boots without Xen is indeed the right first step.

Presumably you are using amd64.

> If I try to put a Xen hypervisor in there, the hypervisor itself seems
> to boot nicely, but the following netbsd-XEN3_DOM0 will either crash
> with a stacktrace, or just hang and go blank. I've tried both xen41 and
> xen45. No cigar.
>
> When using the xen41 hypervisor with the 7.0_BETA dom0 kernel, it
> repeats the following message in a burst, then hangs for a few seconds,
> then blanks the screen and goes catatonic, with the fans revving up to
> max.
>
> fatal page fault in supervisor mode
> trap type 6 code 2 rip ffffffff80540792 cs e030 rflags 10246 cr2 0 ilevel 8 rsp
> ffffffff81048c30
> curlwp 0xffffffff80c3c420 pid 0.1 lowest kstack 0xffffffff810482c0
> kernel: page fault trap, code=0

It seems there is a bug somewhere in the kernel, probably in a driver
(because it works fine for most other people), that is somehow tickled
by the combination of xen and your hardware.

Suggestions more or less in order of increasing difficulty:

1) Look up the program counter above (rip) in the kernel binary.  One
way is to run gdb on the dom0 kernel and then "disass 0xffffffff80540792"
to find the function it's in.

2) Photograph or video the boot screen to capture the netbsd kernel
messages preceding the hang.  The key point is to know how far the boot
sequence got.  Both the driver that printed the last line that came out
and the driver that would come after it are suspect (compare to the
non-xen boot).

3) Use a serial console and capture the output from xen and netbsd
before the crash.

4) Figure out how to do remote gdb on the netbsd kernel.  I am not sure
how to do this in xen.
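
For (1), if gdb is not handy, the same lookup can be done against a
sorted symbol table, such as the output of "nm -n" on the dom0 kernel
(this assumes the kernel has not been stripped).  A minimal sketch in
Python; the symbol names and addresses in SAMPLE are made up for
illustration and are not from a real kernel:

```python
import bisect

def parse_nm(text):
    """Parse `nm -n` style lines ("address type name") into a sorted list."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip undefined symbols and blank lines
        addr, _kind, name = parts
        try:
            symbols.append((int(addr, 16), name))
        except ValueError:
            continue  # first field was not a hex address
    symbols.sort()
    return symbols

def find_symbol(symbols, pc):
    """Return (name, offset) for the last symbol at or below pc, else None."""
    addrs = [addr for addr, _name in symbols]
    i = bisect.bisect_right(addrs, pc) - 1
    if i < 0:
        return None
    addr, name = symbols[i]
    return name, pc - addr

# Hypothetical symbol table; real input would come from something like
#   nm -n netbsd-XEN3_DOM0
SAMPLE = """\
ffffffff80540700 T example_driver_attach
ffffffff80540780 T example_driver_intr
ffffffff80540800 T example_driver_detach
"""

if __name__ == "__main__":
    hit = find_symbol(parse_nm(SAMPLE), 0xffffffff80540792)
    if hit is not None:
        name, offset = hit
        print(f"{name}+0x{offset:x}")
```

The result is only the nearest preceding symbol, so a large offset
suggests the address lies in a static function or outside the table.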

> Are there any limitations I should know about (# of cores, max mem)?

None that I know of that you are close to.  (If you had 256 cores and
1024G of RAM I would be less sure.)

> Are there any BIOS settings that I need to check? (CPU flags?)

I would try disabling SMP in the BIOS, so that you boot with one core.
Probably that's not it, but it's easy to try.

> Are there any combos of hypervisor and kernel that are less or more
> likely to work?

Hard to say, but xen41 and xen45 are good versions to try.  I would
suggest trying to boot a netbsd-6 DOM0 kernel also.  I don't think it's
likely to work better, but it's an easy test.

> Are there BIOS devices that can get in the way and should be
> removed/disabled? (USB, COM, IPMI ...)?

Not really, but you could try to turn off everything that isn't
necessary.  IPMI I would leave on.

> Should I look at the PCI bus? (RAID board)?

It's unlikely that an unrecognized PCI device would cause trouble.
(Note that I am saying "unlikely"; there are more or less no
certainties.)

> Are there any settings I need add/remove/change in /boot.cfg? My current
> attempt looks like this (all on one line):
>
> menu=Xen:load /netbsd-XEN3_DOM0 console=pc;multiboot /xen.gz
>  dom0_mem=512M,max:512M dom0_max_vcpus=1 dom0_vcpus_pin=true
>  bootscrub=false

I am unfamiliar with bootscrub; try without.

I don't see anything scary in your non-xen dmesg.
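
Once you have the xen boot messages from suggestion 2 or 3, the
comparison against the non-xen dmesg can be done roughly like this.  It
is only a sketch: it assumes autoconfiguration lines of the usual
"dev0 at parent0 ..." form, the sample logs are invented, and a dom0
boot attaches some devices in a different order, so treat the result as
a hint rather than an answer.

```python
import re

# Matches the start of an attach line such as "wm0 at pci1 dev 0 function 0".
ATTACH = re.compile(r"^([a-z]+\d+) at ")

def attached(log):
    """Return device names in the order they first attach in a boot log."""
    devices = []
    for line in log.splitlines():
        match = ATTACH.match(line)
        if match and match.group(1) not in devices:
            devices.append(match.group(1))
    return devices

def suspects(xen_log, plain_log):
    """Return (last device seen under xen, next device in the non-xen boot)."""
    xen = attached(xen_log)
    plain = attached(plain_log)
    if not xen:
        first = plain[0] if plain else None
        return None, first
    last = xen[-1]
    if last not in plain:
        return last, None
    i = plain.index(last)
    following = plain[i + 1] if i + 1 < len(plain) else None
    return last, following

# Invented logs for illustration only.
XEN_LOG = """\
pci0 at mainbus0 bus 0
ppb0 at pci0 dev 1 function 0: example bridge
"""
PLAIN_LOG = """\
pci0 at mainbus0 bus 0
ppb0 at pci0 dev 1 function 0: example bridge
wm0 at pci1 dev 0 function 0: example ethernet
"""

if __name__ == "__main__":
    print(suspects(XEN_LOG, PLAIN_LOG))
```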
