Xen boot strangeness (Was: Re: [SOLVED] Re: Xen 4.18.5_20250521nb0 not ELF binary (Was: Re: EFI and Xen))

To: NetBSD Users Discussion List <netbsd-users%netbsd.org@localhost>
Subject: Xen boot strangeness (Was: Re: [SOLVED] Re: Xen 4.18.5_20250521nb0 not ELF binary (Was: Re: EFI and Xen))
From: Chuck Zmudzinski <frchuckz%gmail.com@localhost>
Date: Thu, 29 May 2025 15:01:50 -0400

On 5/27/2025 7:37 PM, Greg A. Woods wrote:
> At Tue, 27 May 2025 12:39:43 -0400, Chuck Zmudzinski <frchuckz%gmail.com@localhost> wrote:
> Subject: [SOLVED] Re: Xen 4.18.5_20250521nb0 not ELF binary (Was: Re: EFI and Xen)
>>
>> It is definitely not production ready, but I got it to work with the following
>> tweaks and hacks.
>>
>> boot command used:
>>
>> menu=Boot normally with Xen:dev hd2d:;load /netbsd-XEN3_DOM0.gz -c console=xencons bootdev=wd1;multiboot /xen.gz dom0_mem=2G dom0_max_vcpus=4 com2=9600,8n1,0x40c0,16,1:0.0 console=com2 cet=no-ibt pv-l1tf=false
>>
>> I also needed to pass -c to the NetBSD dom0 kernel so I could disable com*
>> interactively using userconf at boot time. Without doing this, the NetBSD dom0
>> panics when using the serial console for Xen. I could not get the kernel to
>> invoke userconf to disable com* by any setting in boot.cfg; it was necessary
>> to pass -c and disable com* interactively at boot time.
> 
> That's very strange.

It is very strange! I did some tests to document the "strangeness".

I have userconf=disable com* at the bottom of boot.cfg.

When I boot normally without Xen, the bootloader executes the
userconf statement in boot.cfg and disables com* as expected:

netbsd# dmesg | grep com2
netbsd# dmesg | grep com
[     1.000000]         mkrepro%mkrepro.NetBSD.org@localhost:/usr/src/sys/arch/amd64/compile/GENERIC
[     1.000000] [   139.000000] com* disabled
[     1.000000] [   140.000000] com* disabled
[     1.000000] [   141.000000] com* disabled
[     1.000000] [   142.000000] com* disabled
[     1.000000] [   143.000000] com* disabled
[     1.000000] [   144.000000] com* disabled
[     1.000000] [   145.000000] com* disabled
[     1.051210] puc0 at pci1 dev 0 function 0: Nanjing QinHeng Electronics CH382 (com, com)
[     1.051210] com at puc0 port 0 not configured
[     1.051210] com at puc0 port 1 not configured
[     1.051210] com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, 16-byte FIFO
netbsd#

Note this is the GENERIC kernel booted without Xen so it does detect com0 which is
a separate config option from com* in the GENERIC kernel.

Also, the 'com at puc0 port 0 not configured' and 'com at puc0 port 1 not configured'
lines correspond to the com2 and com3 ports on the PCI card that are disabled by userconf.

Note also that the com0 port is not an ordinary com port. I think it is an AMT
com port but my BIOS lacks AMT support to access it. There is no ordinary com port
header to connect ordinary com ports on this motherboard. That is why I installed
the PCI serial card with two com ports that show up as com2 and com3 in NetBSD.
I am using com2 as detected by NetBSD to debug the boot. It is also com2 as detected
by Xen. I presume com1 as detected by Xen is the AMT serial port that NetBSD
detects as com0.

Now the strangeness begins...

Then I tried booting with Xen, keeping the userconf=disable com* statement
in place at the bottom of boot.cfg. The bootloader did not execute the
userconf=disable com* command and I got a crash as a result:

[   1.0000030] com2 at puc0 port 0 (16850-compatible): panic: Failed to bind physical IRQ 16
[   1.0000030] cpu0: Begin traceback...
[   1.0000030] vpanic() at netbsd:vpanic+0x177
[   1.0000030] panic() at netbsd:panic+0x3c
[   1.0000030] bind_pirq_to_evtch() at netbsd:bind_pirq_to_evtch+0xa8

... lots of traceback messages and then drop to debugger

netbsd:breakpoint+0x5:  leave
db{0}>
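
To compare the two boots side by side, a captured console log can be checked for the two markers seen above: the "com* disabled" messages that show the userconf command was applied, and any numbered com unit that went on to attach. This is only a minimal sketch; the function names are hypothetical and the sample text is copied from the logs quoted in this message.

```python
import re

# Lines copied from the non-Xen boot (userconf applied) and from the
# Xen dom0 boot (userconf not applied) quoted in this message.
NON_XEN_LOG = "\n".join([
    "[     1.000000] [   139.000000] com* disabled",
    "[     1.051210] com at puc0 port 0 not configured",
    "[     1.051210] com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, 16-byte FIFO",
])
XEN_LOG = (
    "[   1.0000030] com2 at puc0 port 0 (16850-compatible): "
    "panic: Failed to bind physical IRQ 16"
)

# A numbered unit such as "com0 at isa0" or "com2 at puc0" right after the
# timestamp means the driver attached; "com at puc0 ... not configured"
# has no unit number and is deliberately not matched.
ATTACHED_RE = re.compile(r"^\[[^\]]*\]\s+(com\d+) at ", re.MULTILINE)


def com_disabled_by_userconf(console_text):
    """Return True if any line reports that userconf disabled com*."""
    return any("com* disabled" in line for line in console_text.splitlines())


def attached_com_units(console_text):
    """Return the numbered com units that attached, in log order."""
    return ATTACHED_RE.findall(console_text)


if __name__ == "__main__":
    for label, log in (("without Xen", NON_XEN_LOG), ("with Xen", XEN_LOG)):
        print(label, com_disabled_by_userconf(log), attached_com_units(log))
```

On the non-Xen sample only com0 attaches, because it is a separate config entry from com*. On the Xen sample com2 attaches, which is the port Xen is already using as its console.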

> 
> Perhaps it has something to do with the fact you're using what Xen calls
> "com2" for the serial console.

Yes, that is exactly the problem, and since DOM0 cannot detect which port Xen
is using, our DOM0 config just assumes it is always going to be com0, but in
my case I cannot use com0 as the Xen console so I am using what NetBSD sees as
com2 instead.

> 
> Normally when a COM port is used by Xen for the serial console then it
> won't even be seen by the probe in NetBSD.

Yes, but only if Xen is using com0...

Quoting from the XEN3_DOM0 config file:

# If a com port is used as Xen console it can't be used by the domain0 kernel
# and there's no easy way to detect this yet. Leave com0 out as it's the
# port usually used for the serial console
#com0

end of quote...

So the DOM0 config needs to be smarter about figuring out which com port Xen is
using as the serial console, but the comment in the DOM0 config suggests this is
not easy. For now the workaround is that if Xen uses a port other than com0,
that port, rather than com0, must be disabled when booting DOM0.

So this behavior of DOM0 crashing when com* is not disabled is expected because
I am using a port other than com0 for the Xen serial console.

The strange part is that it is necessary to pass -c to the DOM0 kernel in boot.cfg
and disable com* interactively, because the bootloader does not execute the
userconf=disable com* command that is present in boot.cfg when booting the
NetBSD/xen DOM0. The userconf=disable com* setting works for the boot without
Xen, but with Xen the bootloader ignores that setting in boot.cfg.

> 
> However it doesn't look like you're using an old-fashioned "standard"
> COM port.  According to the "Xen Serial Console" notes you should
> probably be telling Xen to use "com1", not "com2":
> 
> 	Xen com1= option for non-standard serial ports (IPMI SOL, Intel AMT, PCI serial)
> 
> 	Note that even if your SOL device is, for example, COM3, you
> 	still need to specify "com1=<foo> console=com1" options for Xen.
> 	If you specify "com3=" the serial console won't work!  Remember
> 	to list the correct (actual) serial port IOport and IRQ in the
> 	Xen "com1=" parameters!
> 
> 	https://wiki.xenproject.org/wiki/Xen_Serial_Console

I saw that but...

As I explained earlier, I think Xen sees an AMT com device as com1 (NetBSD
sees it as com0), but I cannot access it because my BIOS does not have the
AMT feature. Therefore I have no choice but to use com2 or com3 which are
connected to the PCI serial card.

> 
>> I also needed to interactively set the root device because no bootdev
>> setting in boot.cfg allowed the NetBSD dom0 kernel to correctly detect
>> the root device.
> 
>> 2. I tried passing the bootdev to the NetBSD kernel as wd1, dk12,
>>    and NAME=<UUID> but it never worked. However, I was able to
>>    interactively set it at boot time:
> 
> That's also very strange.  (note "wd1" would probably never be correct
> given how it appears your disks are partitioned -- you need "dk12")

Yes this is very strange indeed! Here are more details...

Actually, dk12 is the fourth partition on wd1 (using GPT partitioning). When
I pass wd1 as bootdev in boot.cfg the bootloader at least dropped me to a
prompt and allowed me to enter dk12 as the boot device. In this case the
bootloader thinks wd1a is the root device and wd1b is the dump device as
shown in this quote from my earlier message:

> [     5.159642] boot device: wd1
> [     5.159642] root on wd1a dumps on wd1b

but when the bootloader realizes this is wrong it drops me to the prompt where
I could enter dk12 as the root device, dk11 as the dump device, and then enter
the default choices for the filesystem type and init and get a successful boot.
I expect if I add a disklabel to wd1 and set wd1a as the root partition and
wd1b as the swap/dump partition, it would work with the wd1 setting for bootdev
in boot.cfg, but I have not verified that yet. This is not really so strange,
and I probably just need to add the correct NetBSD disklabel to wd1 to fix it.

Here is the strange part:

When I pass bootdev=dk12 in boot.cfg, the bootloader strangely tries dk1 as root
(which is wrong) and correctly detects dk11 as the dump device. But it never
gives me the chance to enter the correct root device; instead it tries to load
init, which of course fails because dk1 is not the NetBSD root device. In fact,
on this box a Linux distro is installed on dk1, as evidenced by the filesystem
type detected on dk1: ext2fs.

I tested this twice - it is not a typo. I double-checked that I did have
bootdev=dk12, not bootdev=dk1 in boot.cfg:

[   5.1699079] boot device: dk1
[   5.1699079] root on dk1 dumps on dk11
[   5.1799090] Your machine does not initialize mem_clusters; sparse_dumps disabled
[   5.1799090] root file system type: ext2fs
[   5.1799090] kern.module.path=/stand/amd64/10.1/modules
[   5.1826204] exec /sbin/init: error 8
[   5.1826204] init: trying /sbin/oinit
[   5.1826204] exec /sbin/oinit: error 2
[   5.1826204] init: trying /sbin/init.bak
[   5.1826204] exec /sbin/init.bak: error 2
[   5.1826204] init: trying /rescue/init
[   5.1826204] exec /rescue/init: error 2
[   5.1826204] init path (default /sbin/init):

Here is where it stops to ask me for init, but it never gave me
the chance to enter the correct root device of dk12. Of course it
won't boot a Linux /sbin/init which in this case is a systemd
init image from a Linux distro.
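
One way a "dk12 becomes dk1" symptom could arise is a device-name comparison that only checks as many characters as the candidate device name has, so the first wedge whose name is a prefix of the requested one wins. This is purely an assumption for illustration; it has not been checked against the NetBSD source and nothing in this thread confirms it. The sketch below uses hypothetical function names and simply contrasts such a prefix comparison with an exact one over the wedges dk0 through dk12 present on this box.

```python
from typing import Optional, Sequence

# Wedge names dk0 .. dk12, in the order the kernel would walk them
# (the ordering is also an assumption of this sketch).
DEVICES = ["dk%d" % i for i in range(13)]


def prefix_match(bootdev: str, device_names: Sequence[str]) -> Optional[str]:
    """Return the first device whose whole name is a prefix of bootdev.

    Mimics a C-style test such as
        strncmp(bootdev, name, strlen(name)) == 0
    which is a HYPOTHETICAL cause, not a confirmed one.
    """
    for name in device_names:
        if bootdev.startswith(name):
            return name
    return None


def exact_match(bootdev: str, device_names: Sequence[str]) -> Optional[str]:
    """Return the device whose name equals bootdev exactly."""
    for name in device_names:
        if bootdev == name:
            return name
    return None


if __name__ == "__main__":
    print("prefix:", prefix_match("dk12", DEVICES))
    print("exact: ", exact_match("dk12", DEVICES))
```

With the prefix comparison, a request for dk12 stops at dk1, which is the behavior seen in the log above. The exact comparison returns dk12. Whether the kernel or bootloader really does something like this would have to be confirmed in the source.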

> 
> What devices does it suggest if you type a "?" at the "root device"
> prompt?

[   5.1706012] root device (default wd1a): ?
[  44.8006002] use one of: dk0 dk1 dk2 dk3 dk4 dk5 dk6 dk7 dk8 dk9 dk10 dk11 dk12 rge0 athn0 wd0[a-p] wd1[a-p] ld0[a-p] cd0[a-p] wedge:<UUID or Label> ... ddb halt reboot

I redacted the UUIDs and labels from the output. I did not try the
wedge:<UUID or Label> syntax in boot.cfg, since I did not see it documented
anywhere. The documentation in the NetBSD Guide, the wikis, etc., always uses
the NAME=UUID syntax. I tried both NAME=UUID and NAME="UUID" for the bootdev
setting in boot.cfg, but neither form worked.

Since it expects wd1a as root when I set bootdev=wd1 in boot.cfg, I think
it would detect root correctly if I had a NetBSD disklabel on wd1 with wd1a
as the ffs root partition and it would detect the dump device correctly if
I had wd1b as the dump device. But without a disklabel wd1a is just the whole wd1
disk, not the ffs partition with the NetBSD root filesystem, and wd1b is not
configured at all without an actual disklabel setup on wd1.

Actually, the bootloader does not always give me the chance to enter the
root device in every boot.cfg configuration. For example, I get the
"root device" prompt when I pass bootdev as wd1 in boot.cfg, so I can tell
it to use dk12 as root and dk11 as dump and it boots... but I don't get
the "root device" prompt when I pass bootdev as the correct root device,
dk12, in boot.cfg and it incorrectly detects dk1 as the root device and
correctly detects dk11 as the dump device, as shown in the above messages.

I agree that this behavior is very strange.

> 
> It seems like NetBSD/Xen almost never gets the "boot device" correct,
> but I've never seen the kernel reject/ignore what seems to be a correct
> "bootdev=" option, and then accept the very same name at the prompt.

Yes, that is what is happening. However, bootdev is not the same as rootdev.
According to the Xen Howto wiki, the bootdev option in boot.cfg was formerly
called root, but that has since been changed.

Quoting from the Xen Howto Wiki:

"bootdev" (or the earlier form "root") is also in general required, because
the boot device from /boot is not passed via Xen to the dom0 kernel.

End of Quote from Wiki.

Actually the confusion may be arising because in fact, the bootdev of the
system is where the EFI partition is. On this box, the EFI system partition is
on an nvme disk (ld0 with EFI system partition on dk0) but the NetBSD root
partition is on a SATA SSD disk (wd1). So the NetBSD root filesystem is not
on the same device that can truly be called the boot device. The bootloader
does not handle this situation very well when trying to boot NetBSD/xen dom0.

On this box I see the bootloader does expect root to be at wd1a and dump to be
at wd1b when I set bootdev=wd1 in boot.cfg, and this would be correct if wd1 had
a traditional NetBSD disklabel. So another way to describe the confusion is that
I have to set wd1 as the bootdev when in fact the bootdev is an nvme device
(ld0 as bootdev and dk0 as boot partition in this case), not wd1. BTW, I tried
bootdev=dk0 in boot.cfg but in that case the bootloader also fails to find the
correct root device and, just as with bootdev=dk12, the bootloader never gives
me the "root device" prompt.

So it appears the bootloader is very limited in the case of Xen and probably
only works well if the device that has NetBSD on it has a traditional BSD
disklabel that the bootloader understands or the partition scheme is simple
enough, such as when the boot device and the device that has the NetBSD root
partition on it are the same. This is not likely on a modern EFI system that
is set up to boot multiple operating systems.

Actually, I think the bootloader should have a rootdev option instead of or in
addition to a bootdev option, because the bootdev is not always the same as the
device where the NetBSD root partition is located, especially on EFI systems
where there is only one boot device but the NetBSD root partition should be
able to be found on any other device in the system.

> 
> --
> 					Greg A. Woods <gwoods%acm.org@localhost>
> 
> Kelowna, BC     +1 250 762-7675           RoboHack <woods%robohack.ca@localhost>
> Planix, Inc. <woods%planix.com@localhost>     Avoncote Farms <woods%avoncote.ca@localhost>
