tech-kern archive


Re: cpu1: failed to start (amd64)



It feels to me like you might be having two problems: SMP/cpu and USB.
I do not understand if they are related or not.

Assuming you have a working netbsd-7 machine (i386 is fine), it might be
best to build a netbsd-8 kernel and debug there, because 8 has many
fixes relative to 7.  You have built a kernel, but you may find
BUILD-NetBSD from pkgsrc/sysutils/etcmanage useful; that's my heavily
annotated invocation of build.sh.  Your choice of debug options sounds
good as a first step.

First, I suspect that "no ACPI" on a modern machine is just not going to
work, as that's how many things are configured.  My impression is that
disabling ACPI is appropriate on hardware that just barely has ACPI
support, and that support is buggy.  So I'm not going to address the
"no ACPI" case.

This is strange, because while I'm not familiar with that mobo model, it
and the CPU sound very normal.

I wonder if your motherboard's BIOS is up to date.  It might be that the
kernel is getting bad ACPI info.

With 4 cpus, there is something going wrong, and I haven't seen this
before.  You could look in the kernel sources for "failed to start" and
see if you can understand the code.  It may help to print out whatever
information is being used to try to start the other cpus, but I have no
idea what that is.

  db{0}> bt
  vmem_alloc() at netbsd:vmem_alloc+0x3f
  uvm_km_kmem_alloc() at netbsd:uvm_km_kmem_alloc+0x46
  kmem_intr_alloc() at netbsd:kmem_intr_alloc+0x6d
  kmem_intr_zalloc() at netbsd:kmem_intr_zalloc+0xf
  mpbios_scan() at netbsd:mpbios_scan+0x4cd
  mainbus_attach() at netbsd:mainbus_attach+0x2d0
  config_attach_loc() at netbsd:config_attach_loc+0x16e
  cpu_configure() at netbsd:cpu_configure+0x26
  main() at netbsd:main+0x2a3

This looks like mpbios_scan has asked to allocate memory in some
unreasonable or crazy amount.  Really that should not fault/panic, but
if you are able to read mpbios_scan (maybe even disassemble to find the
C line for 0x4cd) and add sanity checking before alloc, that might lead
to figuring it out.
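
As a rough illustration of the kind of sanity check meant here, this is
a standalone C sketch that validates an entry count before allocating.
It is not the mpbios_scan code: checked_table_alloc, MAX_TABLE_ENTRIES
and the use of calloc are all assumptions for illustration, and the real
code would use the kernel allocator and whatever limit is sane for the
table being read.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical upper bound; choose something sane for the table at hand. */
#define MAX_TABLE_ENTRIES 256

/*
 * Hypothetical helper: allocate zeroed space for `count` entries of
 * `entry_size` bytes, refusing counts that look like garbage read from
 * firmware.  Returns NULL rather than attempting an unreasonable
 * allocation.
 */
static void *
checked_table_alloc(size_t count, size_t entry_size)
{
	if (count == 0 || count > MAX_TABLE_ENTRIES)
		return NULL;
	if (entry_size == 0 || entry_size > SIZE_MAX / count)
		return NULL;	/* count * entry_size would overflow */
	return calloc(count, entry_size);
}
```

Printing the rejected count when the check fails would also show what
value the scan was actually working with.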

Interestingly the product id is different for all (so am guessing is all
on one chipset).

  The kernel boots if I remove
    uhci* at pci? dev ? function ?
  but then the USB drive is not detected and the boot device is not found.

It is expected that the drive is not found, but good that everything
else is ok.  Presumably there is no 30s delay in that case?

  The system has five uchi entries across two dev numbers. Enabling

Yes, I see

  uhci0 at pci0 dev 26 function 0: vendor 0x8086 product 0x2834 (rev. 0x02)
  uhci1 at pci0 dev 26 function 1: vendor 0x8086 product 0x2835 (rev. 0x02)
  ehci0 at pci0 dev 26 function 7: vendor 0x8086 product 0x283a (rev. 0x02)

  uhci2 at pci0 dev 29 function 0: vendor 0x8086 product 0x2830 (rev. 0x02)
  uhci3 at pci0 dev 29 function 1: vendor 0x8086 product 0x2831 (rev. 0x02)
  uhci4 at pci0 dev 29 function 2: vendor 0x8086 product 0x2832 (rev. 0x02)
  ehci1 at pci0 dev 29 function 7: vendor 0x8086 product 0x2836 (rev. 0x02)

Note that your system has ehci controllers, which are about USB2.  Your
flashdrive is on usb6 which is ehci1.  The way USB2 works is that the
ehci controllers have the ports, and USB1 (low/full speed) devices are
handed off to the uhci (or ohci on non-Intel) companion controllers.  So
I wonder if you disabled those, or if it's only the uhci one you
disabled.  But it looks like sd0 attaches in your posted dmesg.
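
The pairing of ehci and uhci controllers can be read off the attach
lines above.  A minimal C sketch, assuming (as is usual on Intel
chipsets, but an assumption here) that the uhci companions of an ehci
controller are the uhci functions on the same PCI device; struct
usb_ctlr, parse_attach_line and is_companion are hypothetical names,
not kernel interfaces:

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of the form "uhci0 at pci0 dev 26 function 0: ..." */
struct usb_ctlr {
	char name[16];
	int dev;
	int func;
};

/* Parse one attach line on pci0; returns 1 on success, 0 otherwise. */
static int
parse_attach_line(const char *line, struct usb_ctlr *c)
{
	return sscanf(line, "%15s at pci0 dev %d function %d",
	    c->name, &c->dev, &c->func) == 3;
}

/*
 * Assumed rule: `other` is a companion of `ehci` when it is a uhci
 * controller on the same PCI device number.
 */
static int
is_companion(const struct usb_ctlr *ehci, const struct usb_ctlr *other)
{
	return strncmp(ehci->name, "ehci", 4) == 0 &&
	    strncmp(other->name, "uhci", 4) == 0 &&
	    ehci->dev == other->dev;
}
```

Under that rule, ehci0 (dev 26) pairs with uhci0 and uhci1, and ehci1
(dev 29) pairs with uhci2, uhci3 and uhci4.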

  specific
  devices such as:
    uhci4 at pci0 dev 29 function 2
  allows the kernel to boot, but I have not yet got it to detect the USB
  drive in any combinations I have tried so far.

But in your posted dmesg the drive attaches?

  If I enable all five devices specifying dev and function numbers then it
  boots but pauses for a very long time (maybe 30+ seconds). I wondered if
  there USB retries/errors not being displayed so I turned on USBVERBOSE
  but saw no additional output.

So what you posted is with all 5 uhci lines, no uhci wildcard, and you
didn't change the ehci wildcard?


So it seems that something is matching the uhci driver, but when the
attach runs it is crashing, perhaps on some device which is somewhere
else.  You can use "pcictl pci0 list" and look up the ids for anything
odd, and then for the other buses.

And yes, a serial console would help, even if it is only usable while
booting (the BIOS may not really cope, but the kernel will use it if
you boot with consdev set).  If you can capture the output, you can add
debugging and maybe figure out more about what's wrong.  I suspect that
once you know exactly what's wrong, this is not too hard to fix: add
some sort of quirk to not believe something from ACPI, substitute
something sane, exclude some device id, etc.

With serial you can also set up kgdb, but I'm not sure how soon in boot
that is available relative to the crash.  This lets you run gdb on
another machine and debug the kernel remotely, with full source
listings.  But ddb is quite useful too.

The 30s delay could be a third thing wrong.



