NetBSD-Bugs archive


Re: kern/58775 (apei(4) spamming console)



> Date: Sun, 27 Oct 2024 00:13:15 +0200
> From: Hauke Fath <hf%spg.tu-darmstadt.de@localhost>
> 
> I guess I'll hook up the machine's ipmi console on Monday, and see what 
> that has to say.

I would be curious to see any details you can find there!

> > Can you revert the previous patch and try the attached patch instead,
> > which applies a rate limit to the console output?
> 
> Done, resulted in a much more reasonable message rate. Thanks!

Great, can you share the new dmesg output?

> In the general case, how would I map the "error source" on hardware?

Not sure there's a good general way to do this -- the error source
numbers correspond to SourceId numbers in acpidump.out, and you can
follow them to the Related SourceId numbers, but I'm not sure you get
much out of that.  E.g., hardware source 514 is a generic hardware
error source which maps to the related source:

	Type={PCI Express Endpoint AER}
	SourceId=257
	Flags={FIRMWARE_FIRST,GLOBAL}
	Enabled={ YES (ignored) }
	Number of Record to pre-allocate=1
	Max. Sections per Record=16
	Device Control=0x7
	Uncorrectable Error Mask Register=0x100000
	Uncorrectable Error Severity Register=0x7ef6030
	Correctable Error Mask Register=0x0
	Advanced Capabilities Register=0x0

Which doesn't really tell us much.

However, the log messages should show the PCI device identified in the
error record.  Something like this, in the new patch (now that I've
fixed the buffer sizing):

PCI 0000:81:00.000: hardware corrected error: 0x1<RECEIVER_ERROR> (mask=0x0)

This means segment 0, bus 0x81=129, device 0x00, and function 0, which
you can look up in dmesg:

[   1.0650718] pci8 at ppb5 bus 129
[   1.0650718] nvme0 at pci8 dev 0 function 0: Samsung Electronics (3rd vendor ID) PM9A1 M.2 NVMe SSD (rev. 0x00)

or with pcictl(8):

pcictl pci0 dump -b 0x81 -d 0 -f 0

That's how I identified it as your Samsung NVMe card -- specifically,
the first one, nvme0.  (That said, I don't know how to map that to
your physical motherboard layout.)
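
For anyone who wants to script the lookup above, here is a minimal
sketch in Python.  It assumes the log line has the shape shown above
and that every field of the address is printed in hexadecimal (the
message only confirms that for the bus number).  The function names
are made up for illustration; the only pcictl(8) arguments used are
the ones from the command above.

```python
import re

# Matches the address part of a line such as
#   PCI 0000:81:00.000: hardware corrected error: ...
# Assumption: all four fields are hexadecimal.
PCI_ADDR = re.compile(
    r"PCI ([0-9a-fA-F]+):([0-9a-fA-F]+):([0-9a-fA-F]+)\.([0-9a-fA-F]+):"
)


def parse_pci_address(line):
    """Return segment/bus/device/function as integers, or None."""
    m = PCI_ADDR.search(line)
    if m is None:
        return None
    segment, bus, device, function = (int(g, 16) for g in m.groups())
    return {
        "segment": segment,
        "bus": bus,
        "device": device,
        "function": function,
    }


def dmesg_hints(addr):
    """Strings to search for in dmesg, which prints the bus in decimal."""
    return (
        f"bus {addr['bus']}",
        f"dev {addr['device']} function {addr['function']}",
    )


def pcictl_command(addr, busdev="pci0"):
    """Build the pcictl(8) dump command for this address."""
    return (
        f"pcictl {busdev} dump -b {addr['bus']:#x} "
        f"-d {addr['device']} -f {addr['function']}"
    )
```

Feeding it the example line gives the hints "bus 129" and
"dev 0 function 0", which are the strings to look for in dmesg.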

It is also shown in the DeviceID={...} lines, in somewhat obscure hex
(https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#pci-express-error-section),
which is how I decoded it in spite of the broken format string in the
first draft of the patch.
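
As a rough illustration of that decoding, here is a sketch in Python
that unpacks the 16-byte Device ID field of the PCI Express error
section.  The field layout is my reading of the section linked above
(little-endian: vendor ID, device ID, 3-byte class code, function,
device, segment, bus, secondary bus, slot in bits 15:3, one reserved
byte), so check it against the spec before relying on it.  It takes
raw bytes and makes no assumption about how the DeviceID={...} lines
print them; the type and function names are invented for the example.

```python
import struct
from collections import namedtuple

CperPcieDeviceId = namedtuple(
    "CperPcieDeviceId",
    "vendor_id device_id class_code function device "
    "segment bus secondary_bus slot",
)

# Assumed layout, 16 bytes, little-endian, no padding:
#   H  vendor ID        H  device ID       3s class code
#   B  function number  B  device number   H  segment number
#   B  bus number       B  secondary bus   H  slot (bits 15:3)
#   x  reserved
_LAYOUT = struct.Struct("<HH3sBBHBBHx")


def decode_device_id(raw):
    """Decode the raw 16-byte Device ID field into named fields."""
    if len(raw) != _LAYOUT.size:
        raise ValueError(f"expected {_LAYOUT.size} bytes, got {len(raw)}")
    (vendor_id, device_id, class_code, function, device,
     segment, bus, secondary_bus, slot_raw) = _LAYOUT.unpack(raw)
    # The low 3 bits of the slot field are reserved.
    return CperPcieDeviceId(vendor_id, device_id, class_code, function,
                            device, segment, bus, secondary_bus,
                            slot_raw >> 3)
```

The segment, bus, device, and function fields are the same four
numbers that appear in the PCI address of the log message.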

