NMI: IOCK error

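After a second-hand HPE ProLiant DL360 Gen9 server failed and would no longer boot, I bought an HPE ProLiant DL380 Gen9 barebone system as a replacement. I suspected the fault was in the original DL360 Gen9 motherboard, so I bought only the barebone DL380 Gen9 and moved all of the CPUs and memory from the old server into it. As expected, the machine then powered on and worked.

The good times did not last, though. After only a few days of use, the server failed to come up after a reboot, and the console kept printing the following error:

Console output: NMI: IOCK error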
[ 4824.338877] NMI: IOCK error (debug interrupt?) for reason 75 on CPU 0.
[ 4826.357482] NMI: IOCK error (debug interrupt?) for reason 65 on CPU 0.
[ 4828.376653] NMI: IOCK error (debug interrupt?) for reason 65 on CPU 0.
[ 4830.393236] NMI: IOCK error (debug interrupt?) for reason 75 on CPU 0.
[ 4832.410729] NMI: IOCK error (debug interrupt?) for reason 75 on CPU 0.
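
The "reason" value in these messages is printed in hexadecimal and, as far as I understand it, is the status byte the kernel reads from the legacy NMI status/control port (0x61). The following is a minimal sketch for decoding it, assuming the conventional bit layout of that port; decode_nmi_reason and the bit labels are names I made up for illustration, not kernel identifiers.

```shell
#!/usr/bin/env bash
# decode_nmi_reason HEX: hypothetical helper that lists the bits set in the
# NMI reason byte, assuming the conventional port 0x61 bit layout.
decode_nmi_reason() {
    local value=$((16#$1))   # interpret the argument as hexadecimal
    local out=()
    (( value & 0x80 )) && out+=("SERR")            # bit 7: PCI SERR# NMI status
    (( value & 0x40 )) && out+=("IOCHK")           # bit 6: IOCHK# NMI status
    (( value & 0x20 )) && out+=("timer2-output")   # bit 5: timer counter 2 output
    (( value & 0x10 )) && out+=("refresh-toggle")  # bit 4: refresh cycle toggle
    (( value & 0x08 )) && out+=("IOCHK-disabled")  # bit 3: IOCHK# NMI disabled/cleared
    (( value & 0x04 )) && out+=("SERR-disabled")   # bit 2: SERR# NMI disabled/cleared
    (( value & 0x02 )) && out+=("speaker-enable")  # bit 1: speaker data enable
    (( value & 0x01 )) && out+=("timer2-gate")     # bit 0: timer counter 2 enable
    echo "${out[*]}"
}

# Decode the two values seen in the console output.
decode_nmi_reason 75
decode_nmi_reason 65
```

Under that assumption, both 75 and 65 have the IOCHK bit set and the SERR bit clear, and they differ only in bit 4 (0x10), the refresh toggle. So the alternating codes most likely describe the same I/O check condition rather than two different faults.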

Investigation

  • Checking the boot log with dmesg -T shows several questionable entries:

Some error messages from dmesg
...
[Sat Feb 22 19:48:45 2025] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (family: 0x6, model: 0x3f, stepping: 0x2)
[Sat Feb 22 19:48:45 2025] Performance Events: PEBS fmt2+, Haswell events, 16-deep LBR, full-width counters, Broken BIOS detected, complain to your hardware vendor.
[Sat Feb 22 19:48:45 2025] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
...
[Sat Feb 22 19:48:45 2025] PCI host bridge to bus 0000:7f
[Sat Feb 22 19:48:45 2025] pci_bus 0000:7f: Unknown NUMA node; performance will be reduced
[Sat Feb 22 19:48:45 2025] pci_bus 0000:7f: root bus resource [bus 7f]
[Sat Feb 22 19:48:45 2025] pci 0000:7f:08.0: [8086:2f80] type 00 class 0x088000 conventional PCI endpoint
[Sat Feb 22 19:48:45 2025] pci 0000:7f:08.3: [8086:2f83] type 00 class 0x088000 conventional PCI endpoint
...
[Sat Feb 22 19:48:48 2025] i8042: Can't read CTR while initializing i8042
[Sat Feb 22 19:48:48 2025] i8042 i8042: probe with driver i8042 failed with error -5
[Sat Feb 22 19:48:48 2025] tsc: Refined TSC clocksource calibration: 2297.339 MHz
[Sat Feb 22 19:48:48 2025] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x211d634274b, max_idle_ns: 440795203348 ns
[Sat Feb 22 19:48:48 2025] clocksource: Switched to clocksource tsc
...
[Sat Feb 22 19:48:48 2025] fail to initialize ptp_kvm
...
[Sat Feb 22 19:48:48 2025] i2c i2c-0: More than 8 memory slots on a single bus, contact i801 maintainer to add missing mux config
...

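After another reboot the system is currently running normally, but the system log contains some abnormal entries that still need to be investigated.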
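
To pull entries like the ones above out of a long boot log, a simple grep filter can help. This is only a sketch: filter_suspect_dmesg is a hypothetical helper, and the pattern list is my own guess based on the messages quoted in this note, not an exhaustive or authoritative set.

```shell
#!/usr/bin/env bash
# filter_suspect_dmesg: hypothetical helper that reads kernel log text on
# stdin and prints only lines matching a few suspicious phrases.
# The patterns are assumptions derived from the messages seen on this server.
filter_suspect_dmesg() {
    grep -Ei 'firmware bug|broken bios|NMI:|failed with error|fail to|unknown numa node'
}

# Typical use (reading the kernel log may require root):
#   dmesg -T | filter_suspect_dmesg
```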

  • Red Hat recommends setting the kernel.panic_on_io_nmi = 1 sysctl:

Setting the kernel.panic_on_io_nmi = 1 sysctl
# cat /etc/sysctl.conf
kernel.panic_on_io_nmi = 1

Then run sysctl -p to apply the setting.

Afterwards, check the NMI-related kernel settings:

Checking with sysctl -A
# sysctl -A | grep kernel | grep nmi
kernel.panic_on_io_nmi = 1
kernel.panic_on_unrecovered_nmi = 0
kernel.unknown_nmi_panic = 0
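
With kernel.panic_on_io_nmi = 1 the kernel panics on an I/O check NMI instead of trying to continue, which is useful if a crash dump is wanted. To keep the setting from being duplicated when the change is repeated, the line can be written idempotently. The sketch below is one way to do that; set_sysctl_conf is a hypothetical helper, and the drop-in file name in the usage comment is likewise only an example.

```shell
#!/usr/bin/env bash
# set_sysctl_conf KEY VALUE FILE: hypothetical helper that makes FILE contain
# exactly one "KEY = VALUE" line, replacing any earlier assignment of KEY.
set_sysctl_conf() {
    local key=$1 value=$2 file=$3 tmp
    tmp=$(mktemp) || return 1
    if [ -f "$file" ]; then
        # Keep every line whose name (text before '=', blanks removed) is not KEY.
        awk -v k="$key" -F= '{ name = $1; gsub(/[ \t]/, "", name); if (name != k) print }' \
            "$file" > "$tmp"
    fi
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Overwrite in place so the target keeps its ownership and permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example (as root; the file name is an assumption):
#   set_sysctl_conf kernel.panic_on_io_nmi 1 /etc/sysctl.d/99-nmi.conf
#   sysctl --system
```

Note that sysctl -p without arguments reads /etc/sysctl.conf, while sysctl --system also loads the drop-in files under /etc/sysctl.d/.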

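NMI: IOCK error is probably a hardware I/O error; I will keep watching to see whether it recurs.

Recurrence

After a few days of powering the server on and off, it again refused to boot. The iLO management interface was still reachable, however, and the Integrated Management Log shows the following error: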

References