HPE DL360 Gen9 Server PCI Bus Error

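Warning

Servers are built to stay powered on for long periods, not to be switched on and off repeatedly:

My impression is that second-hand servers are especially fragile and do not cope well with being left powered off for a long time.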

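I bought my HPE DL360 Gen9 server in September 2021, so it had been in continuous use for about two and a half years. For the last six months, however, I was out of work ("What's past is prologue") and away traveling, so the server sat powered off for half a year. That is probably the main reason it was damaged: after suffering through Shanghai's hot, humid weather, it finally raised serious error alerts when I powered it on today: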

The Integrated Management Log (CSV format) shows:

HPE DL360 Gen9 server PCI bus error log

ID,Severity,Class,Last Update,Initial Update,Count,Description
246,Critical,System Error,08/08/2024 01:31,[NOT SET],8,"An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000000, 0x00000000)"
245,Critical,System Error,08/08/2024 01:29,08/08/2024 01:29,1,"An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000000, 0x00000000)"
244,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],40,"PCI Bus Error (Slot 1, Bus 0, Device 3, Function 2)"
243,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],17,"PCI Bus Error (Slot 1, Bus 0, Device 3, Function 0)"
242,Critical,OS,08/08/2024 01:28,[NOT SET],1,"User Remotely Initiated NMI Switch"
241,Critical,OS,08/08/2024 01:28,08/08/2024 01:28,1,"User Remotely Initiated NMI Switch"
240,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],118,"Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 2, Error status 0x00000020)"
239,Critical,System Error,08/08/2024 01:31,[NOT SET],83,"Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible"
238,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],49,"Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 0, Error status 0x00000020)"
237,Critical,PCI Bus,08/08/2024 01:28,08/08/2024 01:28,1,"Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 0, Error status 0x00000020)"
236,Critical,System Error,08/08/2024 01:28,08/08/2024 01:28,1,"Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible"
235,Critical,PCI Bus,08/08/2024 01:28,08/08/2024 01:28,1,"PCI Bus Error (Slot 1, Bus 0, Device 3, Function 0)"
234,Caution,POST Message,08/08/2024 01:25,[NOT SET],1,"POST Error: 295-DIMM Failure - Uncorrectable Memory Error - Processor 1, DIMM 9. This memory will not be available to the operating system. ACTION: Replace the failed DIMM to restore the full amount of memory."
233,Caution,POST Message,08/08/2024 01:25,[NOT SET],1,"POST Error: 207-Memory initialization error on Processor 1, DIMM 8. The operating system may not have access to all of the memory installed in the system."
232,Caution,POST Message,08/08/2024 01:25,08/08/2024 01:25,1,"POST Error: 207-Memory initialization error on Processor 1, DIMM 9. The operating system may not have access to all of the memory installed in the system."

How unfortunate: a server that was still booting normally two months ago has now gone on strike...
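With this many entries it helps to tally the log before guessing at a cause. The following is a minimal sketch rather than an HPE tool: it assumes the IML has been saved as CSV using the column names shown in the log, and it embeds only a few of the rows above (with the long POST descriptions shortened). The names `IML_CSV` and `summarize_iml` are made up for this illustration.

```python
import csv
import io
import re
from collections import Counter

# A few rows from the log above, re-typed as CSV.
# The exact layout of a real iLO export is an assumption here.
IML_CSV = """\
ID,Severity,Class,Last Update,Initial Update,Count,Description
244,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],40,"PCI Bus Error (Slot 1, Bus 0, Device 3, Function 2)"
243,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],17,"PCI Bus Error (Slot 1, Bus 0, Device 3, Function 0)"
240,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],118,"Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 2, Error status 0x00000020)"
238,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],49,"Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 0, Error status 0x00000020)"
234,Caution,POST Message,08/08/2024 01:25,[NOT SET],1,"POST Error: 295-DIMM Failure - Uncorrectable Memory Error - Processor 1, DIMM 9."
233,Caution,POST Message,08/08/2024 01:25,[NOT SET],1,"POST Error: 207-Memory initialization error on Processor 1, DIMM 8."
"""

PCI_RE = re.compile(r"Slot (\d+), Bus (\d+), Device (\d+), Function (\d+)")
DIMM_RE = re.compile(r"Processor (\d+), DIMM (\d+)")


def summarize_iml(csv_text):
    """Sum the Count column per class and per PCI location; collect DIMMs."""
    by_class = Counter()
    by_pci = Counter()   # key: (slot, bus, device, function)
    dimms = set()        # items: (processor, dimm)
    for row in csv.DictReader(io.StringIO(csv_text)):
        count = int(row["Count"])
        desc = row["Description"]
        by_class[row["Class"]] += count
        match = PCI_RE.search(desc)
        if match:
            by_pci[tuple(int(g) for g in match.groups())] += count
        match = DIMM_RE.search(desc)
        if match:
            dimms.add(tuple(int(g) for g in match.groups()))
    return by_class, by_pci, dimms


by_class, by_pci, dimms = summarize_iml(IML_CSV)
print("occurrences by class:", by_class.most_common())
print("occurrences by (slot, bus, device, function):", by_pci.most_common())
print("DIMMs named in POST errors (processor, dimm):", sorted(dimms))
```

On this excerpt the PCI errors all land on Slot 1 (Bus 0, Device 3, Functions 0 and 2), while the POST errors name only DIMM 8 and DIMM 9 on Processor 1, so the two groups can be investigated separately.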

Troubleshooting

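  • iLO on HP servers provides a very convenient graphical management interface. Checking the management log shows that the POST errors are concentrated on two memory slots of Processor 1, DIMM 8 and DIMM 9, which means that either the DIMMs have failed or they are making poor contact:

    • Since two DIMMs failing at the same time is unlikely, I lean toward a poor connection in the slots, in which case reseating the DIMMs may well fix the problem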

../../../../_images/dl360_gen9_pci_bus_error-1.png
../../../../_images/dl360_gen9_pci_bus_error-2.png
../../../../_images/hpe_dl360_gen9_memory.webp

HPE DL360 Gen9 memory slot order

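  • I reseated the failing DIMMs several times and then powered the server back on. Sure enough, the memory check now passed completely. In iLO, System Information >> Memory Information shows that the DIMMs that had reported errors are working normally again (status Good, In Use):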

../../../../_images/dl360_gen9_pci_bus_error-3.png

Recurrence

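After some more use, in February 2025 this ancient HPE ProLiant DL360 Gen9 server finally refused to boot. Suspecting the motherboard, I immediately bought a barebone HPE ProLiant DL380 Gen9 server to replace it. Sure enough, with the CPUs moved into the new HP DL380 Gen9, the system booted normally.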

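Unfortunately, after a few days of powering it on and off, the server refused to boot again. The iLO management interface was still reachable, though, and the Integrated Management Log showed the following errors:

HPE DL380 Gen9 server PCI bus error log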

ID,Severity,Class,Last Update,Initial Update,Count,Description
153,Critical,System Error,02/25/2025 13:33,[NOT SET],1,"An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000000, 0x00000000)"
152,Critical,PCI Bus,02/25/2025 13:33,[NOT SET],1,"Uncorrectable PCI Express Error (Slot 2, Bus 0, Device 3, Function 2, Error status 0x00004020)"
151,Critical,System Error,02/25/2025 13:33,02/25/2025 13:33,1,"An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000000, 0x00000000)"
150,Critical,PCI Bus,02/25/2025 13:33,02/25/2025 13:33,1,"Uncorrectable PCI Express Error (Slot 2, Bus 0, Device 3, Function 2, Error status 0x00000020)"
149,Critical,System Error,02/25/2025 13:33,02/25/2025 13:33,1,"Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible"
148,Critical,System Error,02/25/2025 13:31,02/25/2025 13:31,1,"Server Critical Fault (Service Information: Input Power Loss, Power Supply, Power Supply 1 (03h) Power Supply 2 (03h))"
147,Critical,PCI Bus,02/25/2025 13:33,02/25/2025 13:30,2,"PCI Bus Error (Slot 2, Bus 0, Device 3, Function 2)"

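Since I had already replaced the whole server chassis (and with it the motherboard), I suspect the fault is not in the motherboard but in the CPU carried over from my old server. That said, after I removed the second PCIe storage card from the PCIe riser of the HPE ProLiant DL380 Gen9 server, it booted normally again for the time being. I still suspect the failure will come back.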
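The "Error status" values in the log entries (0x00000020 and 0x00004020) may offer one more hint. The sketch below rests on an assumption that the log text itself does not confirm: that HPE reports the raw PCI Express AER Uncorrectable Error Status register in this field. The bit names follow the PCI Express Base Specification; the function name `decode_uncorrectable_status` is made up for this illustration.

```python
from typing import List

# Bit positions of the PCIe AER Uncorrectable Error Status register.
# ASSUMPTION: the IML "Error status" field holds this register verbatim.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
}


def decode_uncorrectable_status(status: int) -> List[str]:
    """Return the names of all set bits, lowest bit first."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            names.append(AER_UNCORRECTABLE_BITS.get(bit, f"unknown bit {bit}"))
    return names


for value in (0x00000020, 0x00004020):
    print(f"0x{value:08X}:", ", ".join(decode_uncorrectable_status(value)))
```

Under that assumption, 0x00000020 decodes to a Surprise Down Error (the link to the device dropped unexpectedly) and 0x00004020 adds a Completion Timeout. Both would fit a card or slot that intermittently loses its link, which matches the observation that removing the second PCIe storage card restored booting, but neither rules out the CPU.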