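HPE DL360 Gen9 Server PCI Bus Error
Warning
Servers are built to stay powered on for long stretches and do not take well to repeated power cycling:
My impression is that second-hand servers are especially fragile and do not cope well with being left powered off for long periods.
I bought my HPE DL360 Gen9 server in September 2021, so it had been in continuous use for about two and a half years. For the last six months, however, I was away travelling after losing my job (What's Past Is Prologue), so the machine sat powered off for half a year. That is probably the main reason it was damaged: after enduring Shanghai's hot, humid weather, it finally raised serious error alerts when I powered it on today:
The Integrated Management Log (exported as CSV) shows: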
| ID | Severity | Class | Last Update | Initial Update | Count | Description | 
|---|---|---|---|---|---|---|
| 246 | Critical | System Error | 08/08/2024 01:31 | [NOT SET] | 8 | An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000000, 0x00000000) | 
| 245 | Critical | System Error | 08/08/2024 01:29 | 08/08/2024 01:29 | 1 | An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000000, 0x00000000) | 
| 244 | Critical | PCI Bus | 08/08/2024 01:31 | [NOT SET] | 40 | PCI Bus Error (Slot 1, Bus 0, Device 3, Function 2) | 
| 243 | Critical | PCI Bus | 08/08/2024 01:31 | [NOT SET] | 17 | PCI Bus Error (Slot 1, Bus 0, Device 3, Function 0) | 
| 242 | Critical | OS | 08/08/2024 01:28 | [NOT SET] | 1 | User Remotely Initiated NMI Switch | 
| 241 | Critical | OS | 08/08/2024 01:28 | 08/08/2024 01:28 | 1 | User Remotely Initiated NMI Switch | 
| 240 | Critical | PCI Bus | 08/08/2024 01:31 | [NOT SET] | 118 | Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 2, Error status 0x00000020) | 
| 239 | Critical | System Error | 08/08/2024 01:31 | [NOT SET] | 83 | Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible | 
| 238 | Critical | PCI Bus | 08/08/2024 01:31 | [NOT SET] | 49 | Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 0, Error status 0x00000020) | 
| 237 | Critical | PCI Bus | 08/08/2024 01:28 | 08/08/2024 01:28 | 1 | Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 0, Error status 0x00000020) | 
| 236 | Critical | System Error | 08/08/2024 01:28 | 08/08/2024 01:28 | 1 | Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible | 
| 235 | Critical | PCI Bus | 08/08/2024 01:28 | 08/08/2024 01:28 | 1 | PCI Bus Error (Slot 1, Bus 0, Device 3, Function 0) | 
| 234 | Caution | POST Message | 08/08/2024 01:25 | [NOT SET] | 1 | POST Error: 295-DIMM Failure - Uncorrectable Memory Error - Processor 1, DIMM 9. This memory will not be available to the operating system. ACTION: Replace the failed DIMM to restore the full amount of memory. | 
| 233 | Caution | POST Message | 08/08/2024 01:25 | [NOT SET] | 1 | POST Error: 207-Memory initialization error on Processor 1, DIMM 8. The operating system may not have access to all of the memory installed in the system. | 
| 232 | Caution | POST Message | 08/08/2024 01:25 | 08/08/2024 01:25 | 1 | POST Error: 207-Memory initialization error on Processor 1, DIMM 9. The operating system may not have access to all of the memory installed in the system. | 
Very unlucky: a server that still booted normally two months ago has now gone on strike...
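With this many repeated entries, it can help to total the Count column per PCI location before reading the log line by line. The following is a minimal sketch, not an HPE tool: it assumes the CSV export uses the same column names as the table above (a real iLO export may differ), and `SAMPLE_CSV` and `pci_error_totals` are hypothetical names introduced only for illustration.

```python
import csv
import io
import re
from collections import Counter

# Assumption: column names copied from the table above; a real export may differ.
SAMPLE_CSV = '''ID,Severity,Class,Last Update,Initial Update,Count,Description
244,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],40,"PCI Bus Error (Slot 1, Bus 0, Device 3, Function 2)"
243,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],17,"PCI Bus Error (Slot 1, Bus 0, Device 3, Function 0)"
240,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],118,"Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 2, Error status 0x00000020)"
234,Caution,POST Message,08/08/2024 01:25,[NOT SET],1,"POST Error: 295-DIMM Failure - Uncorrectable Memory Error - Processor 1, DIMM 9."
'''

# Matches the location part of the PCI-related descriptions.
PCI_LOCATION = re.compile(r"Slot (\d+), Bus (\d+), Device (\d+), Function (\d+)")


def pci_error_totals(csv_text):
    """Sum the Count column per (slot, bus, device, function) for PCI Bus rows."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Class"] != "PCI Bus":
            continue  # skip memory/POST and generic system error rows
        match = PCI_LOCATION.search(row["Description"])
        if match:
            location = tuple(int(group) for group in match.groups())
            totals[location] += int(row["Count"])
    return totals


totals = pci_error_totals(SAMPLE_CSV)
for location, count in totals.most_common():
    print(location, count)
```

Grouping by location makes it easier to see whether the errors all point at the same slot and device, as they do in the log above (Slot 1, Bus 0, Device 3).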
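Troubleshooting
- HP server iLO provides a very convenient graphical management interface. The management log shows that the memory errors are concentrated on two slots of Processor 1, DIMM 8 and DIMM 9, which means either the memory modules have failed or they are making poor contact:
  - Two modules failing in hardware at the same time is unlikely, so I lean toward poor seating; reseating the modules may well resolve the problem.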
 
 
 
- Pay attention to the HP DL360 Gen9 memory installation order:

HPE DL360 Gen9 memory slot order
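- I reseated the DIMMs that reported errors several times and then powered the server back on. Sure enough, the memory check passed completely. In HP server iLO, under System Information >> Memory Information, the DIMMs that had just reported errors now work normally (status: Good, In Use):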
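To know exactly which modules to reseat, the affected slots can be pulled out of the POST messages instead of being read off by eye. This is only an illustrative sketch: `failing_dimms` is a hypothetical helper, and the pattern assumes the "Processor N, DIMM M" wording seen in the log entries above.

```python
import re

# Assumption: POST messages name the slot as "Processor N, DIMM M".
DIMM_LOCATION = re.compile(r"Processor (\d+), DIMM (\d+)")


def failing_dimms(descriptions):
    """Return the sorted, de-duplicated (processor, dimm) pairs named in the messages."""
    found = set()
    for text in descriptions:
        for cpu, dimm in DIMM_LOCATION.findall(text):
            found.add((int(cpu), int(dimm)))
    return sorted(found)


# Descriptions copied from the POST Message rows of the first log.
post_messages = [
    "POST Error: 295-DIMM Failure - Uncorrectable Memory Error - Processor 1, DIMM 9. "
    "This memory will not be available to the operating system.",
    "POST Error: 207-Memory initialization error on Processor 1, DIMM 8.",
    "POST Error: 207-Memory initialization error on Processor 1, DIMM 9.",
]

for cpu, dimm in failing_dimms(post_messages):
    print(f"Reseat: Processor {cpu}, DIMM {dimm}")
```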
 
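Recurrence
After some more time in service, in February 2025 this aging HPE ProLiant DL360 Gen9 server finally refused to boot. I immediately bought a barebones HPE ProLiant DL380 Gen9 server to replace it, suspecting a motherboard problem at the time. And indeed, with the CPU moved into the new HP DL380 Gen9, the system booted normally.
Unfortunately, after a few days of powering it on and off, the server refused to boot again. The iLO management interface was still reachable, though, and the Integrated Management Log showed the following errors: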
| ID | Severity | Class | Last Update | Initial Update | Count | Description | 
|---|---|---|---|---|---|---|
| 153 | Critical | System Error | 02/25/2025 13:33 | [NOT SET] | 1 | An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000000, 0x00000000) | 
| 152 | Critical | PCI Bus | 02/25/2025 13:33 | [NOT SET] | 1 | Uncorrectable PCI Express Error (Slot 2, Bus 0, Device 3, Function 2, Error status 0x00004020) | 
| 151 | Critical | System Error | 02/25/2025 13:33 | 02/25/2025 13:33 | 1 | An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000000, 0x00000000) | 
| 150 | Critical | PCI Bus | 02/25/2025 13:33 | 02/25/2025 13:33 | 1 | Uncorrectable PCI Express Error (Slot 2, Bus 0, Device 3, Function 2, Error status 0x00000020) | 
| 149 | Critical | System Error | 02/25/2025 13:33 | 02/25/2025 13:33 | 1 | Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible | 
| 148 | Critical | System Error | 02/25/2025 13:31 | 02/25/2025 13:31 | 1 | Server Critical Fault (Service Information: Input Power Loss, Power Supply, Power Supply 1 (03h) Power Supply 2 (03h)) | 
| 147 | Critical | PCI Bus | 02/25/2025 13:33 | 02/25/2025 13:30 | 2 | PCI Bus Error (Slot 2, Bus 0, Device 3, Function 2) | 
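Since I had already replaced the server chassis (motherboard), I suspected that the fault was not the motherboard but the CPU from my old server. However, after I removed the second PCIe storage card from the PCIe riser board of the HPE ProLiant DL380 Gen9 server, it booted normally again for the time being. I still suspect the fault will come back.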
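The "Error status" values in these entries (0x00000020 and 0x00004020) can be broken down into individual bits. The sketch below assumes that iLO is reporting the raw PCIe AER Uncorrectable Error Status register; nothing in the log itself confirms that, so treat the decoded names as a hint rather than a diagnosis. The bit names follow the PCIe AER register layout, and `decode_error_status` is a hypothetical helper.

```python
# Bit positions of the PCIe AER Uncorrectable Error Status register.
# Assumption: the IML "Error status" field is this register, unmodified.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
}


def decode_error_status(status):
    """Return the names of all bits set in a 32-bit status value, lowest bit first."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            names.append(AER_UNCORRECTABLE_BITS.get(bit, f"bit {bit} (not in this table)"))
    return names


for raw in ("0x00000020", "0x00004020"):
    print(raw, decode_error_status(int(raw, 16)))
```

Under that assumption, both values have the Surprise Down bit set, and the later one adds Completion Timeout. A link that drops unexpectedly would be consistent with a card or riser connection problem, which fits the observation that removing the second PCIe storage card restored booting, but it does not rule out the CPU.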