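Installing NVIDIA CUDA in a QEMU Virtual Machine with GPU Passthrough
Building on the experience from "Installing the NVIDIA Linux Driver in an OVMF Virtual Machine", I continue in the BLFS QEMU environment by installing the NVIDIA Linux driver for "Debian running under QEMU with GPU passthrough". The goal is a GPU-accelerated machine learning environment, in preparation for later deep learning work.
The "Debian running under QEMU with GPU passthrough" virtual machine is started with vfio-pci configured: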
name=d2l
qemu-system-x86_64 \
-nodefaults \
-enable-kvm \
-cpu host,kvm=off \
-bios /usr/share/OVMF/OVMF_CODE.fd \
-m 32G \
-smp cores=4 \
-device vfio-pci,host=82:00.0 \
-drive file=/sources/images/${name}.qcow,if=virtio \
-net nic,model=virtio,macaddr=52:54:00:00:00:01 -net bridge,br=br0 \
-vga std \
-vnc :0 \
-serial mon:stdio \
-name "${name}"
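# Default terminal message:
# VNC server running on 127.0.0.1:5900
# To have VNC listen on all network interfaces, add the parameter -vnc :0 ; the terminal then no longer shows this message, but a VNC client can still connect
# lspci -nnk -d 10de:1e37
# The output shows the device: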
82:00.0 3D controller [0302]: NVIDIA Corporation TU102GL [Tesla T10 16GB / GRID RTX T10-2/T10-4/T10-8] [10de:1e37] (rev a1)
Subsystem: NVIDIA Corporation Tesla T10 16GB [10de:1370]
Kernel driver in use: vfio-pci
# so add the device to the qemu arguments by its host PCI address "82:00.0"
# i.e. -device vfio-pci,host=82:00.0 \
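Before starting the VM it helps to confirm that the host really has the GPU bound to vfio-pci, as the lspci output above shows. The following is a minimal sketch, not part of the original setup: the helper names `driver_in_use` and `is_vfio_bound` are hypothetical, and both simply parse `lspci -nnk` text from standard input.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel driver bound to a device,
# given `lspci -nnk` output for that device on stdin.
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p' | head -n 1
}

# Hypothetical helper: succeed only when that driver is vfio-pci.
is_vfio_bound() {
    [ "$(driver_in_use)" = "vfio-pci" ]
}
```

On the host this could be used as `lspci -nnk -s 82:00.0 | is_vfio_bound && echo "ready for passthrough"`. Parsing stdin rather than calling lspci inside the function keeps the check easy to try against saved output.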
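Note
This article runs CUDA directly inside the "Debian running under QEMU with GPU passthrough" virtual machine.
For the "Using a Tesla T10 with QEMU + Docker" setup, only the steps in "Installing the NVIDIA Linux Driver in an OVMF Virtual Machine" are performed inside the VM.
Preparation
Before the NVIDIA Linux driver is installed, the system log shows that the operating system loads the open-source nouveau driver by default, but fails to load the nvidia/tu102 firmware: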
# dmesg -T | grep -i nvidia
[Fri Feb 7 14:41:31 2025] nouveau 0000:00:04.0: NVIDIA TU102 (162000a1)
[Fri Feb 7 14:41:31 2025] audit: type=1400 audit(1738910491.813:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=382 comm="apparmor_parser"
[Fri Feb 7 14:41:31 2025] audit: type=1400 audit(1738910491.813:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=382 comm="apparmor_parser"
[Fri Feb 7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/nvdec/scrubber.bin (-2)
[Fri Feb 7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/nvdec/scrubber.bin (-2)
[Fri Feb 7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb 7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb 7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb 7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb 7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/unload_bl.bin (-2)
[Fri Feb 7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/unload_bl.bin (-2)
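Because nouveau logs each failed firmware request several times, the list above is easier to read once it is deduplicated. A minimal sketch, assuming the log format shown above; `missing_firmware` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print each
# firmware file that failed to load exactly once.
missing_firmware() {
    sed -n 's/.*firmware: failed to load \([^ ]*\) .*/\1/p' | sort -u
}
```

Typical use inside the guest would be `dmesg -T | missing_firmware`. For the log above this reduces the output to three files under nvidia/tu102/.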
Installation
Choose the CUDA software repository that matches your distribution from the NVIDIA CUDA Toolkit repo download page. The repository setup here targets Debian 12.
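As a rough sketch of what a Debian 12 network-repository install can look like, the function below strings together the keyring package, an apt update, and the two packages named in this article. The repository URL and the keyring file name are assumptions based on NVIDIA's usual layout and must be checked against the download page; `run` and `install_cuda` are hypothetical helper names. With `DRY_RUN=1` the commands are only printed.

```shell
#!/bin/sh
# Assumed values: confirm both against NVIDIA's download page before use.
CUDA_REPO_URL="https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64"
CUDA_KEYRING_DEB="cuda-keyring_1.1-1_all.deb"

# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# install_cuda: hypothetical wrapper; needs root when not in dry-run mode.
install_cuda() {
    run wget "${CUDA_REPO_URL}/${CUDA_KEYRING_DEB}" &&
    run dpkg -i "${CUDA_KEYRING_DEB}" &&
    run apt-get update &&
    # Install the driver package explicitly: cuda-toolkit does not pull it in.
    run apt-get install -y cuda-drivers &&
    run apt-get install -y cuda-toolkit
}
```

Running `DRY_RUN=1 install_cuda` first gives a chance to review the exact commands before anything is changed on the system.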
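Troubleshooting
Only cuda-toolkit installed, cuda-drivers missing
At first I installed only cuda-toolkit. After rebooting, I checked the dmesg -T output, which shows that nouveau is still in use, reports firmware load failures, and warns about GPU power (cables not connected?):
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: bios: version 90.02.41.00.01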
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/nvdec/scrubber.bin (-2)
[Fri Feb 7 20:12:31 2025] firmware_class: See https://wiki.debian.org/Firmware for information about missing firmware
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/nvdec/scrubber.bin (-2)
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/unload_bl.bin (-2)
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/unload_bl.bin (-2)
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: pmu: firmware unavailable
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: gr: firmware unavailable
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: sec2: firmware unavailable
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: gpio: GPU is missing power, check its power cables. Boot with nouveau.config=NvPowerChecks=0 to disable.
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: gpio: init failed, -22
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: init failed with -22
[Fri Feb 7 20:12:31 2025] nouveau: DRM-master:00000000:00000080: init failed with -22
[Fri Feb 7 20:12:31 2025] nouveau 0000:00:04.0: DRM-master: Device allocation failed: -22
[Fri Feb 7 20:12:31 2025] nouveau: probe of 0000:00:04.0 failed with error -22
[Fri Feb 7 20:12:31 2025] systemd-journald[232]: Time jumped backwards, rotating.
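I then realized my mistake: cuda-drivers has to be installed before cuda-toolkit (or perhaps both together?), and neither package includes the other. Without cuda-drivers, the system uses only the open-source nouveau driver and CUDA cannot work properly.
So I installed cuda-drivers as well and rebooted again (the installation section above has already been corrected). With cuda-drivers installed, the console reports the version of the driver being loaded:
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 570.86.15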
Failed to allocate NvKmsKapiDevice error
After installing the cuda-drivers package and rebooting, the terminal shows that nv_drm_load failed:
[ 7.613002] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000004] Failed to allocate NvKmsKapiDevice
[ 7.617015] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000004] Failed to register device
My earlier experience in "Installing the NVIDIA Linux Driver in an OVMF Virtual Machine" was that the guest kernel parameter pci=realloc needs to be set, so I tried adding it in /etc/default/grub:
# GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
GRUB_CMDLINE_LINUX_DEFAULT="pci=realloc quiet splash"
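The same edit can be scripted so that it is repeatable and does not add the parameter twice. This is only a sketch: `add_kernel_param` is a hypothetical helper, and it assumes the parameter contains no characters that are special to sed or grep (true for pci=realloc).

```shell
#!/bin/sh
# Hypothetical helper: prepend a kernel parameter to
# GRUB_CMDLINE_LINUX_DEFAULT in the given file unless already present.
# A backup of the file is kept next to it with a .bak suffix.
add_kernel_param() {
    param=$1
    file=${2:-/etc/default/grub}
    # Already present as a whole word inside the quoted value? Then stop.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
        return 0
    fi
    sed -i.bak "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"|&${param} |" "$file"
}
```

After `add_kernel_param pci=realloc`, the change only takes effect once the GRUB configuration is regenerated (`update-grub` on Debian) and the guest is rebooted; `grep pci=realloc /proc/cmdline` then confirms that the running kernel received it.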
That does not appear to be the cause.
Checking dmesg -T shows an "uncorrectable ECC error detected" message:
[Fri Feb 7 20:56:16 2025] nvidia: loading out-of-tree module taints kernel.
[Fri Feb 7 20:56:16 2025] nvidia: module license 'NVIDIA' taints kernel.
[Fri Feb 7 20:56:16 2025] Disabling lock debugging due to kernel taint
[Fri Feb 7 20:56:16 2025] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[Fri Feb 7 20:56:16 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 244
[Fri Feb 7 20:56:16 2025] ACPI: \_SB_.LNKD: Enabled at IRQ 11
[Fri Feb 7 20:56:16 2025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 570.86.15 Thu Jan 23 23:23:10 UTC 2025
[Fri Feb 7 20:56:16 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 570.86.15 Thu Jan 23 22:30:06 UTC 2025
[Fri Feb 7 20:56:16 2025] [drm] [nvidia-drm] [GPU ID 0x00000004] Loading driver
[Fri Feb 7 20:56:17 2025] nvidia 0000:00:04.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Fri Feb 7 20:56:17 2025] ACPI Warning: \_SB.PCI0.S20._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20220331/nsarguments-61)
[Fri Feb 7 20:56:17 2025] NVRM: GPU at PCI:0000:00:04: GPU-6db5b7b7-e914-19cd-3bc0-017cf2996a65
[Fri Feb 7 20:56:17 2025] NVRM: Xid (PCI:0000:00:04): 140, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691974, LTC:0, MMU:0, PCIE:0
[Fri Feb 7 20:56:17 2025] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0x40:2521)
[Fri Feb 7 20:56:18 2025] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
[Fri Feb 7 20:56:18 2025] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000004] Failed to allocate NvKmsKapiDevice
[Fri Feb 7 20:56:18 2025] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000004] Failed to register device
[Fri Feb 7 20:56:18 2025] nvidia 0000:00:04.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Fri Feb 7 20:56:18 2025] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0x40:2521)
[Fri Feb 7 20:56:18 2025] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
[Fri Feb 7 20:56:18 2025] nvidia 0000:00:04.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Fri Feb 7 20:56:18 2025] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0x40:2521)
[Fri Feb 7 20:56:18 2025] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
[Fri Feb 7 20:56:26 2025] systemd-journald[243]: Time jumped backwards, rotating.
This uncorrectable ECC error suggests an ECC hardware fault in the VRAM. The situation resembles the NVIDIA forum thread "Problems with A100 and Ubuntu 22.04": a hardware failure.
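In a long log the Xid line is easy to miss among the repeated RmInitAdapter failures. A minimal sketch for pulling out just the Xid events, assuming the NVRM log format shown above; `xid_events` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print one
# "<pci-address> <xid-code>" pair per distinct Xid event.
xid_events() {
    sed -n 's/.*NVRM: Xid (PCI:\([^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p' | sort -u
}
```

For example, `dmesg -T | xid_events` applied to the log above prints `0000:00:04 140`, and the code can then be looked up in NVIDIA's Xid documentation.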
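Remaining doubts
I tried moving the Tesla T10 from PCIe slot 3 to PCIe slot 1 of the HPE ProLiant DL360 Gen9 server.
A small blunder
I slipped up here: I had forgotten that PCIe bifurcation had previously been used to split PCIe 1 into two. With bifurcation enabled, passing the Tesla T10 through to the VM with vfio-pci makes QEMU abort at startup with the following error: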
qemu-system-x86_64: ../hw/pci/pci.c:1633: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
./run_d2l: line 16: 1143 Aborted qemu-system-x86_64 -nodefaults -enable-kvm -cpu host,kvm=off -bios /usr/share/OVMF/OVMF_CODE.fd -m 32G -smp cores=4 -device vfio-pci,host=0000:08:00.0 -drive file=/sources/images/${name}.qcow,if=virtio -net nic,model=virtio,macaddr=52:54:00:00:00:01 -net bridge,br=br0 -vga std -vnc :0 -serial mon:stdio -name "${name}"
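Moving the Tesla T10 to PCIe slot 1
After resetting the HPE ProLiant DL360 Gen9 server's system BIOS to defaults and reconfiguring it, PCIe bifurcation is disabled and the Tesla T10 now sits in PCIe slot 1, again passed through directly to the VM with vfio-pci. Watching the VM boot this time, the same uncorrectable ECC error appears: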
[Sat Feb 8 14:43:37 2025] nvidia: loading out-of-tree module taints kernel.
[Sat Feb 8 14:43:37 2025] nvidia: module license 'NVIDIA' taints kernel.
[Sat Feb 8 14:43:37 2025] Disabling lock debugging due to kernel taint
[Sat Feb 8 14:43:37 2025] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[Sat Feb 8 14:43:38 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 244
[Sat Feb 8 14:43:38 2025] systemd-journald[227]: Time jumped backwards, rotating.
[Sat Feb 8 14:43:38 2025] ACPI: \_SB_.LNKD: Enabled at IRQ 11
[Sat Feb 8 14:43:38 2025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 570.86.15 Thu Jan 23 23:23:10 UTC 2025
[Sat Feb 8 14:43:38 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 570.86.15 Thu Jan 23 22:30:06 UTC 2025
[Sat Feb 8 14:43:38 2025] [drm] [nvidia-drm] [GPU ID 0x00000004] Loading driver
[Sat Feb 8 14:43:38 2025] nvidia 0000:00:04.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Sat Feb 8 14:43:38 2025] ACPI Warning: \_SB.PCI0.S20._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20220331/nsarguments-61)
[Sat Feb 8 14:43:39 2025] NVRM: GPU at PCI:0000:00:04: GPU-6db5b7b7-e914-19cd-3bc0-017cf2996a65
[Sat Feb 8 14:43:39 2025] NVRM: Xid (PCI:0000:00:04): 140, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691974, LTC:0, MMU:0, PCIE:0
[Sat Feb 8 14:43:39 2025] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0x40:2521)
[Sat Feb 8 14:43:39 2025] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
[Sat Feb 8 14:43:39 2025] nvidia 0000:00:04.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Sat Feb 8 14:43:39 2025] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0x40:2521)
[Sat Feb 8 14:43:39 2025] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
[Sat Feb 8 14:43:39 2025] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000004] Failed to allocate NvKmsKapiDevice
[Sat Feb 8 14:43:39 2025] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000004] Failed to register device
[Sat Feb 8 14:43:39 2025] nvidia 0000:00:04.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Sat Feb 8 14:43:39 2025] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0x40:2521)
[Sat Feb 8 14:43:39 2025] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
In addition, the physical host's console shows these errors:
NMI: IOCK error (debug interrupt?) for reason 65 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 65 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 65 on CPU 0.
...
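Comparison: using the Tesla T10 directly on the physical host
I bought this Tesla T10 second-hand on Taobao, so its hardware quality is not guaranteed. I also could not rule out a problem with my virtualization setup, so I switched to using the card directly on a physical Debian host. I even reinstalled Debian.
After rebooting, lspci -vvv shows that the Tesla T10 is using the nvidia driver.
However, the system dmesg log still shows uncorrectable ECC error detected: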
[Sat Feb 8 21:20:38 2025] nvidia: loading out-of-tree module taints kernel.
[Sat Feb 8 21:20:38 2025] nvidia: module license 'NVIDIA' taints kernel.
[Sat Feb 8 21:20:38 2025] Disabling lock debugging due to kernel taint
[Sat Feb 8 21:20:38 2025] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[Sat Feb 8 21:20:38 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[Sat Feb 8 21:20:38 2025] nvidia 0000:08:00.0: enabling device (0140 -> 0142)
[Sat Feb 8 21:20:38 2025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 570.86.15 Thu Jan 23 23:23:10 UTC 2025
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f60
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f60
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f60
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fac
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fad
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f79
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f79
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f79
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6a
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6a
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6a
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6b
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6b
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6b
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6c
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6d
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[Sat Feb 8 21:20:38 2025] EDAC MC0: Giving out device to module sb_edac controller Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0 (INTERRUPT)
[Sat Feb 8 21:20:38 2025] EDAC MC1: Giving out device to module sb_edac controller Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0 (INTERRUPT)
[Sat Feb 8 21:20:38 2025] EDAC MC2: Giving out device to module sb_edac controller Haswell SrcID#0_Ha#1: DEV 0000:7f:12.4 (INTERRUPT)
[Sat Feb 8 21:20:38 2025] EDAC MC3: Giving out device to module sb_edac controller Haswell SrcID#1_Ha#1: DEV 0000:ff:12.4 (INTERRUPT)
[Sat Feb 8 21:20:38 2025] EDAC sbridge: Ver: 1.1.2
[Sat Feb 8 21:20:38 2025] intel_rapl_common: Found RAPL domain package
[Sat Feb 8 21:20:38 2025] intel_rapl_common: Found RAPL domain dram
[Sat Feb 8 21:20:38 2025] intel_rapl_common: DRAM domain energy unit 15300pj
[Sat Feb 8 21:20:38 2025] intel_rapl_common: Found RAPL domain package
[Sat Feb 8 21:20:38 2025] intel_rapl_common: Found RAPL domain dram
[Sat Feb 8 21:20:38 2025] intel_rapl_common: DRAM domain energy unit 15300pj
[Sat Feb 8 21:20:38 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 570.86.15 Thu Jan 23 22:30:06 UTC 2025
[Sat Feb 8 21:20:38 2025] [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
[Sat Feb 8 21:20:38 2025] nvidia 0000:08:00.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Sat Feb 8 21:20:39 2025] NVRM: GPU at PCI:0000:08:00: GPU-6db5b7b7-e914-19cd-3bc0-017cf2996a65
[Sat Feb 8 21:20:39 2025] NVRM: Xid (PCI:0000:08:00): 140, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691974, LTC:0, MMU:0, PCIE:0
[Sat Feb 8 21:20:39 2025] NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x62:0x40:2521)
[Sat Feb 8 21:20:39 2025] NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 0
[Sat Feb 8 21:20:39 2025] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to allocate NvKmsKapiDevice
[Sat Feb 8 21:20:39 2025] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to register device
[Sat Feb 8 21:20:39 2025] DMAR: DRHD: handling fault status reg 2
[Sat Feb 8 21:20:39 2025] DMAR: [DMA Write NO_PASID] Request device [08:00.0] fault addr 0xfbfd0000 [fault reason 0x05] PTE Write access is not set
[Sat Feb 8 21:20:39 2025] DMAR: [DMA Write NO_PASID] Request device [08:00.0] fault addr 0xfbfd0000 [fault reason 0x05] PTE Write access is not set
[Sat Feb 8 21:20:39 2025] DMAR: [DMA Write NO_PASID] Request device [08:00.0] fault addr 0xfbfd0000 [fault reason 0x05] PTE Write access is not set
[Sat Feb 8 21:20:39 2025] DMAR: [DMA Write NO_PASID] Request device [08:00.0] fault addr 0xfff47000 [fault reason 0x05] PTE Write access is not set
[Sat Feb 8 21:20:39 2025] DMAR: DRHD: handling fault status reg 400
[Sat Feb 8 21:20:39 2025] DMAR: DRHD: handling fault status reg 402
[Sat Feb 8 21:20:39 2025] DMAR: [DMA Write NO_PASID] Request device [08:00.0] fault addr 0xfbff0000 [fault reason 0x05] PTE Write access is not set
[Sat Feb 8 21:20:39 2025] nvidia 0000:08:00.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Sat Feb 8 21:20:39 2025] NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x62:0x40:2521)
[Sat Feb 8 21:20:39 2025] NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 0
[Sat Feb 8 21:20:39 2025] nvidia 0000:08:00.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Sat Feb 8 21:20:39 2025] NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x62:0x40:2521)
[Sat Feb 8 21:20:39 2025] NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 0
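According to NVIDIA's official Xid Errors documentation, the entry "140 Unrecovered ECC Error" means: GPU driver has observed uncorrectable errors in GPU memory, in such a way as to interrupt the GPU driver’s ability to mark the pages for dynamic page offlining or row remapping
Unfortunately this experiment was never completed: the Tesla T10 had a hardware fault, and I returned it to the Taobao seller. I will explore further when I get another chance...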