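Installing NVIDIA CUDA in a QEMU VM with GPU Passthrough

Drawing on the experience from Installing the NVIDIA Linux Driver in an OVMF VM, I continue in the BLFS QEMU environment and install the NVIDIA Linux driver in Debian Running with GPU Passthrough in QEMU. The result is a GPU-accelerated machine learning environment, in preparation for later Deep Learning work.

Start the vfio-pci configured VM from Debian Running with GPU Passthrough in QEMU:

Running a UEFI VM (using VNC)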
name=d2l

qemu-system-x86_64 \
    -nodefaults \
    -enable-kvm \
    -cpu host,kvm=off \
    -bios /usr/share/OVMF/OVMF_CODE.fd \
    -m 32G \
    -smp cores=4 \
    -device vfio-pci,host=82:00.0 \
    -drive file=/sources/images/${name}.qcow,if=virtio \
    -net nic,model=virtio,macaddr=52:54:00:00:00:01 -net bridge,br=br0 \
    -vga std \
    -vnc :0 \
    -serial mon:stdio \
    -name "${name}"

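# Default terminal message:
# VNC server running on 127.0.0.1:5900
# To make VNC listen on all network interfaces, add the option -vnc :0 ; the terminal then no longer prints this message, but a VNC client can still connect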

# lspci -nnk -d 10de:1e37
# The output shows the device:
82:00.0 3D controller [0302]: NVIDIA Corporation TU102GL [Tesla T10 16GB / GRID RTX T10-2/T10-4/T10-8] [10de:1e37] (rev a1)
	Subsystem: NVIDIA Corporation Tesla T10 16GB [10de:1370]
	Kernel driver in use: vfio-pci
# So add the device to the qemu arguments by its host PCI address "82:00.0"
# i.e. -device vfio-pci,host=82:00.0 \
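
Before launching the VM, it can help to confirm on the host that the GPU is really bound to vfio-pci. The following is a minimal sketch, assuming the `lspci -nnk` output format shown above; the function name `driver_in_use` is hypothetical and not part of any tool.

```shell
# Hypothetical helper: read `lspci -nnk` output on stdin and print the
# kernel driver named on the first "Kernel driver in use:" line.
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use:[[:space:]]*//p' | head -n 1
}

# Example on the host (not executed here):
#   lspci -nnk -d 10de:1e37 | driver_in_use
# If this prints anything other than vfio-pci, the device is not ready
# for -device vfio-pci,host=...
```

The function prints nothing when no driver is bound at all, which is also worth checking before starting QEMU.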

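Note

This article runs CUDA directly inside Debian Running with GPU Passthrough in QEMU.

For Using the Tesla T10 with QEMU+Docker, only the steps in Installing the NVIDIA Linux Driver in an OVMF VM need to be performed inside the VM.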

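Preparation

  • Before the NVIDIA Linux driver is installed, the system log shows that the operating system loads the open-source nouveau driver by default, but fails to load the nvidia/tu102 firmware:

System log before installing the CUDA driver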
# dmesg -T | grep -i nvidia
[Fri Feb  7 14:41:31 2025] nouveau 0000:00:04.0: NVIDIA TU102 (162000a1)
[Fri Feb  7 14:41:31 2025] audit: type=1400 audit(1738910491.813:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=382 comm="apparmor_parser"
[Fri Feb  7 14:41:31 2025] audit: type=1400 audit(1738910491.813:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=382 comm="apparmor_parser"
[Fri Feb  7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/nvdec/scrubber.bin (-2)
[Fri Feb  7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/nvdec/scrubber.bin (-2)
[Fri Feb  7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb  7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb  7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb  7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb  7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/unload_bl.bin (-2)
[Fri Feb  7 14:41:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/unload_bl.bin (-2)
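
The log above repeats the same few failures many times. The `(-2)` suffix is the kernel's -ENOENT, meaning the firmware file was not found. As a rough sketch, the distinct missing files can be listed like this, assuming the dmesg line format shown above; the function name `missing_nouveau_firmware` is hypothetical.

```shell
# Hypothetical helper: read dmesg text on stdin and print each firmware
# path that nouveau failed to load, once per distinct path.
missing_nouveau_firmware() {
    sed -n 's/.*nouveau .*firmware: failed to load \([^ ]*\) .*/\1/p' | sort -u
}

# Example in the guest (not executed here):
#   dmesg -T | missing_nouveau_firmware
```

For the log above this reduces the output to three paths under nvidia/tu102/.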

Installation

Add the official NVIDIA software repository on Debian 12
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update

# cuda-drivers and cuda-toolkit are separate packages; both must be installed
apt-get -y install cuda-drivers cuda-toolkit

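Troubleshooting

Only cuda-toolkit installed, cuda-drivers missing

At first I installed only cuda-toolkit. After the installation finished I rebooted and checked the dmesg -T output:

The NVIDIA driver is not loaded: nouveau is still in use, firmware loading fails, and it asks whether the power cables are connected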
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: bios: version 90.02.41.00.01
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/nvdec/scrubber.bin (-2)
[Fri Feb  7 20:12:31 2025] firmware_class: See https://wiki.debian.org/Firmware for information about missing firmware
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/nvdec/scrubber.bin (-2)
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/bl.bin (-2)
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/unload_bl.bin (-2)
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: firmware: failed to load nvidia/tu102/acr/unload_bl.bin (-2)
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: pmu: firmware unavailable
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: gr: firmware unavailable
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: sec2: firmware unavailable
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: gpio: GPU is missing power, check its power cables.  Boot with nouveau.config=NvPowerChecks=0 to disable.
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: gpio: init failed, -22
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: init failed with -22
[Fri Feb  7 20:12:31 2025] nouveau: DRM-master:00000000:00000080: init failed with -22
[Fri Feb  7 20:12:31 2025] nouveau 0000:00:04.0: DRM-master: Device allocation failed: -22
[Fri Feb  7 20:12:31 2025] nouveau: probe of 0000:00:04.0 failed with error -22
[Fri Feb  7 20:12:31 2025] systemd-journald[232]: Time jumped backwards, rotating.

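I had got it wrong: cuda-drivers needs to be installed before cuda-toolkit (or perhaps both together?), and neither package includes the other. Without cuda-drivers the system only uses the open-source nouveau driver and CUDA cannot work properly.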
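
Since neither package pulls in the other, a quick check for both avoids a wasted reboot. This is a sketch only: the function name `missing_cuda_packages` is hypothetical, and it works on a plain list of installed package names rather than calling dpkg itself.

```shell
# Hypothetical helper: read installed package names (one per line) on
# stdin and print each of the two required packages that is absent.
missing_cuda_packages() {
    installed=$(cat)
    for pkg in cuda-drivers cuda-toolkit; do
        printf '%s\n' "$installed" | grep -qxF -- "$pkg" || printf '%s\n' "$pkg"
    done
}

# Example in the guest (not executed here); keeps only fully installed packages:
#   dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' \
#       | awk '$1 == "ii" { print $2 }' | missing_cuda_packages
```

Empty output means both packages are present; otherwise each missing package name is printed on its own line.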

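So I installed cuda-drivers as well and rebooted again (the install command above already reflects this correction). With cuda-drivers installed, the console prints the version of the loaded driver:

Console version message after installing cuda-drivers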
NVRM: loading NVIDIA UNIX x86_64 Kernel Module  570.86.15

The Failed to allocate NvKmsKapiDevice error

After installing cuda-drivers and rebooting, the console showed this error:

nv_drm_load fails to load
[    7.613002] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000004] Failed to allocate NvKmsKapiDevice
[    7.617015] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000004] Failed to register device

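My earlier experience in Installing the NVIDIA Linux Driver in an OVMF VM was that the guest kernel needed the pci=realloc parameter, so I tried that change:

Adding the pci=realloc parameter in /etc/default/grub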
# GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
GRUB_CMDLINE_LINUX_DEFAULT="pci=realloc quiet splash"
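
Editing /etc/default/grub has no effect until the GRUB configuration is regenerated and the system rebooted; on Debian, `update-grub` regenerates it. The sketch below then checks that the parameter reached the running kernel. The function name `cmdline_has` is hypothetical.

```shell
# Hypothetical helper: succeed if the kernel parameter given as $1 appears
# as a whole word in the command line read from stdin.
cmdline_has() {
    tr ' ' '\n' | grep -qxF -- "$1"
}

# After editing /etc/default/grub in the guest (not executed here):
#   update-grub && reboot
# and after the reboot:
#   cmdline_has pci=realloc < /proc/cmdline && echo "pci=realloc is active"
```

Matching whole words keeps a longer parameter such as pci=realloc=off from being mistaken for pci=realloc.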

That does not seem to be the cause.

Checking dmesg -T shows an uncorrectable ECC error detected message:

dmesg shows an uncorrectable ECC error
[Fri Feb  7 20:56:16 2025] nvidia: loading out-of-tree module taints kernel.
[Fri Feb  7 20:56:16 2025] nvidia: module license 'NVIDIA' taints kernel.
[Fri Feb  7 20:56:16 2025] Disabling lock debugging due to kernel taint
[Fri Feb  7 20:56:16 2025] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[Fri Feb  7 20:56:16 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 244
[Fri Feb  7 20:56:16 2025] ACPI: \_SB_.LNKD: Enabled at IRQ 11
[Fri Feb  7 20:56:16 2025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  570.86.15  Thu Jan 23 23:23:10 UTC 2025
[Fri Feb  7 20:56:16 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  570.86.15  Thu Jan 23 22:30:06 UTC 2025
[Fri Feb  7 20:56:16 2025] [drm] [nvidia-drm] [GPU ID 0x00000004] Loading driver
[Fri Feb  7 20:56:17 2025] nvidia 0000:00:04.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Fri Feb  7 20:56:17 2025] ACPI Warning: \_SB.PCI0.S20._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20220331/nsarguments-61)
[Fri Feb  7 20:56:17 2025] NVRM: GPU at PCI:0000:00:04: GPU-6db5b7b7-e914-19cd-3bc0-017cf2996a65
[Fri Feb  7 20:56:17 2025] NVRM: Xid (PCI:0000:00:04): 140, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691974, LTC:0, MMU:0, PCIE:0
[Fri Feb  7 20:56:17 2025] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0x40:2521)
[Fri Feb  7 20:56:18 2025] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
[Fri Feb  7 20:56:18 2025] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000004] Failed to allocate NvKmsKapiDevice
[Fri Feb  7 20:56:18 2025] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000004] Failed to register device
[Fri Feb  7 20:56:18 2025] nvidia 0000:00:04.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Fri Feb  7 20:56:18 2025] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0x40:2521)
[Fri Feb  7 20:56:18 2025] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
[Fri Feb  7 20:56:18 2025] nvidia 0000:00:04.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Fri Feb  7 20:56:18 2025] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0x40:2521)
[Fri Feb  7 20:56:18 2025] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
[Fri Feb  7 20:56:26 2025] systemd-journald[243]: Time jumped backwards, rotating.

This uncorrectable ECC error looks like a hardware ECC fault in the VRAM. The situation resembles the NVIDIA forum thread Problems with A100 and Ubuntu 22.04, which also pointed to a hardware fault.
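
The Xid line is easy to lose among the other driver messages. A small filter can pull out only the PCI address and Xid code. This is a sketch assuming the NVRM line format shown in the log above; the function name `xid_events` is hypothetical.

```shell
# Hypothetical helper: read dmesg text on stdin and print
# "<pci-address> <xid-code>" for every NVRM Xid line.
xid_events() {
    sed -n 's/.*NVRM: Xid (PCI:\([^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}

# Example (not executed here):
#   dmesg -T | xid_events
```

The printed code can then be looked up in NVIDIA's Xid documentation.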

Remaining doubts

I tried moving the Tesla T10 from PCIe slot 3 to PCIe slot 1 of the HPE ProLiant DL360 Gen9 server.

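A small mix-up

There was a small mix-up here: I had forgotten that I had earlier used PCIe bifurcation to split PCIe 1 in two. It turns out that when the Tesla T10 sits in a bifurcated slot and is passed through to the VM via vfio-pci, starting the VM fails with the following error:

QEMU GPU passthrough VM startup error caused by forgetting to disable PCIe bifurcation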
qemu-system-x86_64: ../hw/pci/pci.c:1633: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
./run_d2l: line 16:  1143 Aborted                 qemu-system-x86_64 -nodefaults -enable-kvm -cpu host,kvm=off -bios /usr/share/OVMF/OVMF_CODE.fd -m 32G -smp cores=4 -device vfio-pci,host=0000:08:00.0 -drive file=/sources/images/${name}.qcow,if=virtio -net nic,model=virtio,macaddr=52:54:00:00:00:01 -net bridge,br=br0 -vga std -vnc :0 -serial mon:stdio -name "${name}"

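Moving the Tesla T10 to PCIe 1

After resetting the system BIOS of the HPE ProLiant DL360 Gen9 server to defaults and reconfiguring it, PCIe bifurcation was disabled. The Tesla T10 now sits in PCIe 1 and is again passed through directly to the VM via vfio-pci. After this VM start, the same uncorrectable ECC error appears:

Tesla T10 moved to PCIe 1, but dmesg in the VM still shows the uncorrectable ECC error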
[Sat Feb  8 14:43:37 2025] nvidia: loading out-of-tree module taints kernel.
[Sat Feb  8 14:43:37 2025] nvidia: module license 'NVIDIA' taints kernel.
[Sat Feb  8 14:43:37 2025] Disabling lock debugging due to kernel taint
[Sat Feb  8 14:43:37 2025] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[Sat Feb  8 14:43:38 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 244
[Sat Feb  8 14:43:38 2025] systemd-journald[227]: Time jumped backwards, rotating.
[Sat Feb  8 14:43:38 2025] ACPI: \_SB_.LNKD: Enabled at IRQ 11
[Sat Feb  8 14:43:38 2025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  570.86.15  Thu Jan 23 23:23:10 UTC 2025
[Sat Feb  8 14:43:38 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  570.86.15  Thu Jan 23 22:30:06 UTC 2025
[Sat Feb  8 14:43:38 2025] [drm] [nvidia-drm] [GPU ID 0x00000004] Loading driver
[Sat Feb  8 14:43:38 2025] nvidia 0000:00:04.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Sat Feb  8 14:43:38 2025] ACPI Warning: \_SB.PCI0.S20._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20220331/nsarguments-61)
[Sat Feb  8 14:43:39 2025] NVRM: GPU at PCI:0000:00:04: GPU-6db5b7b7-e914-19cd-3bc0-017cf2996a65
[Sat Feb  8 14:43:39 2025] NVRM: Xid (PCI:0000:00:04): 140, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691974, LTC:0, MMU:0, PCIE:0
[Sat Feb  8 14:43:39 2025] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0x40:2521)
[Sat Feb  8 14:43:39 2025] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
[Sat Feb  8 14:43:39 2025] nvidia 0000:00:04.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Sat Feb  8 14:43:39 2025] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0x40:2521)
[Sat Feb  8 14:43:39 2025] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
[Sat Feb  8 14:43:39 2025] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000004] Failed to allocate NvKmsKapiDevice
[Sat Feb  8 14:43:39 2025] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000004] Failed to register device
[Sat Feb  8 14:43:39 2025] nvidia 0000:00:04.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Sat Feb  8 14:43:39 2025] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0x40:2521)
[Sat Feb  8 14:43:39 2025] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0

In addition, the console of the physical host showed errors:

Physical host console reports NMI IOCK error
NMI: IOCK error (debug interrupt?) for reason 65 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 65 on CPU 0.
NMI: IOCK error (debug interrupt?) for reason 65 on CPU 0.
...

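Comparing with the Tesla T10 on the physical host

I bought this Tesla T10 second-hand on Taobao, so its hardware quality is not guaranteed. But I also could not rule out a problem with my virtualization setup, so I switched to using the Tesla T10 directly from Debian on the physical host. I even reinstalled Debian.

After rebooting, lspci -vvv shows that the Tesla T10 is using the nvidia driver.

However, the system dmesg log still shows uncorrectable ECC error detected:

The error persists with the Tesla T10 on the physical host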
[Sat Feb  8 21:20:38 2025] nvidia: loading out-of-tree module taints kernel.
[Sat Feb  8 21:20:38 2025] nvidia: module license 'NVIDIA' taints kernel.
[Sat Feb  8 21:20:38 2025] Disabling lock debugging due to kernel taint
[Sat Feb  8 21:20:38 2025] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[Sat Feb  8 21:20:38 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 238

[Sat Feb  8 21:20:38 2025] nvidia 0000:08:00.0: enabling device (0140 -> 0142)
[Sat Feb  8 21:20:38 2025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  570.86.15  Thu Jan 23 23:23:10 UTC 2025
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f60
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f60
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f60
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fac
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fad
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f79
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f79
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f79
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6a
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6a
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6a
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6b
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6b
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6b
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6c
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2f6d
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[Sat Feb  8 21:20:38 2025] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[Sat Feb  8 21:20:38 2025] EDAC MC0: Giving out device to module sb_edac controller Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0 (INTERRUPT)
[Sat Feb  8 21:20:38 2025] EDAC MC1: Giving out device to module sb_edac controller Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0 (INTERRUPT)
[Sat Feb  8 21:20:38 2025] EDAC MC2: Giving out device to module sb_edac controller Haswell SrcID#0_Ha#1: DEV 0000:7f:12.4 (INTERRUPT)
[Sat Feb  8 21:20:38 2025] EDAC MC3: Giving out device to module sb_edac controller Haswell SrcID#1_Ha#1: DEV 0000:ff:12.4 (INTERRUPT)
[Sat Feb  8 21:20:38 2025] EDAC sbridge:  Ver: 1.1.2 
[Sat Feb  8 21:20:38 2025] intel_rapl_common: Found RAPL domain package
[Sat Feb  8 21:20:38 2025] intel_rapl_common: Found RAPL domain dram
[Sat Feb  8 21:20:38 2025] intel_rapl_common: DRAM domain energy unit 15300pj
[Sat Feb  8 21:20:38 2025] intel_rapl_common: Found RAPL domain package
[Sat Feb  8 21:20:38 2025] intel_rapl_common: Found RAPL domain dram
[Sat Feb  8 21:20:38 2025] intel_rapl_common: DRAM domain energy unit 15300pj
[Sat Feb  8 21:20:38 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  570.86.15  Thu Jan 23 22:30:06 UTC 2025
[Sat Feb  8 21:20:38 2025] [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
[Sat Feb  8 21:20:38 2025] nvidia 0000:08:00.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Sat Feb  8 21:20:39 2025] NVRM: GPU at PCI:0000:08:00: GPU-6db5b7b7-e914-19cd-3bc0-017cf2996a65
[Sat Feb  8 21:20:39 2025] NVRM: Xid (PCI:0000:08:00): 140, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691974, LTC:0, MMU:0, PCIE:0
[Sat Feb  8 21:20:39 2025] NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x62:0x40:2521)
[Sat Feb  8 21:20:39 2025] NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 0
[Sat Feb  8 21:20:39 2025] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to allocate NvKmsKapiDevice
[Sat Feb  8 21:20:39 2025] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to register device
[Sat Feb  8 21:20:39 2025] DMAR: DRHD: handling fault status reg 2
[Sat Feb  8 21:20:39 2025] DMAR: [DMA Write NO_PASID] Request device [08:00.0] fault addr 0xfbfd0000 [fault reason 0x05] PTE Write access is not set
[Sat Feb  8 21:20:39 2025] DMAR: [DMA Write NO_PASID] Request device [08:00.0] fault addr 0xfbfd0000 [fault reason 0x05] PTE Write access is not set
[Sat Feb  8 21:20:39 2025] DMAR: [DMA Write NO_PASID] Request device [08:00.0] fault addr 0xfbfd0000 [fault reason 0x05] PTE Write access is not set
[Sat Feb  8 21:20:39 2025] DMAR: [DMA Write NO_PASID] Request device [08:00.0] fault addr 0xfff47000 [fault reason 0x05] PTE Write access is not set
[Sat Feb  8 21:20:39 2025] DMAR: DRHD: handling fault status reg 400
[Sat Feb  8 21:20:39 2025] DMAR: DRHD: handling fault status reg 402
[Sat Feb  8 21:20:39 2025] DMAR: [DMA Write NO_PASID] Request device [08:00.0] fault addr 0xfbff0000 [fault reason 0x05] PTE Write access is not set
[Sat Feb  8 21:20:39 2025] nvidia 0000:08:00.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Sat Feb  8 21:20:39 2025] NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x62:0x40:2521)
[Sat Feb  8 21:20:39 2025] NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 0
[Sat Feb  8 21:20:39 2025] nvidia 0000:08:00.0: firmware: direct-loading firmware nvidia/570.86.15/gsp_tu10x.bin
[Sat Feb  8 21:20:39 2025] NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x62:0x40:2521)
[Sat Feb  8 21:20:39 2025] NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 0

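According to NVIDIA's official Xid Errors documentation, Xid 140 (Unrecovered ECC Error) means: GPU driver has observed uncorrectable errors in GPU memory, in such a way as to interrupt the GPU driver’s ability to mark the pages for dynamic page offlining or row remapping

Unfortunately, I could not complete this exercise: the Tesla T10 turned out to have a hardware fault, and I returned it to the Taobao seller. I will explore further when I get another chance...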