Running Docker on a Tesla P4 GPU in a bhyve Ubuntu VM

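I once tried to build a low-power Machine Learning environment on Raspberry Pi 5 hardware by attaching an Nvidia Tesla P4 GPU compute card through a Raspberry Pi 5 PCIe-to-M.2 NVMe SSD adapter, but I ran into quite a few setbacks.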

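So I went back to standard x86 hardware: a self-assembled desktop running the FreeBSD operating system, on which I plan to build a FreeBSD machine learning environment.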

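This article covers the first step: installing NVIDIA CUDA (following "Installing NVIDIA CUDA on Ubuntu") in the Ubuntu virtual machine that runs with the GPU passed through, as set up in "Implementing NVIDIA GPU passthrough in bhyve".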

Preparation

  • After the VM boots, check dmesg. Because the CUDA driver has not been installed yet, the driver in use is nouveau:

    [Mon Jul 28 06:01:24 2025] nouveau 0000:00:06.0: NVIDIA GP104 (134000a1)
    
  • Install the development tools the same way as in "Debian minimal system initialization" for a headless server (mainly build-essential):

Install development tools for a headless server
sudo apt install build-essential cmake vim-nox python3-dev -y
  • The CUDA driver needs the kernel headers and the development toolchain, because its kernel module has to be compiled against the running kernel.

Install linux-headers:

Install linux-headers
sudo apt-get install linux-headers-$(uname -r)
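Since the driver module is built against the running kernel, it can help to confirm the headers actually match `uname -r` before installing the driver. The sketch below assumes the Debian/Ubuntu layout, where `linux-headers-<release>` provides the `/lib/modules/<release>/build` symlink that DKMS compiles against; `check_kernel_headers` is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: check that the kernel build tree for a release exists.
# Assumption: linux-headers-<release> provides /lib/modules/<release>/build
# (the Debian/Ubuntu layout used by DKMS when building the nvidia module).
check_kernel_headers() {
    release="${1:-$(uname -r)}"        # default: the running kernel
    modules_root="${2:-/lib/modules}"  # overridable only for testing
    if [ -d "$modules_root/$release/build" ]; then
        echo "kernel headers present for $release"
        return 0
    fi
    echo "missing headers: sudo apt-get install linux-headers-$release" >&2
    return 1
}

# Usage on the VM:
#   check_kernel_headers || sudo apt-get install linux-headers-$(uname -r)
```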

Install the CUDA driver

Add NVIDIA's official repository on Debian/Ubuntu (for installing the CUDA driver)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
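The keyring URL above hard-codes `ubuntu2404`. A minimal sketch for deriving that path segment from `/etc/os-release` follows; it assumes NVIDIA's repository directories are named `<ID><VERSION_ID without dots>` (as in `ubuntu2404`) and that the same `cuda-keyring_1.1-1_all.deb` filename is published for the target release, both of which should be checked against the repository listing. `cuda_repo_distro` is a hypothetical helper.

```shell
# Hypothetical helper: print the distro segment of NVIDIA's repo URL,
# e.g. "ubuntu2404" for Ubuntu 24.04.
# Assumption: directory name = $ID followed by $VERSION_ID with dots removed.
cuda_repo_distro() (
    # Subshell body: sourcing os-release does not leak ID/VERSION_ID.
    . "${1:-/etc/os-release}"
    printf '%s%s\n' "$ID" "$(printf '%s' "$VERSION_ID" | tr -d .)"
)

# Usage on the VM:
#   wget "https://developer.download.nvidia.com/compute/cuda/repos/$(cuda_repo_distro)/x86_64/cuda-keyring_1.1-1_all.deb"
```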
  • Install the driver package cuda-drivers:

Install the CUDA driver on Debian/Ubuntu from NVIDIA's official repository
sudo apt-get -y install cuda-drivers
  • Reboot the VM's operating system
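Before the preparation step, dmesg showed nouveau driving the card. After the reboot, a quick sketch of how to see which of the two modules actually got loaded is to read `/proc/modules`, whose first field on each line is the module name. The file path is a parameter only so the function can be tested; `gpu_module_in_use` is a hypothetical helper.

```shell
# Hypothetical helper: report which GPU kernel module is loaded.
# /proc/modules lists one module per line, name first; "nvidia_drm" etc.
# are not matched because the pattern requires a space right after the name.
gpu_module_in_use() {
    modules_file="${1:-/proc/modules}"
    if grep -q '^nvidia ' "$modules_file"; then
        echo nvidia
    elif grep -q '^nouveau ' "$modules_file"; then
        echo nouveau
    else
        echo none
    fi
}

# Usage on the VM after rebooting:
#   gpu_module_in_use
```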

Verification

lspci lists the devices
00:00.0 Host bridge: Network Appliance Corporation Device 1275
00:04.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:05.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:06.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
00:07.0 VGA compatible controller: Device fb5d:40fb
00:1f.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]

Check the details of device 00:06.0 with lspci -v -s 00:06.0:

lspci shows that the Tesla P4 is bound to the nvidia driver (the official driver just installed)
00:06.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
	Subsystem: NVIDIA Corporation GP104GL [Tesla P4]
	Flags: bus master, fast devsel, latency 0, IRQ 37
	Memory at c1000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 800000000 (64-bit, prefetchable) [size=256M]
	Memory at 810000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
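When scripting this check, only the "Kernel driver in use" line matters; the "Kernel modules" line merely lists candidate modules. A small sketch that extracts the bound driver from `lspci -v` (or `lspci -k`) output on stdin; `lspci_driver_in_use` is a hypothetical helper name.

```shell
# Hypothetical helper: print the driver bound to a PCI device, reading
# `lspci -v -s <slot>` or `lspci -k` output from stdin.
lspci_driver_in_use() {
    # Keep only the "Kernel driver in use:" line and strip its label.
    sed -n 's/^[[:space:]]*Kernel driver in use: //p'
}

# Usage on the VM:
#   lspci -v -s 00:06.0 | lspci_driver_in_use
```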

Problem

  • Running nvidia-smi to check the NVIDIA device fails (no device is found):

    No devices were found
    
  • Checking the output of dmesg | grep -i nvidia reveals something odd:

The system log shows nvidia-drm failing to allocate NvKmsKapiDevice while loading the driver
[    3.138804] nvidia: loading out-of-tree module taints kernel.
[    3.138820] nvidia: module license 'NVIDIA' taints kernel.
[    3.138821] Disabling lock debugging due to kernel taint
[    3.138824] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    3.138825] nvidia: module license taints kernel.
[    3.235358] nvidia-nvlink: Nvlink Core is being initialized, major device number 239

[    3.238119] nvidia 0000:00:08.0: can't derive routing for PCI INT A
[    3.238537] nvidia 0000:00:08.0: PCI INT A: no GSI - using ISA IRQ 11
[    3.487802] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  570.172.08  Tue Jul  8 18:31:33 UTC 2025
[    3.517558] loop0: detected capacity change from 0 to 8
[    3.574312] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  570.172.08  Tue Jul  8 17:57:10 UTC 2025
[    3.646241] NET: Registered PF_QIPCRTR protocol family
[    3.781469] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x23:0xffff:1496)
[    3.837923] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 0
[    3.841712] [drm] [nvidia-drm] [GPU ID 0x00000008] Loading driver
[    3.842539] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x23:0xffff:1496)
[    3.842792] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 0
[    3.847906] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x23:0xffff:1496)
[    3.848132] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 0
[    3.848224] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000008] Failed to allocate NvKmsKapiDevice
[    3.850085] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000008] Failed to register device
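The same `RmInitAdapter failed!` line repeats several times for one GPU, so when comparing runs (for example before and after a driver change) it can be easier to reduce the log to one line per failing device and error code. A sketch that matches the message format in the log above; `nvrm_init_failures` is a hypothetical helper, and it does not interpret the error code.

```shell
# Hypothetical helper: from kernel log text on stdin, print each PCI address
# that hit "RmInitAdapter failed!" together with its error code, once.
nvrm_init_failures() {
    # \1 = PCI address (e.g. 0000:00:08.0), \2 = "(0x..:0x..:....)" code
    sed -n 's/.*NVRM: GPU \([0-9a-fA-F:.]*\): RmInitAdapter failed! \(([^)]*)\).*/\1 \2/p' |
        sort -u
}

# Usage on the VM:
#   sudo dmesg | nvrm_init_failures
```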
Roll back one version
sudo apt remove 'nvidia*' && \
sudo apt autoremove && \
sudo apt install --reinstall nvidia-driver-570
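After a rollback and reboot, it is worth confirming which kernel-module build was actually loaded, since the failing log already reported 570.172.08. A sketch that reads the version from the NVRM banner line in dmesg output; `nvrm_loaded_version` is a hypothetical helper, and the result could be compared with the module on disk as reported by `modinfo -F version nvidia`.

```shell
# Hypothetical helper: print the version from the last NVRM banner line,
# "NVRM: loading NVIDIA UNIX <arch> Kernel Module  <version>  <date>",
# read from kernel log text on stdin.
nvrm_loaded_version() {
    sed -n 's/.*NVRM: loading NVIDIA UNIX [^ ]* Kernel Module  *\([0-9.]*\).*/\1/p' |
        tail -n 1
}

# Usage on the VM after rebooting:
#   sudo dmesg | nvrm_loaded_version
```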

I may have to go back to "Implementing NVIDIA GPU passthrough in bhyve" to look for a solution.