Running Docker with a Tesla P4 GPU in an Ubuntu VM on bhyve

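I once tried to attach an Nvidia Tesla P4 GPU compute card to a Raspberry Pi 5 through a Raspberry Pi 5 PCIe-to-M.2 NVMe SSD adapter board, hoping to build a low-power Machine Learning environment, but I ran into several setbacks: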

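Installing the NVIDIA driver on Raspberry Pi OS (archived): in practice, the kernel module of the CUDA driver could not be compiled.
Running nvidia-docker containers with an NVIDIA P4 GPU on a Raspberry Pi: switching back to standard Ubuntu did allow the CUDA driver to install, but unfortunately the Raspberry Pi 5 hardware cannot support an external NVIDIA GPU, so the attempt failed.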

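So I went back to standard x86 hardware: a self-built desktop PC running FreeBSD, on which I plan to build a FreeBSD machine learning environment:

Use bhyve (BSD hypervisor) virtualization to run an Ubuntu Linux server
Build a Docker GPU environment for LLM (large language model) inference, scaling from a single GPU to multiple GPUs
Finally, schedule and run the workloads with Kubernetes

This article is the first step: installing NVIDIA CUDA on Ubuntu, inside an Ubuntu VM that runs under bhyve with NVIDIA GPU passthrough.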

Preparation

After the VM boots, check dmesg. Because the CUDA driver is not installed yet, the driver bound to the GPU is nouveau:
```
[Mon Jul 28 06:01:24 2025] nouveau 0000:00:06.0: NVIDIA GP104 (134000a1)
```
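Install the development tools the same way as when initializing a minimal Debian system as a headless server (mainly build-essential):

Install headless development tools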

sudo apt install build-essential cmake vim-nox python3-dev -y

The CUDA driver needs the kernel headers and the development toolchain to install, because its kernel module is compiled against the running kernel.

Install linux-headers:

sudo apt-get install linux-headers-$(uname -r)
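
Before installing the driver, it can help to confirm that the headers for the running kernel are really in place. The following is a minimal sketch, not part of the original procedure. It assumes the headers package exposes the build tree as `/lib/modules/<release>/build`, which is the usual layout on Ubuntu. The function name `kernel_headers_present` and its second "root prefix" parameter are hypothetical and exist only for illustration.

```shell
# Succeeds when the kernel build tree for the given release exists.
#   $1: kernel release (defaults to the running kernel, `uname -r`)
#   $2: root prefix (hypothetical helper parameter; empty on a real system)
# Assumption: the headers package provides /lib/modules/<release>/build.
kernel_headers_present() (
    release="${1:-$(uname -r)}"
    root="${2:-}"
    [ -e "${root}/lib/modules/${release}/build" ]
)

# Usage on a real system:
#   kernel_headers_present && echo "headers found" || echo "headers missing"
```

If the check fails after installing linux-headers, the running kernel and the installed headers probably do not match, for example after a kernel upgrade without a reboot.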

Install the CUDA driver

On NVIDIA's official NVIDIA CUDA Toolkit repo download page, select linux => x86_64 => Ubuntu => 24.04 => deb(network)

Install the CUDA driver on Debian/Ubuntu from NVIDIA's official software repository

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

Install the driver package cuda-drivers:

sudo apt-get -y install cuda-drivers

Reboot the VM's operating system

Verification

After rebooting, check the PCI devices. lspci lists the Nvidia Tesla P4 GPU compute card as follows:

Devices shown by lspci

00.0 Host bridge: Network Appliance Corporation Device 1275
04.0 SCSI storage controller: Red Hat, Inc. Virtio block device
05.0 Ethernet controller: Red Hat, Inc. Virtio network device
06.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
07.0 VGA compatible controller: Device fb5d:40fb
1f.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]

Check the details of device 00:06.0 with lspci -v -s 00:06.0:

lspci shows that the Tesla P4 now uses the nvidia driver (the official driver just installed)

00:06.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
	Subsystem: NVIDIA Corporation GP104GL [Tesla P4]
	Flags: bus master, fast devsel, latency 0, IRQ 37
	Memory at c1000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 800000000 (64-bit, prefetchable) [size=256M]
	Memory at 810000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
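
To compare the bound driver before and after installation (nouveau versus nvidia) without reading the whole listing, the relevant line can be extracted from the lspci output. This is a small sketch under the assumption that the output contains a `Kernel driver in use:` line, as in the listing above. The function name `gpu_driver_in_use` is hypothetical.

```shell
# Reads `lspci -v` (or `lspci -k`) output on stdin and prints the value of
# every "Kernel driver in use:" line, one driver name per device.
gpu_driver_in_use() {
    awk -F': ' '/Kernel driver in use:/ { print $2 }'
}

# Usage on a real system (device address taken from the lspci listing):
#   lspci -v -s 00:06.0 | gpu_driver_in_use
```

Note that a bound `nvidia` driver only shows that the module claimed the device. As the next section shows, it does not guarantee that the driver initialized the GPU successfully.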

Problems

Running nvidia-smi to check the NVIDIA device reveals a problem (no device is found):
```
No devices were found
```
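
For scripted checks, the number of GPUs the driver can see may be easier to test than the full nvidia-smi report. The sketch below assumes the `nvidia-smi -L` listing format, with one line per GPU beginning with `GPU <index>:`, and counts those lines. The function name `gpu_count` is hypothetical.

```shell
# Reads `nvidia-smi -L` output on stdin and prints how many GPUs are listed.
# Assumption: each detected GPU appears as a line starting with "GPU <index>:".
gpu_count() {
    awk '/^GPU [0-9]+:/ { n++ } END { print n + 0 }'
}

# Usage on a real system:
#   nvidia-smi -L | gpu_count
```

With the "No devices were found" message shown above, the function prints 0.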
Checking the log with dmesg | grep -i nvidia shows something odd:

The system log shows that nvidia-drm fails to allocate NvKmsKapiDevice while loading the driver

[    3.138804] nvidia: loading out-of-tree module taints kernel.
[    3.138820] nvidia: module license 'NVIDIA' taints kernel.
[    3.138821] Disabling lock debugging due to kernel taint
[    3.138824] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    3.138825] nvidia: module license taints kernel.
[    3.235358] nvidia-nvlink: Nvlink Core is being initialized, major device number 239

[    3.238119] nvidia 0000:00:08.0: can't derive routing for PCI INT A
[    3.238537] nvidia 0000:00:08.0: PCI INT A: no GSI - using ISA IRQ 11
[    3.487802] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  570.172.08  Tue Jul  8 18:31:33 UTC 2025
[    3.517558] loop0: detected capacity change from 0 to 8
[    3.574312] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  570.172.08  Tue Jul  8 17:57:10 UTC 2025
[    3.646241] NET: Registered PF_QIPCRTR protocol family
[    3.781469] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x23:0xffff:1496)
[    3.837923] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 0
[    3.841712] [drm] [nvidia-drm] [GPU ID 0x00000008] Loading driver
[    3.842539] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x23:0xffff:1496)
[    3.842792] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 0
[    3.847906] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x23:0xffff:1496)
[    3.848132] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 0
[    3.848224] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000008] Failed to allocate NvKmsKapiDevice
[    3.850085] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000008] Failed to register device
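
The log above repeats the same initialization failure several times. As a rough aid for comparing runs (for example before and after changing the driver version), the failures can be counted per PCI address. This is only a sketch that assumes the `NVRM: GPU <address>: RmInitAdapter failed` line format seen in this log. The function name `rm_init_failures` is hypothetical.

```shell
# Reads kernel log text on stdin and prints "<pci-address> <count>" for every
# GPU that logged "RmInitAdapter failed".
rm_init_failures() {
    awk '/NVRM: GPU .*RmInitAdapter failed/ {
        for (i = 1; i < NF; i++) {
            if ($i == "GPU") {
                addr = $(i + 1)
                sub(/:$/, "", addr)   # drop the trailing colon after the address
                count[addr]++
                break
            }
        }
    }
    END { for (a in count) print a, count[a] }'
}

# Usage on a real system:
#   sudo dmesg | rm_init_failures
```

The printed address can then be compared with the address that lspci reports for the GPU.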

I tried rolling the driver back one version (575 -> 570), but the problem persisted (see: Ubuntu 22.04.1 LTS, RTX 3060Ti, Failed to allocate NvKmsKapiDevice)

Roll back one version

sudo apt remove 'nvidia*' && \
sudo apt autoremove && \
sudo apt install --reinstall nvidia-driver-570

I will probably have to go back to the bhyve NVIDIA GPU passthrough setup to look for a solution