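Running Docker with a Tesla P4 GPU in an Ubuntu VM on bhyve
I once tried to build a low-power Machine Learning environment on Raspberry Pi 5 hardware by attaching an Nvidia Tesla P4 GPU compute card through a Raspberry Pi 5 PCIe to M.2 NVMe SSD adapter, but ran into several setbacks:
Installing the NVIDIA driver on Raspberry Pi OS (archived): in practice, the CUDA driver kernel module could not be compiled.
Installing an NVIDIA P4 GPU on the Raspberry Pi to run nvidia-docker containers: going back to standard Ubuntu did let the CUDA driver install, but unfortunately the Raspberry Pi 5 hardware cannot support an external NVIDIA GPU, so the attempt failed.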
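So I went back to standard x86 hardware: a self-built desktop running FreeBSD, on which I plan to build a FreeBSD machine learning environment:
Use bhyve (BSD hypervisor) virtualization to run an Ubuntu Linux server
Build a Docker GPU device environment for LLM (large language model) inference, going from a single GPU to multiple GPUs
Finally, schedule and run the workloads with Kubernetes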
This article is the first step: installing NVIDIA CUDA on Ubuntu, inside the Ubuntu VM that runs with NVIDIA GPU passthrough in bhyve.
Preparation
After booting the VM, check the kernel log:
dmesg
Because the CUDA driver has not been installed yet, the driver that shows up is nouveau:
[Mon Jul 28 06:01:24 2025] nouveau 0000:00:06.0: NVIDIA GP104 (134000a1)
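Besides reading dmesg, the loaded kernel modules can be checked directly. The following is a minimal sketch, assuming the standard /proc/modules format (module name in the first column); the function name module_loaded is hypothetical and not part of any NVIDIA tooling.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named kernel module is listed as loaded.
# $1: module name (note that nvidia-drm appears as nvidia_drm)
# $2: module list file; defaults to /proc/modules (overridable for testing)
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Report the state of the two drivers that can claim the GPU.
for m in nouveau nvidia; do
    if module_loaded "$m"; then
        echo "$m: loaded"
    else
        echo "$m: not loaded"
    fi
done
```

Before the CUDA driver is installed, this should report nouveau as loaded; after the install and reboot described below, nvidia is expected to take its place.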
Install the development tools the same way as for a headless server in the minimal Debian system initialization (mainly build-essential):
sudo apt install build-essential cmake vim-nox python3-dev -y
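The CUDA driver needs the kernel headers and the development toolchain to install, because its kernel module has to be compiled against the running kernel.
Install linux-headers:
sudo apt-get install linux-headers-$(uname -r)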
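Before installing the driver, it can help to confirm that the headers for the running kernel are actually in place. This is a minimal sketch, assuming the usual Debian/Ubuntu layout in which /lib/modules/<release>/build points at the installed headers; the function name headers_present is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the build directory for a kernel release exists.
# $1: kernel release; defaults to the running kernel
# $2: modules root; defaults to /lib/modules (overridable for testing)
headers_present() {
    release="${1:-$(uname -r)}"
    root="${2:-/lib/modules}"
    # -d follows symlinks, so a dangling "build" link counts as missing.
    [ -d "$root/$release/build" ]
}

if headers_present; then
    echo "kernel headers found for $(uname -r)"
else
    echo "kernel headers missing for $(uname -r)"
fi
```

If the check reports the headers as missing, the driver's kernel module build is likely to fail during package installation.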
Installing the CUDA driver
On the official NVIDIA CUDA Toolkit repo download page, select linux => x86_64 => Ubuntu => 24.04 => deb(network):
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
Install the driver package cuda-drivers:
sudo apt-get -y install cuda-drivers
Reboot the VM's operating system.
Check
After the reboot, check the PCI devices with lspci. The output shows the Nvidia Tesla P4 GPU compute card:
lspci device listing
00:00.0 Host bridge: Network Appliance Corporation Device 1275
00:04.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:05.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:06.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
00:07.0 VGA compatible controller: Device fb5d:40fb
00:1f.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
Check the details of device 00:06.0 with lspci -v -s 00:06.0:
lspci shows that the Tesla P4 is using the nvidia driver (the official driver just installed)
00:06.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
Subsystem: NVIDIA Corporation GP104GL [Tesla P4]
Flags: bus master, fast devsel, latency 0, IRQ 37
Memory at c1000000 (32-bit, non-prefetchable) [size=16M]
Memory at 800000000 (64-bit, prefetchable) [size=256M]
Memory at 810000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
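To confirm the driver binding in a script rather than by eye, the "Kernel driver in use" field can be pulled out of the lspci output. This is a minimal sketch that only parses text on stdin; the function name driver_in_use is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -v` (or `lspci -k`) text on stdin and
# print the kernel driver bound to the first device that reports one.
driver_in_use() {
    awk -F': ' '/Kernel driver in use:/ { print $2; exit }'
}

# Typical use inside the VM (the address is the one from the listing above):
#   lspci -v -s 00:06.0 | driver_in_use
```

For the output shown above this prints nvidia; before the driver install it would be expected to print nouveau.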
Problem
Running nvidia-smi to check the NVIDIA device reveals a problem: no device is found.
No devices were found
Checking the log with dmesg | grep -i nvidia shows something odd: nvidia-drm cannot allocate NvKmsKapiDevice when the driver loads:
[ 3.138804] nvidia: loading out-of-tree module taints kernel.
[ 3.138820] nvidia: module license 'NVIDIA' taints kernel.
[ 3.138821] Disabling lock debugging due to kernel taint
[ 3.138824] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 3.138825] nvidia: module license taints kernel.
[ 3.235358] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[ 3.238119] nvidia 0000:00:08.0: can't derive routing for PCI INT A
[ 3.238537] nvidia 0000:00:08.0: PCI INT A: no GSI - using ISA IRQ 11
[ 3.487802] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 570.172.08 Tue Jul 8 18:31:33 UTC 2025
[ 3.517558] loop0: detected capacity change from 0 to 8
[ 3.574312] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 570.172.08 Tue Jul 8 17:57:10 UTC 2025
[ 3.646241] NET: Registered PF_QIPCRTR protocol family
[ 3.781469] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x23:0xffff:1496)
[ 3.837923] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 0
[ 3.841712] [drm] [nvidia-drm] [GPU ID 0x00000008] Loading driver
[ 3.842539] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x23:0xffff:1496)
[ 3.842792] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 0
[ 3.847906] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x23:0xffff:1496)
[ 3.848132] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 0
[ 3.848224] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000008] Failed to allocate NvKmsKapiDevice
[ 3.850085] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000008] Failed to register device
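When the log is long, it can be easier to reduce it to the PCI addresses that failed to initialize, so they can be compared against the addresses lspci reports. This is a minimal sketch that assumes the NVRM message format shown above; the function name nvrm_failed_devices is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print each PCI
# address that logged "RmInitAdapter failed", one per line, de-duplicated.
nvrm_failed_devices() {
    sed -n 's/.*NVRM: GPU \([0-9a-fA-F:.]*\): RmInitAdapter failed.*/\1/p' | sort -u
}

# Typical use inside the VM (reading the kernel log may need root):
#   sudo dmesg | nvrm_failed_devices
```

For the log above this yields the single address 0000:00:08.0.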
I tried rolling back one version (575 -> 570), but the problem persists (see: Ubuntu 22.04.1 LTS, RTX 3060Ti, Failed to allocate NvKmsKapiDevice):
sudo apt remove 'nvidia*' && \
sudo apt autoremove && \
sudo apt install --reinstall nvidia-driver-570
The fix probably has to be found back in the bhyve NVIDIA GPU passthrough setup.