Installing the NVIDIA Driver on Raspberry Pi OS (Archived)
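Note
I took a detour while working through "Installing an NVIDIA P4 GPU on Raspberry Pi to run nvidia-docker containers": in the nvidia-driver installation step, building the kernel module with Dynamic Kernel Module Support (DKMS) cost me two days. To keep that record concise, I moved the driver installation process into this page as a set of learning notes.
For reference only.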
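Warning
Installing cuda-drivers on Raspberry Pi OS did NOT succeed!
I searched online for material on installing NVIDIA GPUs on a Raspberry Pi and found that almost all of it is vague, has unclear steps, or contradicts itself, so I could not confirm that anyone actually completed the installation. I therefore plan to switch to standard Ubuntu and restart the nvidia-driver installation from scratch.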
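Power on the Nvidia Tesla P4 GPU compute card first, then boot the Raspberry Pi 5 it is connected to. After logging in to the host system, running lspci shows that the Nvidia Tesla P4 GPU compute card is recognized: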
0001:00:00.0 PCI bridge: Broadcom Inc. and subsidiaries BCM2712 PCIe Bridge (rev 21)
0001:01:00.0 PCI bridge: ASMedia Technology Inc. ASM1182e 2-Port PCIe x1 Gen2 Packet Switch
0001:02:03.0 PCI bridge: ASMedia Technology Inc. ASM1182e 2-Port PCIe x1 Gen2 Packet Switch
0001:02:07.0 PCI bridge: ASMedia Technology Inc. ASM1182e 2-Port PCIe x1 Gen2 Packet Switch
0001:03:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
0002:00:00.0 PCI bridge: Broadcom Inc. and subsidiaries BCM2712 PCIe Bridge (rev 21)
0002:01:00.0 Ethernet controller: Raspberry Pi Ltd RP1 PCIe 2.0 South Bridge
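To script this check instead of reading the listing by eye, the sketch below filters lspci-style output for an NVIDIA device line. The function name has_nvidia_gpu is my own illustration and not part of any tool; it only looks for the vendor string that appears in the output above.

```shell
# has_nvidia_gpu: read lspci-style text on stdin and succeed (exit 0)
# if at least one line names NVIDIA Corporation as the vendor.
# Hypothetical helper name; the match string is taken from the lspci output above.
has_nvidia_gpu() {
    grep -qi 'nvidia corporation'
}

# Usage on the host (needs the pciutils package for lspci):
#   lspci | has_nvidia_gpu && echo "NVIDIA GPU detected on the PCIe bus"
```

A device showing up in lspci only proves that the PCIe link works; it says nothing about whether a driver can be built or loaded.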
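Installing nvidia-driver on the host
As described earlier, I plan to deploy Docker on the Raspberry Pi 5 host (as a Kubernetes node), so the host only needs cuda-drivers.
Note
In "Installing the NVIDIA Linux driver" I have used two ways to install cuda-drivers:
Manually downloading and installing the P40 driver provided by NVIDIA
Installing the NVIDIA CUDA driver from the Linux distribution software repository
This time I use the latter, the software repository method.
Preparation
Check and prepare according to "CUDA installation preparation":
My Raspberry Pi 5 has only 8GB of memory, so enabling the Above 4G Decoding BIOS setting is not recommended (the board most likely has no such BIOS option anyway).
Verify that gcc is installed and check its version: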
gcc --version
The output shows that gcc 12 is currently installed:
gcc (Debian 12.2.0-14+deb12u1) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
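The CUDA driver needs the kernel headers and development tools to install its kernel driver, because the kernel module is compiled against the running kernel. Following the Debian / Ubuntu Linux procedure, install the header package that matches the running kernel version: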
apt-get install linux-headers-$(uname -r)
Following Raspberry Pi Documentation: The Linux kernel#kernel-headers, I also installed the Raspberry Pi specific linux-headers:
apt install linux-headers-rpi-v8
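Before starting a DKMS build it can help to confirm that headers for the running kernel are really in place. The sketch below assumes the usual Debian layout, where /lib/modules/<version>/build points at the header tree and contains a Makefile; check_headers is a hypothetical helper name.

```shell
# check_headers KVER [MODROOT]
# Succeeds if MODROOT/KVER/build/Makefile exists, which is where kernel
# module builds (including DKMS) expect to find the header tree.
# MODROOT defaults to /lib/modules; the second argument exists only so
# the check can be pointed at another directory.
check_headers() {
    kver="$1"
    root="${2:-/lib/modules}"
    [ -e "$root/$kver/build/Makefile" ]
}

if check_headers "$(uname -r)"; then
    echo "kernel headers found for $(uname -r)"
else
    echo "kernel headers missing for $(uname -r)"
fi
```

If several kernels are installed, as in the DKMS log later on this page, the same check can be run once per directory under /lib/modules.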
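However, the later CUDA software repository installation hit the same compilation errors with either header package, so compiling nvidia-driver on Raspberry Pi OS appears to be problematic.
CUDA software repository
Download from the NVIDIA CUDA Toolkit repo provided by NVIDIA.
Because the Raspberry Pi 5 is an ARM machine, I selected:
Linux >> arm64-sbsa (Server Base System Architecture) >> Native >> Ubuntu >> 22.04 >> deb (network)
The Compilation step offers Native (compile code only for the same architecture) and Cross (compile code for a different architecture); I chose Native.
For the Ubuntu version I chose 22.04, which roughly corresponds to Debian 12 (bookworm); Ubuntu 24.04 would correspond to Debian 13.
Installation steps: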
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/sbsa/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
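Installing cuda-drivers from the repository
Note
Installing cuda-drivers over the network from the software repository requires the matching linux-headers to be installed on the host.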
sudo apt-get -y install cuda-drivers
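The installation uses Dynamic Kernel Module Support (DKMS) to compile the NVIDIA kernel module. It reports that /etc/modprobe.d/nvidia-graphics-drivers.conf was added to blacklist the conflicting open-source Nouveau driver so that it is not loaded, and that the operating system must be rebooted before the driver can be loaded and verified.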
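The blacklist entry can be checked before rebooting. The sketch below assumes the file uses the standard modprobe.d syntax, a line of the form "blacklist nouveau"; nouveau_blacklisted is a hypothetical helper name, and I have not inspected the exact contents the package writes.

```shell
# nouveau_blacklisted [CONF]
# Succeeds if CONF is readable and contains a "blacklist nouveau" line.
# CONF defaults to the path reported by the installer.
nouveau_blacklisted() {
    conf="${1:-/etc/modprobe.d/nvidia-graphics-drivers.conf}"
    [ -r "$conf" ] && grep -Eq '^[[:space:]]*blacklist[[:space:]]+nouveau' "$conf"
}

# After the reboot, these commands show which driver was actually loaded:
#   lsmod | grep -E '^(nvidia|nouveau)'
#   nvidia-smi
```

On my system the build failed before this point, so the post-reboot commands are listed only as the checks I would have run.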
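Local installation of the CUDA software
Note
A local installation of cuda-drivers requires the kernel source code to be available locally. Here I follow Raspberry Pi Documentation: The Linux kernel#Build the kernel to download the Raspberry Pi kernel source.
For the NVIDIA cards listed in Raspberry Pi PCIe Database#GPUs (Graphics Cards) on Jeff Geerling's website, he downloaded the latest driver installer package and ran it locally: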
chmod +x NVIDIA-Linux-aarch64-575.64.03.run
./NVIDIA-Linux-aarch64-575.64.03.run
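The local installer asks for the source tree of the currently running kernel and fails without it.
Download the Raspberry Pi kernel source as described in Raspberry Pi Documentation: The Linux kernel#Build the kernel: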
git clone --depth=1 https://github.com/raspberrypi/linux /usr/src/linux
# the running kernel version is 6.12.34+rpt-rpi-2712
cp /boot/config-6.12.34+rpt-rpi-2712 /usr/src/linux/.config
Warning
Running the installer failed here with an error saying that version.h does not exist:
ERROR: Neither the '/usr/src/linux/include/linux/version.h' nor the
'/usr/src/linux/include/generated/uapi/linux/version.h' kernel header file
exists. The most likely reason for this is that the kernel source files in
'/usr/src/linux' have not been configured.
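I get the impression that completing the nvidia-drivers installation on Raspberry Pi OS really is difficult. None of the examples online clearly describe an installation on Raspberry Pi OS (they lack detailed steps or the steps contradict each other), so I think I need to switch to standard Ubuntu to finish the job.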
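The installer message points at a source tree that was copied but never prepared. As a sketch, the function below tests for the two files named in the error; kernel_tree_configured is a hypothetical helper name. The commented commands use the kernel's standard olddefconfig and modules_prepare targets, which generate include/generated/uapi/linux/version.h; I did not verify that this is enough for the NVIDIA installer to succeed on Raspberry Pi OS.

```shell
# kernel_tree_configured [SRC]
# Succeeds if SRC contains either version.h location named in the
# installer error. SRC defaults to /usr/src/linux.
kernel_tree_configured() {
    src="${1:-/usr/src/linux}"
    [ -f "$src/include/linux/version.h" ] ||
        [ -f "$src/include/generated/uapi/linux/version.h" ]
}

# Untested follow-up on the cloned tree (run as root):
#   make -C /usr/src/linux olddefconfig
#   make -C /usr/src/linux modules_prepare
#   kernel_tree_configured /usr/src/linux && echo "tree prepared"
```

Note that a shallow clone of the default branch is not guaranteed to match the running 6.12.34 kernel exactly, which is a separate risk from the missing version.h.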
cuda-drivers installation error: stdarg.h
I ran into the following error here (a kernel module compilation error):
..
Loading new nvidia/575.57.08 DKMS files...
Building for 6.6.51+rpt-rpi-2712, 6.6.51+rpt-rpi-v8, 6.12.25+rpt-rpi-2712 and 6.12.25+rpt-rpi-v8
Building initial module nvidia/575.57.08 for 6.6.51+rpt-rpi-2712
The kernel is built without module signing facility, modules won't be signed
Building module(s)...........(bad exit status: 2)
Failed command:
'make' -j4 KERNEL_UNAME=6.6.51+rpt-rpi-2712 IGNORE_CC_MISMATCH=1 SYSSRC=/lib/modules/6.6.51+rpt-rpi-2712/build LD=/usr/bin/ld.bfd
CONFIG_X86_KERNEL_IBT= modules
Error! Bad return status for module build on kernel: 6.6.51+rpt-rpi-2712 (aarch64)
Consult /var/lib/dkms/nvidia/575.57.08/build/make.log for more information.
dpkg: error processing package nvidia-dkms-575 (--configure):
installed nvidia-dkms-575 package post-installation script subprocess returned error exit status 10
dpkg: dependency problems prevent configuration of nvidia-driver-575:
nvidia-driver-575 depends on nvidia-dkms-575 (= 575.57.08-0ubuntu1); however:
Package nvidia-dkms-575 is not configured yet.
dpkg: error processing package nvidia-driver-575 (--configure):
dependency problems - leaving unconfigured
Setting up libglx-mesa0:arm64 (24.2.8-1~bpo12+rpt3) ...
Setting up libglx0:arm64 (1.6.0-1) ...
dpkg: dependency problems prevent configuration of cuda-drivers-575:
cuda-drivers-575 depends on nvidia-driver-575 (>= 575.57.08) | nvidia-driver-575-open (>= 575.57.08) | nvidia-driver-575-server
(>= 575.57.08) | nvidia-driver-575-server-open (>= 575.57.08); however:
Package nvidia-driver-575 is not configured yet.
Package nvidia-driver-575-open is not installed.
Package nvidia-driver-575-server is not installed.
Package nvidia-driver-575-server-open is not installed.
dpkg: error processing package cuda-drivers-575 (--configure):
dependency problems - leaving unconfigured
Setting up libgl1:arm64 (1.6.0-1) ...
dpkg: dependency problems prevent configuration of cuda-drivers:
cuda-drivers depends on cuda-drivers-575 (= 575.57.08-0ubuntu1); however:
Package cuda-drivers-575 is not configured yet.
dpkg: error processing package cuda-drivers (--configure):
dependency problems - leaving unconfigured
...
Checking the error log /var/lib/dkms/nvidia/575.57.08/build/make.log shows that stdarg.h is missing:
CONFTEST: ib_peer_memory_symbols
CC [M] /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-platform.o
CC [M] /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-dsi-parse-panel-props.o
CC [M] /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-bpmp.o
CC [M] /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-gpio.o
In file included from /var/lib/dkms/nvidia/575.57.08/build/common/inc/conftest.h:28,
from /var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:29,
from /var/lib/dkms/nvidia/575.57.08/build/common/inc/os-interface.h:40,
from /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-dsi-parse-panel-props.c:26:
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:77:2: error: #error dma_buf_export() conftest failed!
77 | #error dma_buf_export() conftest failed!
| ^~~~~
In file included from /var/lib/dkms/nvidia/575.57.08/build/common/inc/conftest.h:28,
from /var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:29,
from /var/lib/dkms/nvidia/575.57.08/build/common/inc/os-interface.h:40,
from /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-bpmp.c:26:
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:77:2: error: #error dma_buf_export() conftest failed!
77 | #error dma_buf_export() conftest failed!
| ^~~~~
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:94:2: error: #error radix_tree_replace_slot() conftest failed!
94 | #error radix_tree_replace_slot() conftest failed!
| ^~~~~
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:94:2: error: #error radix_tree_replace_slot() conftest failed!
94 | #error radix_tree_replace_slot() conftest failed!
| ^~~~~
/var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:33:14: fatal error: stdarg.h: No such file or directory
33 | #include <stdarg.h>
| ^~~~~~~~~~
/var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:33:14: fatal error: stdarg.h: No such file or directory
33 | #include <stdarg.h>
| ^~~~~~~~~~
compilation terminated.
compilation terminated.
make[3]: *** [/usr/src/linux-headers-6.6.51+rpt-common-rpi/scripts/Makefile.build:248: /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-dsi-parse-panel-props.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[3]: *** [/usr/src/linux-headers-6.6.51+rpt-common-rpi/scripts/Makefile.build:248: /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-bpmp.o] Error 1
In file included from /var/lib/dkms/nvidia/575.57.08/build/common/inc/conftest.h:28,
from /var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:29,
from /var/lib/dkms/nvidia/575.57.08/build/common/inc/os-interface.h:40,
from /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-gpio.c:26:
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:77:2: error: #error dma_buf_export() conftest failed!
77 | #error dma_buf_export() conftest failed!
| ^~~~~
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:94:2: error: #error radix_tree_replace_slot() conftest failed!
94 | #error radix_tree_replace_slot() conftest failed!
| ^~~~~
/var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:33:14: fatal error: stdarg.h: No such file or directory
33 | #include <stdarg.h>
| ^~~~~~~~~~
compilation terminated.
make[3]: *** [/usr/src/linux-headers-6.6.51+rpt-common-rpi/scripts/Makefile.build:248: /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-gpio.o] Error 1
In file included from /var/lib/dkms/nvidia/575.57.08/build/common/inc/conftest.h:28,
from /var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:29,
from /var/lib/dkms/nvidia/575.57.08/build/common/inc/nv.h:41,
from /var/lib/dkms/nvidia/575.57.08/build/common/inc/nv-linux.h:28,
from /var/lib/dkms/nvidia/575.57.08/build/common/inc/nv-platform.h:27,
from /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-platform.c:32:
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:77:2: error: #error dma_buf_export() conftest failed!
77 | #error dma_buf_export() conftest failed!
| ^~~~~
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:94:2: error: #error radix_tree_replace_slot() conftest failed!
94 | #error radix_tree_replace_slot() conftest failed!
| ^~~~~
/var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:33:14: fatal error: stdarg.h: No such file or directory
33 | #include <stdarg.h>
| ^~~~~~~~~~
compilation terminated.
make[3]: *** [/usr/src/linux-headers-6.6.51+rpt-common-rpi/scripts/Makefile.build:248: /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-platform.o] Error 1
make[2]: *** [/usr/src/linux-headers-6.6.51+rpt-common-rpi/Makefile:1946: /var/lib/dkms/nvidia/575.57.08/build] Error 2
make[1]: *** [/usr/src/linux-headers-6.6.51+rpt-common-rpi/Makefile:246: __sub-make] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-6.6.51+rpt-rpi-2712'
make: *** [Makefile:140: modules] Error 2
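Because the same few messages repeat for every source file, it can be easier to condense the log into a count per distinct error. This is a minimal sketch that relies only on the "error: " marker gcc prints in the lines above; summarize_errors is a hypothetical helper name.

```shell
# summarize_errors: read a compiler log on stdin, keep only lines that
# contain "error: ", strip the file/line prefix, and print each distinct
# message with its number of occurrences, most frequent first.
summarize_errors() {
    grep 'error: ' |
        sed 's/^.*error: //' |
        sort |
        uniq -c |
        sort -rn
}

# Usage:
#   summarize_errors < /var/lib/dkms/nvidia/575.57.08/build/make.log
```

Applied to the log above, this would reduce the output to three distinct messages: the two conftest failures and the missing stdarg.h.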
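The make.log message stdarg.h: No such file or directory appears to refer to the header shipped with gcc: /usr/lib/gcc/aarch64-linux-gnu/12/include/stdarg.h
The issue "#include <stdarg.h> missing in 418.113 #46" notes that the include has to be adjusted depending on the kernel version: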
#if LINUX_VERSION_CODE < KERNEL_VERSION(5, 15, 0)
#include <stdarg.h>
#else
#include <linux/stdarg.h>
#endif
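The same 5.15 threshold can be tested from the shell to see which branch of that conditional applies to an installed kernel. This is a sketch that assumes sort -V (version sort, available in GNU coreutils) and a hypothetical helper name kernel_at_least; it strips the distribution suffix such as +rpt-rpi-2712 before comparing.

```shell
# kernel_at_least WANT HAVE
# Succeeds if kernel version HAVE is greater than or equal to WANT.
# Anything from the first "-" or "+" onward in HAVE is ignored.
kernel_at_least() {
    want="$1"
    have="${2%%[-+]*}"
    lowest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

if kernel_at_least 5.15 "$(uname -r)"; then
    echo "kernel uses <linux/stdarg.h>"
else
    echo "kernel uses <stdarg.h>"
fi
```

Both kernels in the DKMS log (6.6.51 and 6.12.25) fall on the linux/stdarg.h side of the check.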
Check where stdarg.h is located:
find /usr -name stdarg.h 2>/dev/null
The output shows that the Linux kernel headers do provide it as linux/stdarg.h:
/usr/include/c++/12/tr1/stdarg.h
/usr/lib/gcc/aarch64-linux-gnu/12/include/stdarg.h
/usr/src/linux-headers-6.12.34+rpt-common-rpi/include/linux/stdarg.h
/usr/src/linux-headers-6.12.25+rpt-common-rpi/include/linux/stdarg.h
The issue "nvidia installer can't find stdarg.h #6" suggests creating a symlink named stdarg.h that points to linux/stdarg.h in the include directory of the kernel headers. This approach seemed reasonable too:
cd /usr/src/linux-headers-6.12.34+rpt-common-rpi/include/
ln -s linux/stdarg.h stdarg.h
cd /usr/src/linux-headers-6.12.25+rpt-common-rpi/include/
ln -s linux/stdarg.h stdarg.h
After reinstalling, the missing stdarg.h problem was resolved (although other errors remained).
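Since the symlink has to be recreated for every installed header package, the workaround can be wrapped in a small idempotent function. This is a sketch of the same ln -s step; link_stdarg is a hypothetical helper name, and the glob in the usage comment is inferred from the two directory names above.

```shell
# link_stdarg INCDIR
# Create INCDIR/stdarg.h as a relative symlink to linux/stdarg.h.
# Fails if INCDIR/linux/stdarg.h does not exist; does nothing if
# INCDIR/stdarg.h is already present, so it is safe to run twice.
link_stdarg() {
    inc="$1"
    [ -f "$inc/linux/stdarg.h" ] || return 1
    [ -e "$inc/stdarg.h" ] || ln -s linux/stdarg.h "$inc/stdarg.h"
}

# Usage (run as root):
#   for d in /usr/src/linux-headers-*-common-rpi/include; do
#       link_stdarg "$d" || echo "skipped $d"
#   done
```

The symlink modifies files owned by the header packages, so it is lost whenever those packages are upgraded.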
cc1: some warnings being treated as errors
With the missing stdarg.h problem fixed, the build log shows a large number of errors, and many lines read:
cc1: some warnings being treated as errors
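Wondering whether the build fails because warnings are treated as errors, I wanted to adjust the CFLAGS used by make. Gentoo Linux has a global /etc/make.conf (/etc/portage/make.conf on current releases) for setting compile options; how is this done on Debian?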
CFLAGS=" -Wno-error=..."
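gcc provides an option to stop treating a specific warning as an error. In the build log above, many of the warnings are -Wmissing-prototypes, so to ignore that warning -Wno-error=missing-prototypes should be added to CFLAGS.
The usual way to change this is to edit the makefile in the project directory, for example: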
# Suppressing all warnings with -w is only an example; it is not recommended in practice
override CFLAGS += -w
app: main.c
	gcc $(CFLAGS) -o app main.c
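Rather than collecting warning names by hand, the flags can be derived from the build log. The sketch below relies on gcc's habit of ending a promoted warning with a tag such as [-Werror=missing-prototypes]; wno_error_flags is a hypothetical helper name, and the sample line in the usage comment is illustrative rather than copied from my log.

```shell
# wno_error_flags: read a gcc build log on stdin and print one
# -Wno-error=<name> flag for every distinct [-Werror=<name>] tag,
# sorted and joined on a single line.
wno_error_flags() {
    grep -o '\[-Werror=[^]]*\]' |
        sed 's/^\[-Werror=/-Wno-error=/; s/\]$//' |
        sort -u |
        paste -sd ' ' -
}

# Usage:
#   wno_error_flags < /var/lib/dkms/nvidia/575.57.08/build/make.log
# Illustrative input line:
#   foo.c:10:6: error: no previous prototype for 'bar' [-Werror=missing-prototypes]
```

The resulting string is what would go into CFLAGS, provided the build system offers a place to put it.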
dpkg-buildflags
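dpkg-buildflags returns the build flags to use during package builds. On Debian the system-wide configuration is /etc/dpkg/buildflags.conf, and the per-user configuration is $XDG_CONFIG_HOME/dpkg/buildflags.conf. $XDG_CONFIG_HOME defaults to $HOME/.config, so the current user's file is ~/.config/dpkg/buildflags.conf.
However, my case is not a build of a source package; see How to override dpkg-buildflags CFLAGS? for details. The dpkg-buildflags method only applies when building a source package downloaded with apt-get source <pkg-name>.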
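Adjusting the Dynamic Kernel Module Support (DKMS) build options
The module is built by Dynamic Kernel Module Support (DKMS). According to building with clang rather than gcc #124, dkms uses a dkms.conf file to control the build. I searched and found it at /usr/src/nvidia-575.57.08/dkms.conf, but I did not find any reference on how to modify it.
However, grepping the /usr/src/nvidia-575.57.08 source directory shows that the Kbuild file there contains a -Wno-error setting. The current configuration is: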
...
NV_CONFTEST_CFLAGS = $(NV_CFLAGS_FROM_CONFTEST) $(ccflags-y) -fno-pie
NV_CONFTEST_CFLAGS += $(filter -std=%,$(KBUILD_CFLAGS))
NV_CONFTEST_CFLAGS += $(call cc-disable-warning,pointer-sign)
NV_CONFTEST_CFLAGS += $(call cc-option,-fshort-wchar,)
NV_CONFTEST_CFLAGS += $(call cc-option,-Werror=incompatible-pointer-types,)
NV_CONFTEST_CFLAGS += -Wno-error
...
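The grep step can be made repeatable with a small function that lists every line of a build file mentioning -Werror or -Wno-error, together with its line number. This is a sketch; list_werror_flags is a hypothetical helper name. The commented dkms commands are the standard way to retry the build after editing Kbuild, which I have not confirmed resolves the errors.

```shell
# list_werror_flags FILE
# Print "line-number:line" for every line in FILE that contains
# -Werror, -Werror=<name>, -Wno-error, or -Wno-error=<name>.
list_werror_flags() {
    grep -nE -- '-W(no-)?error' "$1"
}

# Usage:
#   list_werror_flags /usr/src/nvidia-575.57.08/Kbuild
# Retry the module build afterwards (run as root):
#   dkms build nvidia/575.57.08 -k "$(uname -r)"
#   dkms status
```

Note that the lines quoted above all set NV_CONFTEST_CFLAGS, which by its name applies to the conftest probes rather than to the module sources themselves.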
References
Using NVIDIA GPU within Docker Containers (before installing NVIDIA Container Toolkit, first follow the CUDA Installation Guide for Linux to install cuda-driver)
Installing the NVIDIA Container Toolkit: since August 2023, nvidia-docker has been replaced by NVIDIA Container Toolkit, so this deployment supersedes my earlier "Running NVIDIA containers in Docker" notes
How can I compile without warnings being treated as errors? and How to suppress all warnings being treated as errors for format-truncation: how to ignore specific warnings
Append to GNU 'make' variables via the command line: how to set CFLAGS for a makefile
Configuration cflags compilation of debian: suggests using dpkg-buildflags to adjust compile options; see the dpkg-buildflags man pages for details