Installing the NVIDIA Driver on Raspberry Pi OS (Archived)

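Note

My attempt at "Running nvidia-docker containers with an NVIDIA P4 GPU on a Raspberry Pi" took a detour: at the nvidia-driver installation step I spent two days fighting the Dynamic Kernel Module Support (DKMS) kernel module build. To keep that record concise, I moved the driver installation process into this article as a set of study notes.

For reference only

Warning

Installing cuda-drivers on Raspberry Pi OS did NOT succeed!!!

I searched online for material about installing NVIDIA GPUs on a Raspberry Pi and found that almost all of it was vague, with unclear or contradictory steps, so I could not confirm that anyone had actually completed the installation. I therefore plan to switch to standard Ubuntu and start the nvidia-driver installation over.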

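Power on the Nvidia Tesla P4 GPU compute card first, then boot the attached Raspberry Pi 5. Once the host system is up, running lspci shows that the Nvidia Tesla P4 GPU compute card has been detected:

Checking for the detected Nvidia Tesla P4 GPU compute card after boot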
0001:00:00.0 PCI bridge: Broadcom Inc. and subsidiaries BCM2712 PCIe Bridge (rev 21)
0001:01:00.0 PCI bridge: ASMedia Technology Inc. ASM1182e 2-Port PCIe x1 Gen2 Packet Switch
0001:02:03.0 PCI bridge: ASMedia Technology Inc. ASM1182e 2-Port PCIe x1 Gen2 Packet Switch
0001:02:07.0 PCI bridge: ASMedia Technology Inc. ASM1182e 2-Port PCIe x1 Gen2 Packet Switch
0001:03:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
0002:00:00.0 PCI bridge: Broadcom Inc. and subsidiaries BCM2712 PCIe Bridge (rev 21)
0002:01:00.0 Ethernet controller: Raspberry Pi Ltd RP1 PCIe 2.0 South Bridge
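
When the PCIe tree is larger, it can help to filter the listing down to the GPU. The following is a minimal sketch, not part of the original procedure; `filter_nvidia_devices` is a hypothetical helper name, and matching on the vendor string is a simple heuristic.

```shell
#!/bin/sh
# Print only the NVIDIA devices from lspci-style text read on stdin.
# Matching on the vendor name is a heuristic; on a live system,
# `lspci -d 10de:` (PCI vendor ID 0x10de is NVIDIA) is more precise.
filter_nvidia_devices() {
    grep -i 'NVIDIA Corporation'
}

# Usage on the host (not run automatically here):
#   lspci | filter_nvidia_devices
# `lspci -k -s 0001:03:00.0` additionally shows which kernel driver, if any,
# is bound to the card.
```

With the listing above, only the `0001:03:00.0 3D controller` line should remain.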

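Installing nvidia-driver on the host

As described above, I plan to run Docker on the Raspberry Pi 5 host (as a Kubernetes node), so the host only needs cuda-drivers.

Note

In "Installing the NVIDIA Linux Driver" I used two different ways to install cuda-drivers.

This time I used the latter of the two, the software repository method.

Preparation

Check and prepare the system following "Preparing to install CUDA":

Checking the installed gcc version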
gcc --version

The output shows that gcc 12 is currently installed:

The installed gcc version is gcc 12
gcc (Debian 12.2.0-14+deb12u1) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
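
For scripted checks, the major version can be pulled out of that banner. This is a small sketch under the assumption that the version number is the last field of the first line, as in the output above; `gcc_major` is a hypothetical helper name.

```shell
#!/bin/sh
# Read `gcc --version` output on stdin and print the major version number.
# Assumes the first line ends with the full version, e.g. "... 12.2.0".
gcc_major() {
    awk 'NR == 1 { split($NF, v, "."); print v[1] }'
}

# Usage on the host (not run automatically here):
#   gcc --version | gcc_major
# The compiler that built the running kernel is recorded in /proc/version,
# which can be compared by eye against this value.
```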
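  • The CUDA driver needs the kernel headers and development tools to install its kernel components, because the kernel driver is compiled against the running kernel. Install the header package matching the kernel version, as on Debian / Ubuntu Linux:

Installing the header package matching the kernel version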
apt-get install linux-headers-$(uname -r)

I also followed Raspberry Pi Documentation: The Linux kernel#kernel-headers and installed the Raspberry Pi specific linux-headers:

Installing the Raspberry Pi OS specific linux-headers
apt install linux-headers-rpi-v8

However, the subsequent CUDA software repository installation hit the same build errors either way, so building nvidia-driver on Raspberry Pi OS appears to be problematic.
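
DKMS builds against every installed kernel, not just the running one, so it may be worth confirming that each kernel has headers before starting. The sketch below assumes the usual layout in which `/lib/modules/<kernel>/build` points at the header tree; `kernels_missing_headers` is a hypothetical helper name.

```shell
#!/bin/sh
# List kernels under a modules root (default /lib/modules) that have no
# usable "build" directory, i.e. no headers a module build could use.
kernels_missing_headers() {
    root=${1:-/lib/modules}
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue
        kernel=$(basename "$dir")
        # "build" is normally a symlink into /usr/src/linux-headers-*
        [ -e "$dir/build/Makefile" ] || echo "$kernel"
    done
}

# Usage on the host (not run automatically here):
#   kernels_missing_headers
```

An empty result would mean every installed kernel has a header tree with a top-level Makefile; it says nothing about whether the module then compiles.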

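CUDA software repository

Download from the official NVIDIA CUDA Toolkit repo.

  • Because the Raspberry Pi 5 is an ARM machine, I selected Linux >> arm64-sbsa (Server Base System Architecture) >> Native >> Ubuntu >> 22.04 >> deb (network)

    • The Compilation step offers Native (build code for the same architecture only) and Cross (build code for a different architecture); I chose Native

    • I chose Ubuntu 22.04 as the closest match to Debian 12 (bookworm); Ubuntu 24.04 would correspond to Debian 13

  • Installation steps:

Installing the repository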
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/sbsa/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
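
The selection made on the download page ends up encoded in the keyring URL. The sketch below only rebuilds that URL from its parts; `cuda_keyring_url` is a hypothetical helper, the keyring version 1.1-1 is copied from the command above, and the assumption that other distro/architecture pairs follow the same layout is not verified here.

```shell
#!/bin/sh
# Build the cuda-keyring download URL for a distro/architecture pair.
# The path layout and the 1.1-1 version are taken from the single URL
# used above; other combinations are an assumption.
cuda_keyring_url() {
    distro=$1   # e.g. ubuntu2204
    arch=$2     # e.g. sbsa (arm64 server)
    echo "https://developer.download.nvidia.com/compute/cuda/repos/${distro}/${arch}/cuda-keyring_1.1-1_all.deb"
}

# Usage (not run automatically here):
#   wget "$(cuda_keyring_url ubuntu2204 sbsa)"
```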

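Installing cuda-drivers from the repository

Note

Installing cuda-drivers over the network from the software repository requires the matching linux-headers to be installed on the host.

Installing the CUDA driver on Debian/Ubuntu from the official NVIDIA software repository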
sudo apt-get -y install cuda-drivers

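During installation, Dynamic Kernel Module Support (DKMS) is used to build the NVIDIA kernel module. The installer reports that it has added /etc/modprobe.d/nvidia-graphics-drivers.conf to blacklist the conflicting open-source Nouveau driver, and that a reboot is needed to finish loading and verifying the driver.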
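
Whether the Nouveau blacklist is actually in place can be checked before rebooting. A minimal sketch, assuming the blacklist is written as a plain `blacklist nouveau` line in some `.conf` file under `/etc/modprobe.d`; `nouveau_blacklisted` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if any *.conf file in the given modprobe.d directory
# (default /etc/modprobe.d) contains a "blacklist nouveau" line.
nouveau_blacklisted() {
    dir=${1:-/etc/modprobe.d}
    grep -qsE '^[[:space:]]*blacklist[[:space:]]+nouveau' "$dir"/*.conf
}

# Usage on the host (not run automatically here):
#   nouveau_blacklisted && echo "nouveau is blacklisted"
# After a successful build and reboot, `lsmod | grep nvidia` and `nvidia-smi`
# would be the usual checks that the module loaded; in this write-up the
# build never got that far.
```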

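CUDA local installation

Note

Installing cuda-drivers locally requires the kernel source to be available locally. Here I downloaded the Raspberry Pi kernel source following Raspberry Pi Documentation: The Linux kernel#Build the kernel.

Jeff Geerling's site lists NVIDIA graphics cards under Raspberry Pi PCIe Database#GPUs (Graphics Cards); he downloaded the latest driver installer package and ran it locally.

Local installation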
chmod +x NVIDIA-Linux-aarch64-575.64.03.run
./NVIDIA-Linux-aarch64-575.64.03.run

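The local installer asks for the source tree of the currently running kernel and fails without it.

Download the Raspberry Pi kernel source following Raspberry Pi Documentation: The Linux kernel#Build the kernel:

Downloading the Raspberry Pi kernel source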
git clone --depth=1 https://github.com/raspberrypi/linux

# The running kernel version is 6.12.34+rpt-rpi-2712
cp /boot/config-6.12.34+rpt-rpi-2712 /usr/src/linux/.config

Warning

Running the installer produced an error:

The installer reports that the kernel's version.h does not exist
ERROR: Neither the '/usr/src/linux/include/linux/version.h' nor the 
'/usr/src/linux/include/generated/uapi/linux/version.h' kernel header file 
exists. The most likely reason for this is that the kernel source files in 
'/usr/src/linux' have not been configured.
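
The message points at a source tree that has a `.config` but has not been prepared. The following sketch only tests for the two files the installer names; `kernel_tree_prepared` is a hypothetical helper, and the suggested `modules_prepare` step is an assumption that was not tried in this write-up.

```shell
#!/bin/sh
# Succeed if the kernel source tree (default /usr/src/linux) already has
# one of the version.h files named in the installer's error message.
kernel_tree_prepared() {
    src=${1:-/usr/src/linux}
    [ -f "$src/include/generated/uapi/linux/version.h" ] ||
        [ -f "$src/include/linux/version.h" ]
}

# Usage on the host (not run automatically here):
#   kernel_tree_prepared /usr/src/linux || echo "tree not prepared"
# Possible follow-up (assumption, not verified here): the kernel's own
# `make -C /usr/src/linux modules_prepare` target generates the headers
# that out-of-tree module builds need.
```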

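It really does seem hard to complete the nvidia-drivers installation on Raspberry Pi OS. None of the examples I found online clearly documents an installation on Raspberry Pi OS (the steps are either missing or contradictory), so I think I need to switch to standard Ubuntu to get this done.

Installing cuda-drivers fails: stdarg.h

Here I hit an error (a kernel module build failure):

Kernel module build error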
..
Loading new nvidia/575.57.08 DKMS files...
Building for 6.6.51+rpt-rpi-2712, 6.6.51+rpt-rpi-v8, 6.12.25+rpt-rpi-2712 and 6.12.25+rpt-rpi-v8

Building initial module nvidia/575.57.08 for 6.6.51+rpt-rpi-2712
The kernel is built without module signing facility, modules won't be signed

Building module(s)...........(bad exit status: 2)
Failed command:
'make' -j4 KERNEL_UNAME=6.6.51+rpt-rpi-2712 IGNORE_CC_MISMATCH=1 SYSSRC=/lib/modules/6.6.51+rpt-rpi-2712/build LD=/usr/bin/ld.bfd
 CONFIG_X86_KERNEL_IBT= modules

Error! Bad return status for module build on kernel: 6.6.51+rpt-rpi-2712 (aarch64)
Consult /var/lib/dkms/nvidia/575.57.08/build/make.log for more information.
dpkg: error processing package nvidia-dkms-575 (--configure):
 installed nvidia-dkms-575 package post-installation script subprocess returned error exit status 10
dpkg: dependency problems prevent configuration of nvidia-driver-575:
 nvidia-driver-575 depends on nvidia-dkms-575 (= 575.57.08-0ubuntu1); however:
  Package nvidia-dkms-575 is not configured yet.

dpkg: error processing package nvidia-driver-575 (--configure):
 dependency problems - leaving unconfigured
Setting up libglx-mesa0:arm64 (24.2.8-1~bpo12+rpt3) ...
Setting up libglx0:arm64 (1.6.0-1) ...
dpkg: dependency problems prevent configuration of cuda-drivers-575:
 cuda-drivers-575 depends on nvidia-driver-575 (>= 575.57.08) | nvidia-driver-575-open (>= 575.57.08) | nvidia-driver-575-server
(>= 575.57.08) | nvidia-driver-575-server-open (>= 575.57.08); however:
  Package nvidia-driver-575 is not configured yet.
  Package nvidia-driver-575-open is not installed.
  Package nvidia-driver-575-server is not installed.
  Package nvidia-driver-575-server-open is not installed.

dpkg: error processing package cuda-drivers-575 (--configure):
 dependency problems - leaving unconfigured
Setting up libgl1:arm64 (1.6.0-1) ...
dpkg: dependency problems prevent configuration of cuda-drivers:
 cuda-drivers depends on cuda-drivers-575 (= 575.57.08-0ubuntu1); however:
  Package cuda-drivers-575 is not configured yet.

dpkg: error processing package cuda-drivers (--configure):
 dependency problems - leaving unconfigured
...

The error log /var/lib/dkms/nvidia/575.57.08/build/make.log shows that stdarg.h is missing:

Build error log
CONFTEST: ib_peer_memory_symbols
  CC [M]  /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-platform.o
  CC [M]  /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-dsi-parse-panel-props.o
  CC [M]  /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-bpmp.o
  CC [M]  /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-gpio.o
In file included from /var/lib/dkms/nvidia/575.57.08/build/common/inc/conftest.h:28,
                 from /var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:29,
                 from /var/lib/dkms/nvidia/575.57.08/build/common/inc/os-interface.h:40,
                 from /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-dsi-parse-panel-props.c:26:
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:77:2: error: #error dma_buf_export() conftest failed!
   77 | #error dma_buf_export() conftest failed!
      |  ^~~~~
In file included from /var/lib/dkms/nvidia/575.57.08/build/common/inc/conftest.h:28,
                 from /var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:29,
                 from /var/lib/dkms/nvidia/575.57.08/build/common/inc/os-interface.h:40,
                 from /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-bpmp.c:26:
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:77:2: error: #error dma_buf_export() conftest failed!
   77 | #error dma_buf_export() conftest failed!
      |  ^~~~~
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:94:2: error: #error radix_tree_replace_slot() conftest failed!
   94 | #error radix_tree_replace_slot() conftest failed!
      |  ^~~~~
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:94:2: error: #error radix_tree_replace_slot() conftest failed!
   94 | #error radix_tree_replace_slot() conftest failed!
      |  ^~~~~
/var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:33:14: fatal error: stdarg.h: No such file or directory
   33 |     #include <stdarg.h>
      |              ^~~~~~~~~~
/var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:33:14: fatal error: stdarg.h: No such file or directory
   33 |     #include <stdarg.h>
      |              ^~~~~~~~~~
compilation terminated.
compilation terminated.
make[3]: *** [/usr/src/linux-headers-6.6.51+rpt-common-rpi/scripts/Makefile.build:248: /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-dsi-parse-panel-props.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[3]: *** [/usr/src/linux-headers-6.6.51+rpt-common-rpi/scripts/Makefile.build:248: /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-bpmp.o] Error 1
In file included from /var/lib/dkms/nvidia/575.57.08/build/common/inc/conftest.h:28,
                 from /var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:29,
                 from /var/lib/dkms/nvidia/575.57.08/build/common/inc/os-interface.h:40,
                 from /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-gpio.c:26:
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:77:2: error: #error dma_buf_export() conftest failed!
   77 | #error dma_buf_export() conftest failed!
      |  ^~~~~
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:94:2: error: #error radix_tree_replace_slot() conftest failed!
   94 | #error radix_tree_replace_slot() conftest failed!
      |  ^~~~~
/var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:33:14: fatal error: stdarg.h: No such file or directory
   33 |     #include <stdarg.h>
      |              ^~~~~~~~~~
compilation terminated.
make[3]: *** [/usr/src/linux-headers-6.6.51+rpt-common-rpi/scripts/Makefile.build:248: /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-gpio.o] Error 1
In file included from /var/lib/dkms/nvidia/575.57.08/build/common/inc/conftest.h:28,
                 from /var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:29,
                 from /var/lib/dkms/nvidia/575.57.08/build/common/inc/nv.h:41,
                 from /var/lib/dkms/nvidia/575.57.08/build/common/inc/nv-linux.h:28,
                 from /var/lib/dkms/nvidia/575.57.08/build/common/inc/nv-platform.h:27,
                 from /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-platform.c:32:
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:77:2: error: #error dma_buf_export() conftest failed!
   77 | #error dma_buf_export() conftest failed!
      |  ^~~~~
/var/lib/dkms/nvidia/575.57.08/build/conftest/functions.h:94:2: error: #error radix_tree_replace_slot() conftest failed!
   94 | #error radix_tree_replace_slot() conftest failed!
      |  ^~~~~
/var/lib/dkms/nvidia/575.57.08/build/common/inc/nv_stdarg.h:33:14: fatal error: stdarg.h: No such file or directory
   33 |     #include <stdarg.h>
      |              ^~~~~~~~~~
compilation terminated.
make[3]: *** [/usr/src/linux-headers-6.6.51+rpt-common-rpi/scripts/Makefile.build:248: /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv-platform.o] Error 1
make[2]: *** [/usr/src/linux-headers-6.6.51+rpt-common-rpi/Makefile:1946: /var/lib/dkms/nvidia/575.57.08/build] Error 2
make[1]: *** [/usr/src/linux-headers-6.6.51+rpt-common-rpi/Makefile:246: __sub-make] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-6.6.51+rpt-rpi-2712'
make: *** [Makefile:140: modules] Error 2
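
The log repeats the same few errors once per object file. A minimal sketch for condensing such a log into distinct messages with counts, assuming gcc-style `error:` and `fatal error:` lines; `summarize_errors` is a hypothetical helper name.

```shell
#!/bin/sh
# Read a build log on stdin and print each distinct compiler error message
# with the number of times it occurs, most frequent first.
# File names and line numbers before "error:" are dropped so that the same
# message from different object files is counted together.
summarize_errors() {
    grep -oE '(fatal )?error: .*' | sort | uniq -c | sort -rn
}

# Usage on the host (not run automatically here):
#   summarize_errors < /var/lib/dkms/nvidia/575.57.08/build/make.log
```

Run against the log above, this should reduce the output to three distinct messages: the two conftest failures and the missing stdarg.h.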

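The message stdarg.h: No such file or directory appears to refer to the header bundled with gcc: /usr/lib/gcc/aarch64-linux-gnu/12/include/stdarg.h

The issue "#include <stdarg.h> missing in 418.113 #46" notes that the include has to be changed depending on the kernel version, since newer kernels provide their own linux/stdarg.h:

Patching the include in the failing code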
#if LINUX_VERSION_CODE < KERNEL_VERSION(5, 15, 0)
#include <stdarg.h>
#else
#include <linux/stdarg.h>
#endif

Check where stdarg.h is located:

Finding stdarg.h
find /usr -name stdarg.h 2>/dev/null

The Linux kernel headers do indeed provide it as linux/stdarg.h:

The search output shows linux/stdarg.h
/usr/include/c++/12/tr1/stdarg.h
/usr/lib/gcc/aarch64-linux-gnu/12/include/stdarg.h
/usr/src/linux-headers-6.12.34+rpt-common-rpi/include/linux/stdarg.h
/usr/src/linux-headers-6.12.25+rpt-common-rpi/include/linux/stdarg.h

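The issue "nvidia installer can't find stdarg.h #6" suggests creating, in the kernel headers include directory, a symlink named stdarg.h that points to linux/stdarg.h. This approach also seems reasonable:

Creating the stdarg.h symlink to linux/stdarg.h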
cd /usr/src/linux-headers-6.12.34+rpt-common-rpi/include/
ln -s linux/stdarg.h stdarg.h

cd /usr/src/linux-headers-6.12.25+rpt-common-rpi/include/
ln -s linux/stdarg.h stdarg.h
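
The same workaround can be wrapped so that it is safe to repeat and covers every installed header tree. This is a sketch only; `link_stdarg` is a hypothetical helper name, and the glob in the usage comment assumes the `linux-headers-*-common-rpi` naming seen in the find output above.

```shell
#!/bin/sh
# Create include/stdarg.h as a relative symlink to linux/stdarg.h inside
# one kernel header include directory. Does nothing if a stdarg.h is
# already present; fails if linux/stdarg.h is missing.
link_stdarg() {
    include_dir=$1
    [ -f "$include_dir/linux/stdarg.h" ] || return 1
    if [ -e "$include_dir/stdarg.h" ] || [ -L "$include_dir/stdarg.h" ]; then
        return 0
    fi
    ln -s linux/stdarg.h "$include_dir/stdarg.h"
}

# Usage on the host, as root (not run automatically here):
#   for d in /usr/src/linux-headers-*-common-rpi/include; do
#       link_stdarg "$d"
#   done
```

Note that a kernel header package upgrade would install a new tree without the symlink, so the loop would need to be run again.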

After reinstalling, the missing stdarg.h problem was resolved (although other errors remained).

cc1: some warnings being treated as errors

After fixing the missing stdarg.h, the build log showed a large number of errors, many of them reading:

cc1: some warnings being treated as errors

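Wondering whether warnings being treated as errors were what broke the build, I wanted to adjust the CFLAGS that make passes to the compiler. In Gentoo Linux there is a global /etc/make.conf for setting compile parameters, so how is this done on Debian?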

CFLAGS=" -Wno-error=..."

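The compiler has an option, which can be set through the makefile, to stop a specific warning from being treated as an error. In the build log above, many of the warnings are -Wmissing-prototypes, so that is the one I need to ignore.

The flag to add to CFLAGS is therefore -Wno-error=missing-prototypes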
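
When a warning is promoted to an error, gcc tags the message with `[-Werror=<name>]`, so the needed flags can be derived from the log instead of read off by hand. A minimal sketch under that assumption; `suggest_wno_error_flags` is a hypothetical helper name.

```shell
#!/bin/sh
# Read a gcc build log on stdin and print one -Wno-error=<name> flag for
# each distinct [-Werror=<name>] tag found in it.
suggest_wno_error_flags() {
    grep -oE '\[-Werror=[A-Za-z0-9_+=-]+\]' |
        sed -e 's/^\[-Werror=/-Wno-error=/' -e 's/\]$//' |
        sort -u
}

# Usage on the host (not run automatically here):
#   suggest_wno_error_flags < /var/lib/dkms/nvidia/575.57.08/build/make.log
```

This only demotes the listed warnings back to warnings; it does not address whatever the warnings are pointing at.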

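The usual way to change this is to edit the makefile in the project directory, for example:

Editing the makefile to add CFLAGS
# Suppressing all warnings with -w is only an example; it is not recommended in practice
override CFLAGS += -w

app: main.c
	gcc $(CFLAGS) -o app main.c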

dpkg-buildflags

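dpkg-buildflags returns the build flags used when compiling a package. On Debian the system-wide configuration is /etc/dpkg/buildflags.conf, and the per-user configuration is $XDG_CONFIG_HOME/dpkg/buildflags.conf ($XDG_CONFIG_HOME defaults to $HOME/.config, so the per-user file is ~/.config/dpkg/buildflags.conf).

However, my case is not a source package build; see "How to override dpkg-buildflags CFLAGS?". The dpkg-buildflags approach only applies when building a source package downloaded with apt-get source <pkg-name>.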
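
For the source-package case where it does apply, the per-user file takes one directive per line. The sketch below appends an `APPEND CFLAGS` directive to that file; `write_user_buildflags` is a hypothetical helper name, and this was not part of the attempt described here.

```shell
#!/bin/sh
# Append an "APPEND CFLAGS <flag>" directive to the per-user
# dpkg-buildflags configuration file, creating the directory if needed.
write_user_buildflags() {
    conf_dir="${XDG_CONFIG_HOME:-$HOME/.config}/dpkg"
    mkdir -p "$conf_dir"
    printf 'APPEND CFLAGS %s\n' "$1" >> "$conf_dir/buildflags.conf"
}

# Usage (not run automatically here):
#   write_user_buildflags -Wno-error=missing-prototypes
#   dpkg-buildflags --get CFLAGS    # shows the resulting CFLAGS value
```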

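Adjusting the Dynamic Kernel Module Support (DKMS) build parameters

Since this module is built by Dynamic Kernel Module Support (DKMS), I looked at "building with clang rather than gcc #124": dkms uses a dkms.conf file to control the build. I searched and found /usr/src/nvidia-575.57.08/dkms.conf, but I could not find any reference on how to modify it.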
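
In dkms.conf, the build command is set by the `MAKE` or `MAKE[n]` directive, so that line is the natural place to start looking. A minimal sketch that prints those lines; `dkms_make_lines` is a hypothetical helper name, and what the NVIDIA package puts on that line should be read from the file itself rather than assumed.

```shell
#!/bin/sh
# Print the MAKE / MAKE[n] directives from a dkms.conf file, i.e. the
# command lines DKMS runs to build the module.
dkms_make_lines() {
    grep -E '^[[:space:]]*MAKE(\[[0-9]+\])?=' "$1"
}

# Usage on the host (not run automatically here):
#   dkms_make_lines /usr/src/nvidia-575.57.08/dkms.conf
# After any change, the module can be rebuilt for one kernel with:
#   dkms build -m nvidia -v 575.57.08 -k "$(uname -r)"
```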

However, grepping the /usr/src/nvidia-575.57.08 source directory showed that a Kbuild file there contains a -Wno-error setting. The current configuration is:

CFLAGS currently set in Kbuild
...
NV_CONFTEST_CFLAGS = $(NV_CFLAGS_FROM_CONFTEST) $(ccflags-y) -fno-pie
NV_CONFTEST_CFLAGS += $(filter -std=%,$(KBUILD_CFLAGS))
NV_CONFTEST_CFLAGS += $(call cc-disable-warning,pointer-sign)
NV_CONFTEST_CFLAGS += $(call cc-option,-fshort-wchar,)
NV_CONFTEST_CFLAGS += $(call cc-option,-Werror=incompatible-pointer-types,)
NV_CONFTEST_CFLAGS += -Wno-error
...
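
The search that turned up this Kbuild fragment can be written as a small reusable command. A sketch only; `find_werror_flags` is a hypothetical helper name.

```shell
#!/bin/sh
# Recursively list every line under a source directory that sets a
# -Werror or -Wno-error flag, with file name and line number.
find_werror_flags() {
    # "--" stops grep from reading the pattern as an option.
    grep -rnE -- '-W(no-)?error' "$1"
}

# Usage on the host (not run automatically here):
#   find_werror_flags /usr/src/nvidia-575.57.08
```

Note that in the fragment above `-Wno-error` is appended to `NV_CONFTEST_CFLAGS`, which by its name appears to apply to the conftest compile steps rather than to the module build itself.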

References