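ZFS on Raspberry Pi 5 NVMe storage

Disk preparation for ZFS on Raspberry Pi 5 NVMe storage

My Raspberry Pi software-defined storage cluster uses three Raspberry Pi 5 boards, each fitted with a 2 TB KIOXIA EXCERIA G2 NVMe SSD. Following the cluster plan, each disk is partitioned as follows: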

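NVMe partition layout for the Raspberry Pi 5 simulated cluster

Partition  Mount            Size             Filesystem  Description
1          /boot/firmware   512M             fat32       EFI boot partition
2          /                59G              ext4        Operating system root partition
3          -                1024G            -           Dedicated Ceph BlueStore storage
4          /var/lib/docker  remaining space  zfs         zpool-data storage pool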

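  • Use fdisk to check the current partitions on the disk. At this point there are only the two partitions used by Raspberry Pi OS (Raspbian). I use fdisk rather than parted because fdisk reports capacity in MiB/GiB/TiB by default (1 k = 1024), whereas parted defaults to MB/GB/TB (1 k = 1000). This is purely a programmer's habit, a mild compulsion: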

Show the current partition table with fdisk -l
fdisk -l /dev/nvme0n1
The current partitions are:
Disk /dev/nvme0n1: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: KIOXIA-EXCERIA G2 SSD                   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x57a11afa

Device         Boot   Start       End   Sectors  Size Id Type
/dev/nvme0n1p1         8192   1056767   1048576  512M  c W95 FAT32 (LBA)
/dev/nvme0n1p2      1056768 124735487 123678720   59G 83 Linux
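
The gap between the two unit systems is easy to see on this very disk. The following is a minimal sketch that assumes only awk is available. The to_unit helper is a hypothetical name, and the byte count is the one fdisk reports above:

```shell
#!/bin/sh
# Convert a byte count into binary (TiB) and decimal (TB) units.
# DISK_BYTES is the size fdisk reports for /dev/nvme0n1.
DISK_BYTES=2000398934016

# to_unit is a hypothetical helper: $1 = bytes, $2 = bytes per unit.
to_unit() {
    awk -v b="$1" -v d="$2" 'BEGIN { printf "%.2f\n", b / d }'
}

TIB=$(to_unit "$DISK_BYTES" 1099511627776)   # 1024^4
TB=$(to_unit "$DISK_BYTES" 1000000000000)    # 1000^4

echo "binary:  ${TIB} TiB"
echo "decimal: ${TB} TB"
```

The binary figure comes out as 1.82 TiB, matching the fdisk header, while the decimal figure is the 2.00 TB printed on the SSD's label.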

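Handling Docker

My own setup needs an extra step here because Docker is already installed on Raspberry Pi OS (64-bit). I first back up and export the images, then stop Docker and remove the /var/lib/docker directory. This frees that directory to become the zpool mount point for the Docker ZFS storage driver.

Transferring Docker images without a Docker Registry, step 1: backup

I plan to deploy a registry on Kubernetes later, so the current Docker environment has no image registry. To make sure images and containers can be restored after switching to the Docker ZFS storage driver, I use the registry-free image transfer method:

Export the container images that need to be kept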
docker commit acloud-dev local:acloud-dev
docker save -o ~/acloud-dev.tar local:acloud-dev
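
Because the next steps delete /var/lib/docker, it may be worth a quick sanity check of the archive first. The sketch below is only an illustration: check_image_archive is a hypothetical helper, and the check relies on the manifest.json file that docker save places at the top level of the tar archive. It does not prove that every layer is intact.

```shell
#!/bin/sh
# check_image_archive is a hypothetical helper, not part of Docker.
# It confirms that the tar file exists, is non-empty, and lists a
# top-level manifest.json, as archives written by docker save do.
check_image_archive() {
    archive="$1"
    if [ ! -s "$archive" ]; then
        echo "missing or empty: $archive" >&2
        return 1
    fi
    if ! tar -tf "$archive" | grep -qx 'manifest.json'; then
        echo "no manifest.json in $archive" >&2
        return 1
    fi
    echo "ok: $archive"
}

# Usage (path as used in this document):
#   check_image_archive ~/acloud-dev.tar
```

Copying the archive to another machine before wiping the Docker directory is a further precaution worth considering.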

Clearing the Docker mount point

  • Stop Docker:

Stop the Docker service in preparation for changing the storage driver
sudo systemctl stop docker
sudo systemctl stop docker.socket
  • Back up /var/lib/docker, then remove the directory and everything in it:

Back up the /var/lib/docker directory
sudo cp -au /var/lib/docker /var/lib/docker.bk
sudo rm -rf /var/lib/docker

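Note

After switching to the Docker ZFS storage driver, the actual image data has to be backed up and restored with something like the registry-free image transfer method.

Disk partitioning

Warning

Let me stress this once more:

To save disk space, I build only one storage pool, zpool-data, in my Raspberry Pi software-defined storage cluster. It serves Docker, KVM and local data storage. This disk layout is based on my earlier practice in "Running ZFS on Gentoo (xcloud)".

Note

A bit of a compulsion: to carve out a partition of exactly 1 TiB, I use fdisk to handle the disk (for now I do not know how to allocate exactly 1024 GiB in parted).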
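
For what it is worth, parted does accept sector units (a trailing s) and binary suffixes such as MiB and GiB, so an exact size can be expressed there as well. The sketch below only does the arithmetic and prints a parted command as a dry run; it never touches the disk. The start sector is the first sector after nvme0n1p2 in the fdisk -l listing, and the printed command is an untested suggestion, not what was used in this walkthrough:

```shell
#!/bin/sh
# Compute the sector range of a 1024 GiB partition that begins
# immediately after nvme0n1p2 (which ends at sector 124735487).
SECTOR_SIZE=512
START=124735488
SIZE_GIB=1024

# 1024 GiB expressed in 512-byte sectors.
SECTORS=$((SIZE_GIB * 1024 * 1024 * 1024 / SECTOR_SIZE))
# Inclusive last sector of the partition.
END=$((START + SECTORS - 1))

echo "sectors: $SECTORS"
echo "end:     $END"

# Dry run only: print the command instead of executing it.
echo "parted /dev/nvme0n1 unit s mkpart primary ${START}s ${END}s"
```

The computed end sector, 2272219135, is the same one fdisk selects for +1024G in the session that follows.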

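fdisk partitioning: prepare separate partitions for Ceph and ZFS
# Prepare a 1 TiB partition for Ceph, designated for BlueStore
# I do not know how to work in MiB or GiB units (that is, 1 k = 1024) with parted,
# so the 1024G partition is created with fdisk; note that the Raspberry Pi uses an msdos partition table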

# fdisk /dev/nvme0n1                                                
Welcome to fdisk (util-linux 2.38.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

This disk is currently in use - repartitioning is probably a bad idea.
It's recommended to umount all file systems, and swapoff all swap partitions on this disk.

Command (m for help): p

Disk /dev/nvme0n1: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: KIOXIA-EXCERIA G2 SSD
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x57a11afa

Device         Boot     Start        End    Sectors   Size Id Type
/dev/nvme0n1p1           8192    1056767    1048576   512M  c W95 FAT32 (LBA)
/dev/nvme0n1p2        1056768  124735487  123678720    59G 83 Linux

Command (m for help): n
Partition type
   p   primary (2 primary, 0 extended, 2 free)
   e   extended (container for logical partitions)
Select (default p): p
Partition number (3,4, default 3): <press Enter to accept the default>
First sector (2048-3907029167, default 2048): 124735488
Last sector, +/-sectors or +/-size{K,M,G,T,P} (124735488-3907029167, default 3907029167): +1024G

Created a new partition 3 of type 'Linux' and of size 1 TiB.

Command (m for help): p
Disk /dev/nvme0n1: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: KIOXIA-EXCERIA G2 SSD                   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x57a11afa

Device         Boot     Start        End    Sectors  Size Id Type
/dev/nvme0n1p1           8192    1056767    1048576  512M  c W95 FAT32 (LBA)
/dev/nvme0n1p2        1056768  124735487  123678720   59G 83 Linux
/dev/nvme0n1p3      124735488 2272219135 2147483648    1T 83 Linux

Command (m for help): n
Partition type
   p   primary (3 primary, 0 extended, 1 free)
   e   extended (container for logical partitions)
Select (default e): p

Selected partition 4
First sector (2048-3907029167, default 2048): 2272219136
Last sector, +/-sectors or +/-size{K,M,G,T,P} (2272219136-3907029167, default 3907029167): <press Enter to accept the default>

Created a new partition 4 of type 'Linux' and of size 779.5 GiB.

Command (m for help): p

Disk /dev/nvme0n1: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: KIOXIA-EXCERIA G2 SSD                    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x57a11afa

Device         Boot      Start        End    Sectors   Size Id Type
/dev/nvme0n1p1            8192    1056767    1048576   512M  c W95 FAT32 (LBA)
/dev/nvme0n1p2         1056768  124735487  123678720    59G 83 Linux
/dev/nvme0n1p3       124735488 2272219135 2147483648     1T 83 Linux
/dev/nvme0n1p4      2272219136 3907029167 1634810032 779.5G 83 Linux

Command (m for help): w
The partition table has been altered.
Syncing disks.

Running fdisk -l /dev/nvme0n1 again now shows the two additional partitions:

fdisk shows the two new partitions, for Ceph and ZFS respectively
Disk /dev/nvme0n1: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: KIOXIA-EXCERIA G2 SSD                   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x57a11afa

Device         Boot      Start        End    Sectors   Size Id Type
/dev/nvme0n1p1            8192    1056767    1048576   512M  c W95 FAT32 (LBA)
/dev/nvme0n1p2         1056768  124735487  123678720    59G 83 Linux
/dev/nvme0n1p3       124735488 2272219135 2147483648     1T 83 Linux
/dev/nvme0n1p4      2272219136 3907029167 1634810032 779.5G 83 Linux

Check whether each partition is 4k-aligned
for i in {1..4};do parted /dev/nvme0n1 align-check opt $i;done

The output shows that every partition is aligned:

Alignment check output
1 aligned
2 aligned
3 aligned
4 aligned
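
The same result can be cross-checked by hand from the start sectors in the fdisk listing. This is a simplified sketch and not a reimplementation of parted's align-check: it only tests whether each start offset is a multiple of 1 MiB, which implies 4 KiB alignment. The check_alignment helper is a hypothetical name.

```shell
#!/bin/sh
# Start sectors copied from the fdisk listing for nvme0n1p1..p4.
STARTS="8192 1056768 124735488 2272219136"
SECTOR_SIZE=512
ALIGN_BYTES=$((1024 * 1024))   # 1 MiB, itself a multiple of 4 KiB

# check_alignment is a hypothetical helper.
# $1 = start sector; prints "aligned" or "misaligned".
check_alignment() {
    if [ $(( $1 * SECTOR_SIZE % ALIGN_BYTES )) -eq 0 ]; then
        echo "aligned"
    else
        echo "misaligned"
    fi
}

n=1
for s in $STARTS; do
    echo "$n $(check_alignment "$s")"
    n=$((n + 1))
done
```

All four start sectors pass. A legacy start such as sector 63 would be reported as misaligned.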

Building the ZFS storage

  • Creating the ZFS storage pool and its mount point is very simple:

Create the zpool-data storage pool and mount it
zpool create -f zpool-data -m /var/lib/docker /dev/nvme0n1p4
zfs set compression=lz4 zpool-data
  • When that is done, check df -h:

Check the ZFS mount
Filesystem      Size  Used Avail Use% Mounted on
udev            3.8G     0  3.8G   0% /dev
tmpfs           806M  5.4M  800M   1% /run
/dev/nvme0n1p2   59G   18G   38G  32% /
tmpfs           4.0G     0  4.0G   0% /dev/shm
tmpfs           5.0M   48K  5.0M   1% /run/lock
/dev/nvme0n1p1  510M   65M  446M  13% /boot/firmware
tmpfs           806M     0  806M   0% /run/user/1000
zpool-data      752G  128K  752G   1% /var/lib/docker

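Configuring the Docker ZFS storage driver

  • Edit /etc/docker/daemon.json and add the zfs setting (if the file does not exist, create it with the following content):

Add the ZFS storage driver setting to /etc/docker/daemon.json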
{
  "storage-driver": "zfs"
}
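
A malformed daemon.json will keep the Docker daemon from starting, so a quick check before the restart can save a round trip. The sketch below is one possible approach and assumes python3 is installed (as it is on a stock Raspberry Pi OS). validate_daemon_json is a hypothetical helper name.

```shell
#!/bin/sh
# validate_daemon_json is a hypothetical helper, not a Docker command.
# It checks that the file parses as JSON and that the storage driver
# is set to zfs.
validate_daemon_json() {
    # python3 -m json.tool exits non-zero when the input is not valid JSON.
    if ! python3 -m json.tool "$1" >/dev/null 2>&1; then
        echo "invalid JSON: $1" >&2
        return 1
    fi
    if ! grep -Eq '"storage-driver"[[:space:]]*:[[:space:]]*"zfs"' "$1"; then
        echo "storage-driver is not zfs in $1" >&2
        return 1
    fi
    echo "ok: $1"
}

# Usage:
#   validate_daemon_json /etc/docker/daemon.json
```

The grep pattern is deliberately simple and only looks for the key and value as text; it is not a full JSON query.
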
  • Start Docker and check its configuration:

Start Docker and check docker info
sudo systemctl start docker
sudo docker info

docker info prints the following:

docker info output
Client: Docker Engine - Community
 Version:    27.3.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.17.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.7
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 27.3.1
 Storage Driver: zfs
  Zpool: zpool-data
  Zpool Health: ONLINE
  Parent Dataset: zpool-data
  Space Used By Parent: 146944
  Space Available: 807319486976
  Parent Quota: no
  Compression: lz4
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc version: v1.1.14-0-g2c9f560
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.6.51+rpt-rpi-2712
 Operating System: Debian GNU/Linux 12 (bookworm)
 OSType: linux
 Architecture: aarch64
 CPUs: 4
 Total Memory: 7.864GiB
 Name: acloud-w1
 ID: 46d21b0f-cbc9-48f5-88a6-464682c06107
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http://127.0.0.1:3128
 HTTPS Proxy: http://127.0.0.1:3128
 No Proxy: *.baidu.com,192.168.0.0/16,10.0.0.0/8,
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No memory limit support
WARNING: No swap limit support
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

To resolve the docker info warnings, see "Docker post-installation adjustments: quick start".
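
The Space Available value in the docker info output is given in bytes. Converting it to GiB makes it easy to compare with df -h and with the partition size from fdisk. This is a minimal sketch using awk; to_gib is a hypothetical helper, and the two input numbers are taken from the outputs shown earlier:

```shell
#!/bin/sh
# Values copied from this document's outputs.
SPACE_AVAILABLE=807319486976   # "Space Available" from docker info, bytes
PART_SECTORS=1634810032        # sector count of nvme0n1p4 from fdisk
SECTOR_SIZE=512

# to_gib is a hypothetical helper: bytes -> GiB with one decimal place.
to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

PART_BYTES=$((PART_SECTORS * SECTOR_SIZE))

echo "pool available: $(to_gib "$SPACE_AVAILABLE") GiB"
echo "partition size: $(to_gib "$PART_BYTES") GiB"
```

The pool reports about 751.9 GiB available, in line with the 752G that df -h shows, against a 779.5 GiB partition. The difference is presumably space that ZFS holds back for its own use.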

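Transferring Docker images without a Docker Registry, step 2: restore

  • Copy the backed-up image to the host where it is to be restored, then load it:

Load the saved container image
# Copy acloud-dev.tar to the host being restored, then load it
docker load -i ~/acloud-dev.tar
  • Bring the container back up:

Run the ARM Debian image that contains the development environment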
docker run -dt --name acloud-dev --hostname acloud-dev \
    -p 1122:22 \
    -p 13000:3000 \
    -p 18080:8080 \
    -p 14000:4000 \
    -p 1180:80 \
    -p 1443:443 \
    -v /home/admin/secrets:/home/admin/.ssh \
    -v /home/admin/docs:/home/admin/docs \
    local:acloud-dev

# To inject environment variables at run time, add options like the following (a proxy example)
#    -e HTTP_PROXY=http://172.17.0.1:3128 \
#    -e HTTPS_PROXY=http://172.17.0.1:3128 \
#    -e NO_PROXY=localhost,127.0.0.1,*.baidu.com,192.168.0.0/16,10.0.0.0/8 \