.. _priv_deploy_etcd_cluster_with_tls_auth:

=================================================================
Deploying a TLS-authenticated etcd cluster on the private cloud
=================================================================

When deploying the :ref:`k3s` cluster I followed :ref:`deploy_etcd_cluster_with_tls_auth`.
Building on that work, this article records the method I used when deploying :ref:`priv_etcd`.

Environment
=============

The servers are still the three virtual machines used in :ref:`priv_etcd_tls`, built with :ref:`priv_kvm`:

.. csv-table:: Private cloud KVM virtual machines
   :file: priv_etcd_tls/hosts.csv
   :widths: 40, 60
   :header-rows: 1

TLS certificates
==================

The TLS certificates are generated with the ``cfssl`` tool; see :ref:`priv_etcd_tls` for the complete steps. The following files are produced:

- CA::

     ca-key.pem  ca.csr  ca.pem

- Server certificate::

     server-key.pem  server.csr  server.pem

- Peer certificates::

     z-b-data-1-key.pem  z-b-data-1.csr  z-b-data-1.pem
     z-b-data-2.csr  z-b-data-2.json  z-b-data-2.pem
     z-b-data-3.csr  z-b-data-3.json  z-b-data-3.pem

- Client certificate::

     client-key.pem  client.csr  client.pem

Installing the packages
=========================

Use the installation script from :ref:`install_run_local_etcd` to download the latest release package (currently ``3.5.4``):

.. literalinclude:: install_run_local_etcd/install_etcd.sh
   :language: bash
   :caption: install_etcd.sh: download the Linux release of etcd

- On each node, create the etcd directory as well as the etcd user and group (skip the directory creation if you use the ``lv-etcd`` volume built in :ref:`priv_lvm`):

.. literalinclude:: deploy_etcd_cluster/useradd_etcd
   :language: bash
   :caption: Add the etcd user account with useradd

Certificate distribution
==========================

- To make ssh/scp administration convenient, first combine :ref:`ssh-agent_profile` from :ref:`ssh_key` with :ref:`ssh_multiplexing`, so that ssh/scp to the cluster servers works without entering a password
- Configure :ref:`dnsmasq_domains_for_subnets` of :ref:`edge_cloud_infra` to provide correct name resolution, so that the later ``etcd`` configuration resolves hostnames correctly
- After generating the certificates above with the :ref:`etcd_tls` method, distribute them with the following script:

.. literalinclude:: priv_deploy_etcd_cluster_with_tls_auth/deploy_etcd_certificates.sh
   :language: bash
   :caption: Certificate distribution script deploy_etcd_certificates.sh

Run the script::

   sh deploy_etcd_certificates.sh

Each ``etcd`` host now holds the files that belong to it under its ``/etc/etcd`` directory.

Configuration and startup scripts
===================================

:ref:`systemd` script to start etcd
----------------------------------------

- The required environment variables can be captured with the following commands; they are used again later when the configuration file is generated:

.. literalinclude:: priv_deploy_etcd_cluster_with_tls_auth/etcd_env
   :language: bash
   :caption: etcd environment variables

.. note::

   Many deployment walkthroughs found online, and many production deployments, do not use a configuration file at all and instead tune ``etcd`` entirely through command-line flags. According to :ref:`etcd_config_rule`, the configuration file has the highest priority, so here every tuning parameter lives in the configuration file and no command-line flags are used.

.. note::

   The ``etcd`` configuration contains a few per-host variables. In :ref:`deploy_etcd_cluster_with_tls_auth` I used placeholders in a template and replaced them with ``sed``. That works, but it is not elegant. A cleaner and simpler approach is the shell :ref:`here_document` feature, which substitutes the variables automatically from environment variables inside the script; that is the method used in this article.
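The idea can be illustrated with a minimal sketch, assuming hypothetical variable names and only a fragment of the configuration (the actual script used here is included in the next step):

.. code-block:: bash
   :caption: Here-document sketch (illustrative only; variable names are assumptions)

   # Per-host values; these names are hypothetical and only demonstrate the
   # technique, they are not taken from the real generate script below.
   ETCD_NAME=$(hostname -s)                    # e.g. z-b-data-1
   ETCD_IP=$(hostname -I | awk '{print $1}')   # e.g. 192.168.6.204

   # The here document expands ${ETCD_NAME} and ${ETCD_IP} in place,
   # so no sed post-processing of a template file is needed.
   cat <<EOF | sudo tee /etc/etcd/conf.yml > /dev/null
   # Human-readable name for this member.
   name: ${ETCD_NAME}
   # List of comma separated URLs to listen on for peer traffic.
   listen-peer-urls: https://${ETCD_IP}:2380
   # List of comma separated URLs to listen on for client traffic.
   listen-client-urls: https://${ETCD_IP}:2379,https://127.0.0.1:2379
   EOF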
- Run the ``generate_etcd_service`` script to generate the ``/etc/etcd/conf.yml`` configuration file and the :ref:`systemd` unit file ``/lib/systemd/system/etcd.service`` that starts ``etcd``:

.. literalinclude:: priv_deploy_etcd_cluster_with_tls_auth/generate_etc_config_systemd
   :language: bash
   :caption: Create the etcd configuration conf.yml and the systemd unit

.. note::

   The ``conf.yml`` configuration file is the same as in the earlier :ref:`deploy_etcd_cluster_with_tls_auth` practice; the only difference is that :ref:`here_document` performs the variable substitution, so no manual editing of the configuration is needed here.

.. note::

   In ``conf.yml``, the URLs that etcd binds to must use the host's IP address; a domain name cannot be used::

      # List of comma separated URLs to listen on for peer traffic.
      listen-peer-urls: https://192.168.6.204:2380
      # List of comma separated URLs to listen on for client traffic.
      listen-client-urls: https://192.168.6.204:2379,https://127.0.0.1:2379

   If a domain name such as ``z-b-data-1.staging.huatai.me`` is used (even when it resolves correctly), etcd still fails to start::

      {"level":"warn","ts":1649611005.237307,"caller":"etcdmain/etcd.go:74","msg":"failed to verify flags","error":"expected IP in URL for binding (http://z-b-data-1.staging.huatai.me:2380)"}

- Enable the service::

   sudo systemctl enable etcd.service

- Start the service::

   sudo systemctl start etcd.service

Troubleshooting
=================

- Check why startup failed::

   sudo systemctl status etcd.service
   sudo journalctl -xe

Local name not found in the initial cluster configuration
------------------------------------------------------------

Startup log::

   Jul 03 00:41:30 z-b-data-1 etcd[13044]: {"level":"fatal","ts":"2022-07-03T00:41:30.498+0800","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"couldn't find local name \"z-b-data-1\" in the initial cluster configuration","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/re>
   Jul 03 00:41:30 z-b-data-1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE"}

After carefully checking the configuration (prompted by `couldn't find local name "" in the initial cluster configuration when start etcd service `_), I found this line in the configuration file::

   # Initial cluster configuration for bootstrapping.
   initial-cluster: NODE1=https://192.168.6.204:2380,NODE2=https://192.168.6.205:2380,NODE3=https://192.168.6.206:2380

This line is wrong and must be corrected to::

   initial-cluster: z-b-data-1=https://192.168.6.204:2380,z-b-data-2=https://192.168.6.205:2380,z-b-data-3=https://192.168.6.206:2380

so that it matches the member name declared at the top of the configuration file::

   # Human-readable name for this member.
   name: z-b-data-1

In other words, ``initial-cluster`` must be told which server ``z-b-data-1`` corresponds to, which here is ``https://192.168.6.204:2380``.

unmarshaling JSON
--------------------

The startup log reports::

   Jul 03 20:54:56 z-b-data-1 etcd[266420]: {"level":"warn","ts":"2022-07-03T20:54:56.212+0800","caller":"etcdmain/etcd.go:75","msg":"failed to verify flags","error":"error unmarshaling JSON: while decoding JSON: json: cannot unmarshal string into Go struct field configYAML.log-outputs of type []string"}
   Jul 03 20:54:56 z-b-data-1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
   -- Subject: Unit process exited

The error ``error unmarshaling JSON: while decoding JSON`` appears for many kinds of YAML configuration mistakes, for example `Concourse get bitbucket resource error unmarshaling JSON: while decoding JSON `_.

In my case, however, checking the file showed that I had configured::

   # Specify 'stdout' or 'stderr' to skip journald logging even when running under systemd.
   #log-outputs: [stderr]
   log-outputs: /var/log/etcd/etcd.log

which is wrong; it has to be restored to::

   # Specify 'stdout' or 'stderr' to skip journald logging even when running under systemd.
   log-outputs: [stderr]

.. note::

   When ``etcd`` is managed by :ref:`systemd`, the logs can be inspected with :ref:`journalctl`.

Checks
===========

- After starting ``etcd``, check the service process::

     ps aux | grep etcd

  You should see::

     etcd 8556 2.1 0.2 11214264 39296 ? Ssl 22:02 0:02 /usr/local/bin/etcd --config-file=/etc/etcd/conf.yml

- Check the logs::

     journalctl -u etcd.service

Verifying the etcd cluster
============================

- For easier maintenance, configure the ``etcdctl`` environment variables and add them to your own profile:

.. literalinclude:: priv_deploy_etcd_cluster_with_tls_auth/etcdctl_env
   :language: bash
   :caption: Environment variables used by etcdctl
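For reference, a minimal sketch of such a profile snippet follows; the certificate paths are an assumption based on the distribution step above, and the ``etcdctl_env`` file included here is what this deployment actually uses:

.. code-block:: bash
   :caption: Sketch of etcdctl v3 client environment variables (paths are assumptions)

   # etcdctl v3 reads these variables in place of the corresponding flags.
   export ETCDCTL_API=3
   # Client TLS material distributed to /etc/etcd in the earlier step.
   export ETCDCTL_CACERT=/etc/etcd/ca.pem
   export ETCDCTL_CERT=/etc/etcd/client.pem
   export ETCDCTL_KEY=/etc/etcd/client-key.pem
   # DNS round-robin name that resolves to the three etcd servers.
   export ETCDCTL_ENDPOINTS=https://etcd.staging.huatai.me:2379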
Then the cluster can be checked:

.. literalinclude:: deploy_etcd_cluster_with_tls_auth/etcdctl_member_list
   :language: bash
   :caption: etcdctl: list the cluster members (member list)

The output looks like::

   64e2be2269f59c43, started, z-b-data-3, https://192.168.6.206:2380, https://192.168.6.206:2379, false
   73d6903628b74671, started, z-b-data-1, https://192.168.6.204:2380, https://192.168.6.204:2379, false
   cbea9b1cda087dbf, started, z-b-data-2, https://192.168.6.205:2380, https://192.168.6.205:2379, false

For easier reading, use the table output mode:

.. literalinclude:: deploy_etcd_cluster_with_tls_auth/etcdctl_endpoint_status
   :language: bash
   :caption: etcdctl: check endpoint status (table output)

The output shows::

   +-------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
   |               ENDPOINT              |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
   +-------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
   | https://etcd.staging.huatai.me:2379 | 73d6903628b74671 |   3.5.4 |   20 kB |      true |      false |         2 |         22 |                 22 |        |
   +-------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Note that the ``ID`` shown rotates on every refresh, cycling through the three actual etcd nodes. This is not a convenient way to watch the status of all three nodes at the same time.

Check the health status:

.. literalinclude:: deploy_etcd_cluster_with_tls_auth/etcdctl_endpoint_health
   :language: bash
   :caption: etcdctl: check endpoint health (node responsiveness)

The output shows::

   https://etcd.staging.huatai.me:2379 is healthy: successfully committed proposal: took = 12.150298ms

Adjusting ``ETCDCTL_ENDPOINTS``
-----------------------------------

Did you notice that the ``endpoint status`` output above only shows the status behind the DNS name, even though ``etcd.staging.huatai.me`` actually maps to the IPs of all three servers (DNS round-robin load balancing)? How, then, can the status of every node be displayed?

The key is the ``ETCDCTL_ENDPOINTS`` environment variable: change ``https://etcd.staging.huatai.me:2379`` to the actual server nodes:

.. literalinclude:: priv_deploy_etcd_cluster_with_tls_auth/etcdctl_endpoint_env
   :language: bash
   :caption: ETCDCTL_ENDPOINTS environment variable
   :emphasize-lines: 2,3

Now every node of the etcd cluster can be checked:

- Check node status:

.. literalinclude:: deploy_etcd_cluster_with_tls_auth/etcdctl_endpoint_status
   :language: bash
   :caption: etcdctl: check endpoint status (table output)

Detailed status for all three nodes is now visible::

   +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
   |          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
   +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
   | https://192.168.6.204:2379 | 73d6903628b74671 |   3.5.4 |   20 kB |      true |      false |         2 |         57 |                 57 |        |
   | https://192.168.6.205:2379 | cbea9b1cda087dbf |   3.5.4 |   20 kB |     false |      false |         2 |         57 |                 57 |        |
   | https://192.168.6.206:2379 | 64e2be2269f59c43 |   3.5.4 |   20 kB |     false |      false |         2 |         57 |                 57 |        |
   +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

- Check node health:

.. literalinclude:: deploy_etcd_cluster_with_tls_auth/etcdctl_endpoint_health
   :language: bash
   :caption: etcdctl: check endpoint health (node responsiveness)

The health of all three nodes is now visible as well::

   +----------------------------+--------+-------------+-------+
   |          ENDPOINT          | HEALTH |    TOOK     | ERROR |
   +----------------------------+--------+-------------+-------+
   | https://192.168.6.204:2379 |   true | 10.114539ms |       |
   | https://192.168.6.205:2379 |   true | 10.327062ms |       |
   | https://192.168.6.206:2379 |   true | 10.631616ms |       |
   +----------------------------+--------+-------------+-------+
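Beyond status and health, a quick write/read/delete round trip confirms that the cluster actually serves requests; the key name below is arbitrary, and the commands rely on the ``ETCDCTL_*`` variables exported above:

.. code-block:: bash
   :caption: Sanity check with an arbitrary test key

   etcdctl put /sanity-check "hello-etcd"   # should print: OK
   etcdctl get /sanity-check                # prints the key and its value
   etcdctl del /sanity-check                # prints the number of deleted keys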
- (Important step) Now that ``etcd`` is fully deployed, the cluster state previously configured in ``/etc/etcd/conf.yml`` needs to be changed from ``new`` to ``existing``, indicating that the cluster has already been established::

     # Initial cluster state ('new' or 'existing').
     initial-cluster-state: 'existing'

  On subsequent system reboots, etcd will then start as a member of the already-established cluster and will not attempt to initialize it again.

References
============

- `etcd Clustering Guide `_
- `Setting up Etcd Cluster with TLS Authentication Enabled `_ : a very detailed guide to generating the etcd server certificate and signing client certificates with the cfssl tool
- `Deploy a secure etcd cluster `_
- `How To Setup a etcd Cluster On Linux – Beginners Guide `_ : provides a script that generates the :ref:`systemd` configuration
- `How to check Cluster status `_