Inspecting NVIDIA NVLink with the `nvidia-smi` tool

NVIDIA NVLink is a GPU interconnect interface (protocol) developed by NVIDIA and used in high-end data center GPUs.

Show the built-in help with nvidia-smi nvlink -h:

nvidia-smi nvlink -h prints basic help, giving a quick overview of the available options

    nvlink -- Display NvLink information.

    Usage: nvidia-smi nvlink [options]

    Options include:
    [-h | --help]: Display help information
    [-i | --id]: Enumeration index, PCI bus ID or UUID.

    [-l | --link]: Limit a command to a specific link.  Without this flag, all link information is displayed.
    [-s | --status]: Display link state (active/inactive).
    [-c | --capabilities]: Display link capabilities.
    [-p | --pcibusid]: Display remote node PCI bus ID for a link.
    [-R | --remotelinkinfo]: Display remote device PCI bus ID and NvLink ID for a link.
    [-sc | --setcontrol]: Setting counter control is deprecated!
    [-gc | --getcontrol]: Getting counter control is deprecated!
    [-g | --getcounters]: Getting counters using option -g is deprecated.
Please use option -gt/--getthroughput instead.
    [-r | --resetcounters]: Resetting counters is deprecated!
    [-e | --errorcounters]: Display error counters for a link.
    [-ec | --crcerrorcounters]: Display per-lane CRC error counters for a link.
    [-re | --reseterrorcounters]: Reset all error counters to zero.
    [-gt | --getthroughput]: Display link throughput counters for specified counter type
       The arguments consist of character string representing the type of traffic counted:
          d: Display tx and rx data payload in KiB
          r: Display tx and rx data payload and protocol overhead in KiB if supported

Check the NVLink status of GPU 0 (a server usually has several GPUs installed):

Check the NVLink status of GPU 0

nvidia-smi nvlink -s -i 0

# or use the long option
nvidia-smi nvlink --status -i 0

Example output of the NVLink status check for GPU 0

GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-c4fe8563-32db-1de5-ffb5-cab9c0cd8a05)
	 Link 0: 25 GB/s
	 Link 1: 25 GB/s
	 Link 2: 25 GB/s
	 Link 3: 25 GB/s
	 Link 4: 25 GB/s
	 Link 5: 25 GB/s
	 Link 6: 25 GB/s
	 Link 7: 25 GB/s
	 Link 8: 25 GB/s
	 Link 9: 25 GB/s
	 Link 10: 25 GB/s
	 Link 11: 25 GB/s
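
The per-link lines above lend themselves to a quick health check. The following is a minimal sketch, not part of `nvidia-smi` itself: a hypothetical shell helper, `nvlink_active_links`, reads the `nvidia-smi nvlink -s` output on standard input and counts, per GPU, how many links report a speed. It assumes the output layout shown above and treats any link line without a `<number> GB/s` value (for example an inactive link) as not active.

```shell
# Hypothetical helper (an illustration, not an nvidia-smi feature):
# summarize "nvidia-smi nvlink -s" output read from stdin.
nvlink_active_links() {
    awk '
        # "GPU 0: NVIDIA A100-SXM4-80GB (UUID: ...)" starts a new GPU section
        $1 == "GPU" { gpu = $2; sub(/:$/, "", gpu); next }
        # "Link 0: 25 GB/s" describes one link of the current GPU
        $1 == "Link" {
            total[gpu]++
            # A link that reports "<number> GB/s" is treated as active
            if ($3 ~ /^[0-9.]+$/ && $4 == "GB/s") active[gpu]++
        }
        # Output order across GPUs is unspecified
        END { for (g in total) printf "gpu=%s active=%d total=%d\n", g, active[g], total[g] }
    '
}
```

Usage would look like `nvidia-smi nvlink -s -i 0 | nvlink_active_links`; for the sample output above this prints `gpu=0 active=12 total=12`.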

Check the NVLink capabilities of GPU 0:

Check the NVLink capabilities of GPU 0

nvidia-smi nvlink -c -i 0

# or use the long option
nvidia-smi nvlink --capabilities -i 0

Example output of the NVLink capabilities check for GPU 0

GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-c4fe8563-32db-1de5-ffb5-cab9c0cd8a05)
	 Link 0, P2P is supported: true
	 Link 0, Access to system memory supported: true
	 Link 0, P2P atomics supported: true
	 Link 0, System memory atomics supported: true
	 Link 0, SLI is supported: true
	 Link 0, Link is supported: false
	 Link 1, P2P is supported: true
	 Link 1, Access to system memory supported: true
	 Link 1, P2P atomics supported: true
	 Link 1, System memory atomics supported: true
	 Link 1, SLI is supported: true
	 Link 1, Link is supported: false
	 ...
	 Link 11, P2P is supported: true
	 Link 11, Access to system memory supported: true
	 Link 11, P2P atomics supported: true
	 Link 11, System memory atomics supported: true
	 Link 11, SLI is supported: true
	 Link 11, Link is supported: false
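
With twelve links and six capabilities each, the listing above is long. As a rough sketch under the assumption that every capability line has the form `Link N, <capability>: <true|false>`, the hypothetical helper `nvlink_caps_summary` below condenses the output into one line per capability/value pair with the number of links reporting it.

```shell
# Hypothetical helper (an illustration, not an nvidia-smi feature):
# condense "nvidia-smi nvlink -c" output read from stdin.
nvlink_caps_summary() {
    awk -F': ' '
        # Match lines such as "Link 0, P2P is supported: true"
        /^[ \t]*Link [0-9]+, / {
            cap = $1
            sub(/^[ \t]*Link [0-9]+, /, "", cap)   # drop the "Link N, " prefix
            count[cap "=" $2]++
        }
        END { for (k in count) print k, count[k] }
    ' | LC_ALL=C sort
}
```

Usage would look like `nvidia-smi nvlink -c -i 0 | nvlink_caps_summary`. A capability that differs between links shows up as two lines (one for `true`, one for `false`), which makes outliers easy to spot.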

Key command: check the NVLink data transfer counters of GPU 0 (these can feed Prometheus monitoring of NVIDIA NVLink):

Check the NVLink data transfer of GPU 0

nvidia-smi nvlink -gt d -i 0

# or use the long option
nvidia-smi nvlink --getthroughput d -i 0

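Note

nvlink --getthroughput takes one of two counter-type arguments:

d: the data payload actually transferred (KiB), that is, the real data volume with the protocol overhead stripped
r: the total volume transferred, including both data payload and protocol overhead (KiB), if supported

Example output of the NVLink data transfer check for GPU 0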

GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-c4fe8563-32db-1de5-ffb5-cab9c0cd8a05)
	 Link 0: Data Tx: 435831587298 KiB
	 Link 0: Data Rx: 309569188699 KiB
	 Link 1: Data Tx: 435821606019 KiB
	 Link 1: Data Rx: 309581969078 KiB
	 ...
	 Link 11: Data Tx: 435989409595 KiB
	 Link 11: Data Rx: 311512294871 KiB
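
Since these counters are meant to feed Prometheus, here is a minimal sketch of how the output above could be turned into the Prometheus text exposition format. The helper name `nvlink_throughput_to_prom` and the metric names `nvlink_data_tx_kib_total` and `nvlink_data_rx_kib_total` are assumptions made for illustration, not an established exporter convention. The values are passed through unchanged in KiB, and the sketch assumes the line layout shown above.

```shell
# Hypothetical helper (an illustration, not an nvidia-smi feature):
# convert "nvidia-smi nvlink -gt d" output read from stdin into
# Prometheus text-format samples. Metric names are made up for this sketch.
nvlink_throughput_to_prom() {
    awk '
        # "GPU 0: NVIDIA A100-SXM4-80GB (UUID: ...)" starts a new GPU section
        $1 == "GPU" { gpu = $2; sub(/:$/, "", gpu); next }
        # "Link 0: Data Tx: 435831587298 KiB" carries one counter value
        $1 == "Link" && $3 == "Data" && $6 == "KiB" {
            link = $2; sub(/:$/, "", link)
            dir = tolower($4); sub(/:$/, "", dir)   # "Tx:" -> "tx", "Rx:" -> "rx"
            # Print the value as a string to avoid any numeric rounding
            printf "nvlink_data_%s_kib_total{gpu=\"%s\",link=\"%s\"} %s\n", dir, gpu, link, $5
        }
    '
}
```

Usage would look like `nvidia-smi nvlink -gt d -i 0 | nvlink_throughput_to_prom`. The first sample line above would become `nvlink_data_tx_kib_total{gpu="0",link="0"} 435831587298`.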
