Disabling NUMA to Speed Up LLaMA Inference
After installing numactl, check the system's current NUMA topology:

```shell
# Check the current NUMA state of the system
numactl --hardware
```
The output shows that there are two NUMA nodes by default:

```
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 386838 MB
node 0 free: 386209 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 387068 MB
node 1 free: 386552 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
```

Each NUMA node is assigned half of the system memory.
Since inference has to touch the model's entire memory footprint, NUMA hurts performance whenever memory accesses cross nodes. A discussion on X about locally deploying DeepSeek-R1 on 768 GB of RAM with dual AMD EPYC processors called out an important tip: "Go into the BIOS and set the number of NUMA groups to 0. This will ensure that every layer of the model is interleaved across all RAM chips, doubling our throughput. Don't forget!"
Accordingly, set NUMA=interleave in the BIOS, then reboot and check the NUMA topology again:

```shell
# Check the current NUMA state of the system
numactl --hardware
```
Now there is only a single NUMA node, i.e. all processors access the same, full memory:

```
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 773522 MB
node 0 free: 773123 MB
node distances:
node   0
  0:  10
```
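The BIOS route above is what this post uses. As a side note, if the BIOS does not expose such a setting, a software-level approximation (my assumption, not something measured in this post) is to ask numactl to interleave the process's allocations across all nodes:

```shell
# Hypothetical alternative to the BIOS setting: interleave this process's
# memory allocations across every NUMA node. Model path and flags match the
# llama-server invocation used in this post.
numactl --interleave=all ./llama.cpp/build/bin/llama-server \
    --model unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0/DeepSeek-R1.Q8_0-00001-of-00015.gguf \
    --cache-type-k q8_0 \
    --port 8081
```

llama.cpp itself also offers a `--numa` option (e.g. `--numa distribute`) for similar situations; whether either matches the BIOS-level interleave in throughput would need measuring.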
Run llama-server:

```shell
./llama.cpp/build/bin/llama-server \
    --model unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0/DeepSeek-R1.Q8_0-00001-of-00015.gguf \
    --cache-type-k q8_0 \
    --port 8081
```
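Once the server is up, it can be exercised through llama.cpp's built-in HTTP API. A minimal smoke test, assuming the `/completion` endpoint and the port chosen above:

```shell
# Send a small completion request to the local llama-server instance.
curl -s http://127.0.0.1:8081/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "Hello", "n_predict": 16}'
```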
Results: with NUMA disabled, the first run reached 1.063 tokens/s:

```
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 36
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 36, n_tokens = 36, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 36, n_tokens = 36
slot release: id 0 | task 0 | stop processing: n_past = 1250, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time =   10574.93 ms /   36 tokens (  293.75 ms per token,  3.40 tokens per second)
       eval time = 1166169.75 ms / 1215 tokens (  959.81 ms per token,  1.04 tokens per second)
      total time = 1176744.69 ms / 1251 tokens
```
The second run reached 1.039 tokens/s:

```
slot launch_slot_: id 0 | task 1216 | processing task
slot update_slots: id 0 | task 1216 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 27
slot update_slots: id 0 | task 1216 | kv cache rm [1, end)
slot update_slots: id 0 | task 1216 | prompt processing progress, n_past = 27, n_tokens = 26, progress = 0.962963
slot update_slots: id 0 | task 1216 | prompt done, n_past = 27, n_tokens = 26
slot release: id 0 | task 1216 | stop processing: n_past = 1477, truncated = 0
slot print_timing: id 0 | task 1216 |
prompt eval time =    7400.93 ms /   26 tokens (  284.65 ms per token,  3.51 tokens per second)
       eval time = 1414387.96 ms / 1451 tokens (  974.77 ms per token,  1.03 tokens per second)
      total time = 1421788.89 ms / 1477 tokens
```
Inference speed thus reaches 1.063 tokens/s, a 61% improvement over the 0.66 tokens/s measured before disabling NUMA. That is a clear gain, even though roughly one token per second is still of limited practical use.
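As a sanity check on the figures: 1.063 tokens/s is total tokens over total wall time from the first run's log, and the 61% claim follows from comparing it to the earlier 0.66 tokens/s:

```shell
# Recompute throughput and speedup from the figures in the first run's log.
awk 'BEGIN {
    total_tokens = 1251        # from "total time = 1176744.69 ms / 1251 tokens"
    total_ms     = 1176744.69
    tps = total_tokens / (total_ms / 1000)
    printf "throughput: %.3f tokens/s\n", tps                         # 1.063
    printf "speedup over 0.66 tokens/s: %.0f%%\n", (tps / 0.66 - 1) * 100  # 61%
}'
```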
Note

Intel Turbo Boost together with the intel_pstate driver can reach 3.1 GHz in boost mode, 24% faster than the default 2.5 GHz. In theory, inference speed could therefore reach about 1.32 tokens/s.
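The 1.32 tokens/s estimate assumes throughput scales linearly with clock frequency, which is optimistic for a workload that is largely memory-bandwidth-bound, but the arithmetic itself checks out:

```shell
# Scale the measured 1.063 tokens/s by the 3.1 GHz / 2.5 GHz frequency ratio.
awk 'BEGIN { printf "%.2f tokens/s\n", 1.063 * (3.1 / 2.5) }'
```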