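Ollama anomaly: many kworker threads stuck in D state

While running large models with Ollama on an AMD GPU, I sent a request to Qwen2.5-coder and found that the ollama client stopped responding for a long time and then exited abruptly (the process was shown as defunct).

Checking the system at that point showed an extremely high load, with the load average above 250, while the CPU was almost entirely idle.

Checking the top output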
top - 12:02:57 up  4:20,  2 users,  load average: 253.00, 251.90, 225.63
Tasks: 611 total,   1 running, 609 sleeping,   0 stopped,   1 zombie
%Cpu(s):  0.4 us,  0.4 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 773510.4 total, 733498.6 free,   7075.1 used,  37734.3 buff/cache
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used. 766435.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   3254 huatai    20   0   11972   5568   3456 R  18.2   0.0   0:00.03 top
      1 root      20   0   22496  12260   8804 S   0.0   0.0   0:03.95 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.04 kthreadd
      3 root      20   0       0      0      0 S   0.0   0.0   0:00.00 pool_workqueue_release
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/R-rcu_g
      5 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/R-rcu_p
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/R-slub_
      7 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/R-netns
      9 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-events_highpri
     11 root      20   0       0      0      0 I   0.0   0.0   0:11.16 kworker/u48:0-ext4-rsv-conversion
     12 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/R-mm_pe
     ...

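Since the CPU is under no pressure but the load is high, some tasks must be stuck in D state (uninterruptible sleep): on Linux the load average counts tasks in uninterruptible sleep as well as runnable ones.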
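
To make that reasoning concrete, here is a minimal sketch that tallies tasks by their state letter. The function name count_task_states is made up for this note; it only assumes that each input line starts with a state code such as those printed by ps -eo stat=.

```shell
#!/usr/bin/env bash
# count_task_states (hypothetical helper name): read one state code per line
# on stdin and print "<state letter> <count>" for each state, sorted by letter.
# Only the first character is used, so "R+" and "I<" count as R and I.
count_task_states() {
    awk '{ n[substr($1, 1, 1)]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Typical use on a live system (not executed here):
#   ps -eo stat= | count_task_states
```

If the D count is close to the load average while R stays near zero, the load comes from blocked tasks rather than from CPU work.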

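  • Check which processes are stuck in D state:

Using ps to list D-state processes
ps r -A
  • For each such process, check its kernel stack:

Checking the stacks of D-state processes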
huatai@zcloud:~$ sudo cat /proc/237/stack
[<0>] dma_fence_default_wait+0x1e1/0x220
[<0>] dma_fence_wait_timeout+0x116/0x140
[<0>] amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
[<0>] process_one_work+0x184/0x3a0
[<0>] worker_thread+0x306/0x440
[<0>] kthread+0xf2/0x120
[<0>] ret_from_fork+0x47/0x70
[<0>] ret_from_fork_asm+0x1b/0x30

huatai@zcloud:~$ sudo cat /proc/248/stack
[<0>] dma_fence_default_wait+0x1e1/0x220
[<0>] dma_fence_wait_timeout+0x116/0x140
[<0>] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[<0>] ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
[<0>] process_one_work+0x184/0x3a0
[<0>] worker_thread+0x306/0x440
[<0>] kthread+0xf2/0x120
[<0>] ret_from_fork+0x47/0x70
[<0>] ret_from_fork_asm+0x1b/0x30

huatai@zcloud:~$ sudo cat /proc/254/stack
[<0>] dma_fence_default_wait+0x1e1/0x220
[<0>] dma_fence_wait_timeout+0x116/0x140
[<0>] amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
[<0>] process_one_work+0x184/0x3a0
[<0>] worker_thread+0x306/0x440
[<0>] kthread+0xf2/0x120
[<0>] ret_from_fork+0x47/0x70
[<0>] ret_from_fork_asm+0x1b/0x30

All of them are amdgpu-related kernel workers blocked waiting on a DMA fence (dma_fence_wait_timeout).
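
With many blocked workers, reading every stack by hand gets tedious. The sketch below pulls out the first frame that belongs to a loadable module, which is usually the most telling line. The function name first_module_frame is hypothetical, and it assumes the /proc/PID/stack layout shown above, where module frames end in "[module]".

```shell
#!/usr/bin/env bash
# first_module_frame (hypothetical helper name): read a /proc/<pid>/stack dump
# on stdin and print the first frame that ends in "[module]", as
# "<function> [module]". Frames of the core kernel carry no module tag.
first_module_frame() {
    awk '!found && /\[[A-Za-z0-9_]+\]$/ {
             split($2, part, "+")   # "func+0x29/0x140" -> "func"
             print part[1], $NF
             found = 1
         }'
}

# Typical use on a live system (reading the stack needs root; not executed here):
#   sudo cat /proc/237/stack | first_module_frame
```

For the stacks above this yields amdgpu_tlb_fence_work [amdgpu] for PIDs 237 and 254, and amddma_resv_wait_timeout [amdkcl] for PID 248.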

  • dmesg shows the corresponding hung-task reports:

Hung-task log from dmesg
[ 6585.075781] amdgpu 0000:0d:00.0: Using 44-bit DMA addresses
[ 6896.822820] amdgpu: Freeing queue vital buffer 0x75acbea00000, queue evicted
[ 6896.822833] amdgpu: Freeing queue vital buffer 0x75acda800000, queue evicted
[ 6896.822837] amdgpu: Freeing queue vital buffer 0x75ad21400000, queue evicted
[ 6896.822840] amdgpu: Freeing queue vital buffer 0x75ad22a00000, queue evicted
[ 6896.899863] [drm:amddrm_sched_entity_push_job [amd_sched]] *ERROR* Trying to push to a killed entity
[ 7136.555072] INFO: task kworker/11:1:237 blocked for more than 122 seconds.
[ 7136.555112]       Tainted: G           OE      6.8.0-71-generic #71-Ubuntu
[ 7136.555137] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.555163] task:kworker/11:1    state:D stack:0     pid:237   tgid:237   ppid:2      flags:0x00004000
[ 7136.555173] Workqueue: events amdgpu_tlb_fence_work [amdgpu]
[ 7136.556089] Call Trace:
[ 7136.556093]  <TASK>
[ 7136.556100]  __schedule+0x27c/0x6b0
[ 7136.556115]  schedule+0x33/0x110
[ 7136.556121]  schedule_timeout+0x157/0x170
[ 7136.556131]  dma_fence_default_wait+0x1e1/0x220
[ 7136.556144]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.556152]  dma_fence_wait_timeout+0x116/0x140
[ 7136.556160]  amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
[ 7136.556871]  process_one_work+0x184/0x3a0
[ 7136.556882]  worker_thread+0x306/0x440
[ 7136.556888]  ? __pfx_worker_thread+0x10/0x10
[ 7136.556894]  kthread+0xf2/0x120
[ 7136.556904]  ? __pfx_kthread+0x10/0x10
[ 7136.556912]  ret_from_fork+0x47/0x70
[ 7136.556922]  ? __pfx_kthread+0x10/0x10
[ 7136.556929]  ret_from_fork_asm+0x1b/0x30
[ 7136.556940]  </TASK>
[ 7136.556943] INFO: task kworker/u53:0:248 blocked for more than 122 seconds.
[ 7136.556973]       Tainted: G           OE      6.8.0-71-generic #71-Ubuntu
[ 7136.556999] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.557036] task:kworker/u53:0   state:D stack:0     pid:248   tgid:248   ppid:2      flags:0x00004000
[ 7136.557045] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
[ 7136.557070] Call Trace:
[ 7136.557072]  <TASK>
[ 7136.557076]  __schedule+0x27c/0x6b0
[ 7136.557086]  schedule+0x33/0x110
[ 7136.557093]  schedule_timeout+0x157/0x170
[ 7136.557100]  dma_fence_default_wait+0x1e1/0x220
[ 7136.557108]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.557116]  dma_fence_wait_timeout+0x116/0x140
[ 7136.557127]  amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[ 7136.557141]  ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
[ 7136.557157]  process_one_work+0x184/0x3a0
[ 7136.557164]  worker_thread+0x306/0x440
[ 7136.557170]  ? __pfx_worker_thread+0x10/0x10
[ 7136.557176]  kthread+0xf2/0x120
[ 7136.557184]  ? __pfx_kthread+0x10/0x10
[ 7136.557192]  ret_from_fork+0x47/0x70
[ 7136.557199]  ? __pfx_kthread+0x10/0x10
[ 7136.557207]  ret_from_fork_asm+0x1b/0x30
[ 7136.557215]  </TASK>
[ 7136.557221] INFO: task kworker/11:2:335 blocked for more than 122 seconds.
[ 7136.557250]       Tainted: G           OE      6.8.0-71-generic #71-Ubuntu
[ 7136.557275] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.557301] task:kworker/11:2    state:D stack:0     pid:335   tgid:335   ppid:2      flags:0x00004000
[ 7136.557309] Workqueue: events amdgpu_tlb_fence_work [amdgpu]
[ 7136.558054] Call Trace:
[ 7136.558057]  <TASK>
[ 7136.558061]  __schedule+0x27c/0x6b0
[ 7136.558070]  schedule+0x33/0x110
[ 7136.558077]  schedule_timeout+0x157/0x170
[ 7136.558084]  dma_fence_default_wait+0x1e1/0x220
[ 7136.558092]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.558100]  dma_fence_wait_timeout+0x116/0x140
[ 7136.558109]  amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
[ 7136.558843]  process_one_work+0x184/0x3a0
[ 7136.558850]  worker_thread+0x306/0x440
[ 7136.558855]  ? _raw_spin_lock_irqsave+0xe/0x20
[ 7136.558863]  ? __pfx_worker_thread+0x10/0x10
[ 7136.558868]  kthread+0xf2/0x120
[ 7136.558876]  ? __pfx_kthread+0x10/0x10
[ 7136.558884]  ret_from_fork+0x47/0x70
[ 7136.558891]  ? __pfx_kthread+0x10/0x10
[ 7136.558898]  ret_from_fork_asm+0x1b/0x30
[ 7136.558906]  </TASK>
[ 7136.558935] INFO: task kworker/11:0:2667 blocked for more than 122 seconds.
[ 7136.558965]       Tainted: G           OE      6.8.0-71-generic #71-Ubuntu
[ 7136.558992] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.559026] task:kworker/11:0    state:D stack:0     pid:2667  tgid:2667  ppid:2      flags:0x00004000
[ 7136.559035] Workqueue: events amdgpu_tlb_fence_work [amdgpu]
[ 7136.559772] Call Trace:
[ 7136.559774]  <TASK>
[ 7136.559777]  __schedule+0x27c/0x6b0
[ 7136.559786]  schedule+0x33/0x110
[ 7136.559793]  schedule_timeout+0x157/0x170
[ 7136.559800]  dma_fence_default_wait+0x1e1/0x220
[ 7136.559808]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.559815]  dma_fence_wait_timeout+0x116/0x140
[ 7136.559824]  amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
[ 7136.560565]  process_one_work+0x184/0x3a0
[ 7136.560573]  worker_thread+0x306/0x440
[ 7136.560579]  ? _raw_spin_lock_irqsave+0xe/0x20
[ 7136.560586]  ? __pfx_worker_thread+0x10/0x10
[ 7136.560592]  kthread+0xf2/0x120
[ 7136.560600]  ? __pfx_kthread+0x10/0x10
[ 7136.560609]  ret_from_fork+0x47/0x70
[ 7136.560616]  ? __pfx_kthread+0x10/0x10
[ 7136.560623]  ret_from_fork_asm+0x1b/0x30
[ 7136.560632]  </TASK>
[ 7136.560635] INFO: task kworker/u53:1:2671 blocked for more than 122 seconds.
[ 7136.560664]       Tainted: G           OE      6.8.0-71-generic #71-Ubuntu
[ 7136.560690] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.560716] task:kworker/u53:1   state:D stack:0     pid:2671  tgid:2671  ppid:2      flags:0x00004000
[ 7136.560724] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
[ 7136.560742] Call Trace:
[ 7136.560744]  <TASK>
[ 7136.560747]  __schedule+0x27c/0x6b0
[ 7136.560756]  schedule+0x33/0x110
[ 7136.560762]  schedule_timeout+0x157/0x170
[ 7136.560769]  dma_fence_default_wait+0x1e1/0x220
[ 7136.560777]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.560784]  dma_fence_wait_timeout+0x116/0x140
[ 7136.560793]  amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[ 7136.560804]  ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
[ 7136.560820]  process_one_work+0x184/0x3a0
[ 7136.560827]  worker_thread+0x306/0x440
[ 7136.560833]  ? _raw_spin_lock_irqsave+0xe/0x20
[ 7136.560840]  ? __pfx_worker_thread+0x10/0x10
[ 7136.560846]  kthread+0xf2/0x120
[ 7136.560853]  ? __pfx_kthread+0x10/0x10
[ 7136.560861]  ret_from_fork+0x47/0x70
[ 7136.560868]  ? __pfx_kthread+0x10/0x10
[ 7136.560875]  ret_from_fork_asm+0x1b/0x30
[ 7136.560883]  </TASK>
[ 7136.560886] INFO: task kworker/u53:2:2672 blocked for more than 122 seconds.
[ 7136.560914]       Tainted: G           OE      6.8.0-71-generic #71-Ubuntu
[ 7136.560939] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.560965] task:kworker/u53:2   state:D stack:0     pid:2672  tgid:2672  ppid:2      flags:0x00004000
[ 7136.560972] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
[ 7136.560990] Call Trace:
[ 7136.560992]  <TASK>
[ 7136.560995]  __schedule+0x27c/0x6b0
[ 7136.561004]  schedule+0x33/0x110
[ 7136.561010]  schedule_timeout+0x157/0x170
[ 7136.561017]  dma_fence_default_wait+0x1e1/0x220
[ 7136.561033]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.561041]  dma_fence_wait_timeout+0x116/0x140
[ 7136.561050]  amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[ 7136.561061]  ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
[ 7136.561078]  process_one_work+0x184/0x3a0
[ 7136.561085]  worker_thread+0x306/0x440
[ 7136.561092]  ? __pfx_worker_thread+0x10/0x10
[ 7136.561098]  kthread+0xf2/0x120
[ 7136.561106]  ? __pfx_kthread+0x10/0x10
[ 7136.561114]  ret_from_fork+0x47/0x70
[ 7136.561121]  ? __pfx_kthread+0x10/0x10
[ 7136.561128]  ret_from_fork_asm+0x1b/0x30
[ 7136.561136]  </TASK>
[ 7136.561139] INFO: task kworker/u53:3:2673 blocked for more than 122 seconds.
[ 7136.561167]       Tainted: G           OE      6.8.0-71-generic #71-Ubuntu
[ 7136.561193] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.561219] task:kworker/u53:3   state:D stack:0     pid:2673  tgid:2673  ppid:2      flags:0x00004000
[ 7136.561226] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
[ 7136.561244] Call Trace:
[ 7136.561246]  <TASK>
[ 7136.561249]  __schedule+0x27c/0x6b0
[ 7136.561259]  schedule+0x33/0x110
[ 7136.561265]  schedule_timeout+0x157/0x170
[ 7136.561273]  dma_fence_default_wait+0x1e1/0x220
[ 7136.561281]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.561288]  dma_fence_wait_timeout+0x116/0x140
[ 7136.561297]  amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[ 7136.561307]  ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
[ 7136.561323]  process_one_work+0x184/0x3a0
[ 7136.561330]  worker_thread+0x306/0x440
[ 7136.561336]  ? __pfx_worker_thread+0x10/0x10
[ 7136.561342]  kthread+0xf2/0x120
[ 7136.561349]  ? __pfx_kthread+0x10/0x10
[ 7136.561357]  ret_from_fork+0x47/0x70
[ 7136.561364]  ? __pfx_kthread+0x10/0x10
[ 7136.561371]  ret_from_fork_asm+0x1b/0x30
[ 7136.561379]  </TASK>
[ 7136.561381] INFO: task kworker/u53:4:2674 blocked for more than 122 seconds.
[ 7136.561409]       Tainted: G           OE      6.8.0-71-generic #71-Ubuntu
[ 7136.561434] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.561460] task:kworker/u53:4   state:D stack:0     pid:2674  tgid:2674  ppid:2      flags:0x00004000
[ 7136.561467] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
[ 7136.561484] Call Trace:
[ 7136.561486]  <TASK>
[ 7136.561489]  __schedule+0x27c/0x6b0
[ 7136.561498]  schedule+0x33/0x110
[ 7136.561504]  schedule_timeout+0x157/0x170
[ 7136.561511]  dma_fence_default_wait+0x1e1/0x220
[ 7136.561519]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.561526]  dma_fence_wait_timeout+0x116/0x140
[ 7136.561535]  amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[ 7136.561545]  ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
[ 7136.561561]  process_one_work+0x184/0x3a0
[ 7136.561568]  worker_thread+0x306/0x440
[ 7136.561574]  ? __pfx_worker_thread+0x10/0x10
[ 7136.561580]  kthread+0xf2/0x120
[ 7136.561588]  ? __pfx_kthread+0x10/0x10
[ 7136.561596]  ret_from_fork+0x47/0x70
[ 7136.561603]  ? __pfx_kthread+0x10/0x10
[ 7136.561610]  ret_from_fork_asm+0x1b/0x30
[ 7136.561618]  </TASK>
[ 7136.561622] INFO: task kworker/11:3:2685 blocked for more than 122 seconds.
[ 7136.561650]       Tainted: G           OE      6.8.0-71-generic #71-Ubuntu
[ 7136.561676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.561702] task:kworker/11:3    state:D stack:0     pid:2685  tgid:2685  ppid:2      flags:0x00004000
[ 7136.561708] Workqueue: events amdgpu_tlb_fence_work [amdgpu]
[ 7136.562457] Call Trace:
[ 7136.562460]  <TASK>
[ 7136.562463]  __schedule+0x27c/0x6b0
[ 7136.562473]  schedule+0x33/0x110
[ 7136.562480]  schedule_timeout+0x157/0x170
[ 7136.562487]  dma_fence_default_wait+0x1e1/0x220
[ 7136.562495]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.562503]  dma_fence_wait_timeout+0x116/0x140
[ 7136.562512]  amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
[ 7136.563254]  process_one_work+0x184/0x3a0
[ 7136.563262]  worker_thread+0x306/0x440
[ 7136.563269]  ? __pfx_worker_thread+0x10/0x10
[ 7136.563274]  kthread+0xf2/0x120
[ 7136.563282]  ? __pfx_kthread+0x10/0x10
[ 7136.563290]  ret_from_fork+0x47/0x70
[ 7136.563297]  ? __pfx_kthread+0x10/0x10
[ 7136.563305]  ret_from_fork_asm+0x1b/0x30
[ 7136.563313]  </TASK>
[ 7136.563316] INFO: task kworker/11:4:2686 blocked for more than 122 seconds.
[ 7136.563346]       Tainted: G           OE      6.8.0-71-generic #71-Ubuntu
[ 7136.563372] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.563398] task:kworker/11:4    state:D stack:0     pid:2686  tgid:2686  ppid:2      flags:0x00004000
[ 7136.563405] Workqueue: events amdgpu_tlb_fence_work [amdgpu]
[ 7136.564150] Call Trace:
[ 7136.564153]  <TASK>
[ 7136.564156]  __schedule+0x27c/0x6b0
[ 7136.564165]  schedule+0x33/0x110
[ 7136.564171]  schedule_timeout+0x157/0x170
[ 7136.564179]  dma_fence_default_wait+0x1e1/0x220
[ 7136.564187]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.564195]  dma_fence_wait_timeout+0x116/0x140
[ 7136.564204]  amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
[ 7136.564937]  process_one_work+0x184/0x3a0
[ 7136.564944]  worker_thread+0x306/0x440
[ 7136.564950]  ? __pfx_worker_thread+0x10/0x10
[ 7136.564956]  kthread+0xf2/0x120
[ 7136.564964]  ? __pfx_kthread+0x10/0x10
[ 7136.564971]  ret_from_fork+0x47/0x70
[ 7136.564978]  ? __pfx_kthread+0x10/0x10
[ 7136.564985]  ret_from_fork_asm+0x1b/0x30
[ 7136.564993]  </TASK>
[ 7136.564996] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
[10863.287212] amdgpu: Freeing queue vital buffer 0x7c9db0c00000, queue evicted
[10863.287227] amdgpu: Freeing queue vital buffer 0x7c9dcaa00000, queue evicted
[10863.287230] amdgpu: Freeing queue vital buffer 0x7c9de6800000, queue evicted
[10863.287234] amdgpu: Freeing queue vital buffer 0x7c9e2d400000, queue evicted
[10863.287236] amdgpu: Freeing queue vital buffer 0x7c9e2ea00000, queue evicted
[10863.369301] [drm:amddrm_sched_entity_push_job [amd_sched]] *ERROR* Trying to push to a killed entity
[14414.071332] amdgpu: Freeing queue vital buffer 0x7df6b7c00000, queue evicted
[14414.071347] amdgpu: Freeing queue vital buffer 0x7e02a0800000, queue evicted
[14414.071351] amdgpu: Freeing queue vital buffer 0x7e02baa00000, queue evicted
[14414.071353] amdgpu: Freeing queue vital buffer 0x7e02d6800000, queue evicted
[14414.071357] amdgpu: Freeing queue vital buffer 0x7e031ce00000, queue evicted
[14414.149353] [drm:amddrm_sched_entity_push_job [amd_sched]] *ERROR* Trying to push to a killed entity
[14896.781645] workqueue: iova_depot_work_func hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
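
The log is long and repetitive, so a small summary helps. The following is a sketch that groups the reports by their "Workqueue:" line. The function name hung_by_workqueue is made up here, and it assumes that "Workqueue:" lines in the input come only from hung-task reports, as they do in this log.

```shell
#!/usr/bin/env bash
# hung_by_workqueue (hypothetical helper name): read dmesg text on stdin and
# print "<count> <workqueue> <work function> [module]", most frequent first.
hung_by_workqueue() {
    awk '/Workqueue:/ {
             sub(/^.*Workqueue: */, "")   # drop timestamp and label
             n[$0]++
         }
         END { for (w in n) print n[w], w }' | sort -rn
}

# Typical use on a live system (may need root; not executed here):
#   sudo dmesg | hung_by_workqueue
```

On the log above this reports 5 entries for events amdgpu_tlb_fence_work [amdgpu] and 5 for ttm ttm_bo_delayed_delete [amdttm]. Since the kernel suppressed further hung-task reports after these ten, the real number of blocked workers is higher than what dmesg shows.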

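Judging by the message amdgpu: Freeing queue vital buffer 0x75acbea00000, queue evicted, the AMD GPU driver (amdgpu) seems to have run into a problem while freeing the memory buffer of a GPU queue that had been evicted. The driver may also have reset the queue or the GPU. Possible explanations:

  • GPU reset: the AMDGPU driver may have initiated a reset of a specific GPU queue (or even the whole GPU) to recover from a detected error

  • Application crash: an application fault or a driver problem may have left the GPU queue unstable or unresponsive, prompting the driver to clear it

  • Hardware problem: less common, but this can also indicate a problem with the GPU hardware itself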

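Note

The amdgpu_tlb_fence_work that appears in dmesg is a work function in the AMDGPU Linux kernel driver that deals with the Translation Lookaside Buffer (TLB). A TLB (also called a page-table cache or address-translation cache) is a hardware cache used in computer architecture to speed up the translation of virtual addresses to physical addresses. It sits in the memory management unit (MMU) and stores recently used mappings from virtual page numbers to physical frame numbers, so that the slow page tables do not have to be walked on every memory access, which improves system performance.

When the CPU accesses a virtual address, it first looks up the corresponding physical address in the TLB. On a TLB miss, the CPU has to walk the page tables in main memory and then adds the mapping to the TLB for subsequent accesses.

TLB invalidation/flush: when a memory mapping changes (for example, a page is freed, remapped, or has its permissions changed), the corresponding TLB entries may become stale or invalid. To avoid using stale entries, the system must invalidate or flush them from the TLB so that the CPU always uses the correct, up-to-date mapping. The fence here denotes synchronization: it ensures that certain operations (such as memory writes or a TLB flush) complete before other operations begin.

amdgpu_tlb_fence_work is presumably the work that keeps memory consistent and correct when GPU memory mappings are modified, updating the GPU's TLB to reflect the latest memory state.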

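Findings

After several attempts, I found the cause:

  • I had started two Qwen2.5-coder instances at the same time: I ran ollama run qwen2.5-coder:32b-instruct-q6_K by hand on the server while also using the VS Code Continue extension connected to Ollama for AI-assisted coding (which also called Qwen2.5-coder)

  • With two models running, the model started later evicts the earlier one, and at that moment a few lines like this appear in the system log: amdgpu: Freeing queue vital buffer 0x75acbea00000, queue evicted

  • After that, a large number of kworker kernel threads end up stuck in D state. Every model switch adds more than a hundred such D-state threads, so the system load keeps climbing and the system becomes sluggish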
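
To line up model switches with the driver messages, the eviction lines can be grouped by timestamp. This is only a sketch: the function name eviction_bursts is made up, and it assumes the default dmesg timestamp format "[seconds.microseconds]" (not the human-readable format of dmesg -T).

```shell
#!/usr/bin/env bash
# eviction_bursts (hypothetical helper name): read dmesg text on stdin and
# print "<uptime in whole seconds> <number of 'queue evicted' lines>",
# one line per second in which evictions were logged, in time order.
eviction_bursts() {
    awk '/queue evicted/ {
             ts = $0
             sub(/^\[ */, "", ts)   # drop the leading "[" and padding spaces
             sub(/\..*/, "", ts)    # keep only the whole seconds
             count[ts]++
         }
         END { for (t in count) print t, count[t] }' | sort -n
}

# Typical use on a live system (may need root; not executed here):
#   sudo dmesg | eviction_bursts
```

On the log above this gives three bursts: 4 lines at 6896 s, 5 at 10863 s and 5 at 14414 s. Each burst is followed by a "Trying to push to a killed entity" error, which fits the picture of one burst per model switch.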

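This may be an AMD driver bug, or it may be a compatibility mismatch between the ROCm bundled with Ollama and the amdgpu driver I installed system-wide. I plan to try running the model with vLLM to see whether that resolves it.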