Ollama anomaly: many kworker threads stuck in D state
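While running large models with Ollama on an AMD GPU, I sent a request to Qwen2.5-coder and found that the ollama client stopped responding for a long time and then exited abruptly (the process showed up as defunct).
Checking the system at that point showed an extremely high load, with a load average above 250, while the CPU was almost completely idle.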
top output:
top - 12:02:57 up 4:20, 2 users, load average: 253.00, 251.90, 225.63
Tasks: 611 total, 1 running, 609 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.4 us, 0.4 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 773510.4 total, 733498.6 free, 7075.1 used, 37734.3 buff/cache
MiB Swap: 8192.0 total, 8192.0 free, 0.0 used. 766435.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3254 huatai 20 0 11972 5568 3456 R 18.2 0.0 0:00.03 top
1 root 20 0 22496 12260 8804 S 0.0 0.0 0:03.95 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.04 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 pool_workqueue_release
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/R-rcu_g
5 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/R-rcu_p
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/R-slub_
7 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/R-netns
9 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H-events_highpri
11 root 20 0 0 0 0 I 0.0 0.0 0:11.16 kworker/u48:0-ext4-rsv-conversion
12 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/R-mm_pe
...
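The CPU is under no pressure, yet the load is high. On Linux, tasks in uninterruptible sleep (state D) count toward the load average, so some processes must be stuck in D state.
Check which processes are stuck in D state: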
Checking for D-state processes with ps:
ps r -A
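The same check can be scripted so that it also reports what each stuck task is waiting on. The following is a minimal sketch, not part of the original investigation: it assumes a procps-style `ps` that accepts `-eo pid,stat,wchan:32,comm`, and the helper names `parse_d_state` and `list_d_state` are made up for illustration.

```python
import subprocess


def parse_d_state(ps_text):
    """Return (pid, stat, wchan, comm) for every task whose state starts with 'D'.

    Expects the text produced by `ps -eo pid,stat,wchan:32,comm`,
    including its header row.
    """
    tasks = []
    for line in ps_text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)       # comm is last, so it may contain spaces
        if len(fields) == 4 and fields[1].startswith("D"):
            pid, stat, wchan, comm = fields
            tasks.append((int(pid), stat, wchan, comm.strip()))
    return tasks


def list_d_state():
    """Run ps on the local machine and return its D-state tasks."""
    out = subprocess.run(
        ["ps", "-eo", "pid,stat,wchan:32,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_d_state(out)
```

Calling `list_d_state()` on the affected machine should return one tuple per stuck task. The `wchan` column is only a hint about where the task sleeps; the full kernel stack is still the more reliable source.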
Then inspect the kernel stack of each of those processes:
huatai@zcloud:~$ sudo cat /proc/237/stack
[<0>] dma_fence_default_wait+0x1e1/0x220
[<0>] dma_fence_wait_timeout+0x116/0x140
[<0>] amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
[<0>] process_one_work+0x184/0x3a0
[<0>] worker_thread+0x306/0x440
[<0>] kthread+0xf2/0x120
[<0>] ret_from_fork+0x47/0x70
[<0>] ret_from_fork_asm+0x1b/0x30
huatai@zcloud:~$ sudo cat /proc/248/stack
[<0>] dma_fence_default_wait+0x1e1/0x220
[<0>] dma_fence_wait_timeout+0x116/0x140
[<0>] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[<0>] ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
[<0>] process_one_work+0x184/0x3a0
[<0>] worker_thread+0x306/0x440
[<0>] kthread+0xf2/0x120
[<0>] ret_from_fork+0x47/0x70
[<0>] ret_from_fork_asm+0x1b/0x30
huatai@zcloud:~$ sudo cat /proc/254/stack
[<0>] dma_fence_default_wait+0x1e1/0x220
[<0>] dma_fence_wait_timeout+0x116/0x140
[<0>] amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
[<0>] process_one_work+0x184/0x3a0
[<0>] worker_thread+0x306/0x440
[<0>] kthread+0xf2/0x120
[<0>] ret_from_fork+0x47/0x70
[<0>] ret_from_fork_asm+0x1b/0x30
As the stacks show, every one of these threads is stuck waiting inside amdgpu-related code.
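With hundreds of stuck threads, reading each stack by hand does not scale. The sketch below groups stacks by the outermost frame that belongs to a loadable module, which for a kworker is the work function that got stuck. It assumes the `/proc/<pid>/stack` line format shown above; `module_frames`, `summarize`, and `read_stack` are illustrative names, and reading the stack files requires root.

```python
import re
from collections import Counter

# Matches lines such as: [<0>] amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
FRAME_RE = re.compile(
    r"^\[<[0-9a-f]+>\]\s+(\S+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?\s*$"
)


def module_frames(stack_text):
    """Return (symbol, module) for each frame that carries a [module] tag."""
    frames = []
    for line in stack_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m and m.group(2):
            frames.append((m.group(1), m.group(2)))
    return frames


def summarize(stacks):
    """Count stacks by their outermost module frame.

    stacks maps pid -> text of /proc/<pid>/stack. Frames are listed
    innermost first, so the last tagged frame is the outermost one.
    """
    counts = Counter()
    for text in stacks.values():
        frames = module_frames(text)
        counts[frames[-1] if frames else ("<core kernel>", "-")] += 1
    return counts


def read_stack(pid):
    """Read the kernel stack of one task (needs root)."""
    with open(f"/proc/{pid}/stack") as f:
        return f.read()
```

For the three stacks above, this groups PIDs 237 and 254 under `amdgpu_tlb_fence_work` and PID 248 under `ttm_bo_delayed_delete`.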
Checking dmesg shows the hung-task reports:
dmesg hang log:
[ 6585.075781] amdgpu 0000:0d:00.0: Using 44-bit DMA addresses
[ 6896.822820] amdgpu: Freeing queue vital buffer 0x75acbea00000, queue evicted
[ 6896.822833] amdgpu: Freeing queue vital buffer 0x75acda800000, queue evicted
[ 6896.822837] amdgpu: Freeing queue vital buffer 0x75ad21400000, queue evicted
[ 6896.822840] amdgpu: Freeing queue vital buffer 0x75ad22a00000, queue evicted
[ 6896.899863] [drm:amddrm_sched_entity_push_job [amd_sched]] *ERROR* Trying to push to a killed entity
[ 7136.555072] INFO: task kworker/11:1:237 blocked for more than 122 seconds.
[ 7136.555112] Tainted: G OE 6.8.0-71-generic #71-Ubuntu
[ 7136.555137] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.555163] task:kworker/11:1 state:D stack:0 pid:237 tgid:237 ppid:2 flags:0x00004000
[ 7136.555173] Workqueue: events amdgpu_tlb_fence_work [amdgpu]
[ 7136.556089] Call Trace:
[ 7136.556093] <TASK>
[ 7136.556100] __schedule+0x27c/0x6b0
[ 7136.556115] schedule+0x33/0x110
[ 7136.556121] schedule_timeout+0x157/0x170
[ 7136.556131] dma_fence_default_wait+0x1e1/0x220
[ 7136.556144] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.556152] dma_fence_wait_timeout+0x116/0x140
[ 7136.556160] amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
[ 7136.556871] process_one_work+0x184/0x3a0
[ 7136.556882] worker_thread+0x306/0x440
[ 7136.556888] ? __pfx_worker_thread+0x10/0x10
[ 7136.556894] kthread+0xf2/0x120
[ 7136.556904] ? __pfx_kthread+0x10/0x10
[ 7136.556912] ret_from_fork+0x47/0x70
[ 7136.556922] ? __pfx_kthread+0x10/0x10
[ 7136.556929] ret_from_fork_asm+0x1b/0x30
[ 7136.556940] </TASK>
[ 7136.556943] INFO: task kworker/u53:0:248 blocked for more than 122 seconds.
[ 7136.556973] Tainted: G OE 6.8.0-71-generic #71-Ubuntu
[ 7136.556999] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.557036] task:kworker/u53:0 state:D stack:0 pid:248 tgid:248 ppid:2 flags:0x00004000
[ 7136.557045] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
[ 7136.557070] Call Trace:
[ 7136.557072] <TASK>
[ 7136.557076] __schedule+0x27c/0x6b0
[ 7136.557086] schedule+0x33/0x110
[ 7136.557093] schedule_timeout+0x157/0x170
[ 7136.557100] dma_fence_default_wait+0x1e1/0x220
[ 7136.557108] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.557116] dma_fence_wait_timeout+0x116/0x140
[ 7136.557127] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[ 7136.557141] ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
[ 7136.557157] process_one_work+0x184/0x3a0
[ 7136.557164] worker_thread+0x306/0x440
[ 7136.557170] ? __pfx_worker_thread+0x10/0x10
[ 7136.557176] kthread+0xf2/0x120
[ 7136.557184] ? __pfx_kthread+0x10/0x10
[ 7136.557192] ret_from_fork+0x47/0x70
[ 7136.557199] ? __pfx_kthread+0x10/0x10
[ 7136.557207] ret_from_fork_asm+0x1b/0x30
[ 7136.557215] </TASK>
[ 7136.557221] INFO: task kworker/11:2:335 blocked for more than 122 seconds.
[ 7136.557250] Tainted: G OE 6.8.0-71-generic #71-Ubuntu
[ 7136.557275] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.557301] task:kworker/11:2 state:D stack:0 pid:335 tgid:335 ppid:2 flags:0x00004000
[ 7136.557309] Workqueue: events amdgpu_tlb_fence_work [amdgpu]
[ 7136.558054] Call Trace:
[ 7136.558057] <TASK>
[ 7136.558061] __schedule+0x27c/0x6b0
[ 7136.558070] schedule+0x33/0x110
[ 7136.558077] schedule_timeout+0x157/0x170
[ 7136.558084] dma_fence_default_wait+0x1e1/0x220
[ 7136.558092] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.558100] dma_fence_wait_timeout+0x116/0x140
[ 7136.558109] amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
[ 7136.558843] process_one_work+0x184/0x3a0
[ 7136.558850] worker_thread+0x306/0x440
[ 7136.558855] ? _raw_spin_lock_irqsave+0xe/0x20
[ 7136.558863] ? __pfx_worker_thread+0x10/0x10
[ 7136.558868] kthread+0xf2/0x120
[ 7136.558876] ? __pfx_kthread+0x10/0x10
[ 7136.558884] ret_from_fork+0x47/0x70
[ 7136.558891] ? __pfx_kthread+0x10/0x10
[ 7136.558898] ret_from_fork_asm+0x1b/0x30
[ 7136.558906] </TASK>
[ 7136.558935] INFO: task kworker/11:0:2667 blocked for more than 122 seconds.
[ 7136.558965] Tainted: G OE 6.8.0-71-generic #71-Ubuntu
[ 7136.558992] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.559026] task:kworker/11:0 state:D stack:0 pid:2667 tgid:2667 ppid:2 flags:0x00004000
[ 7136.559035] Workqueue: events amdgpu_tlb_fence_work [amdgpu]
[ 7136.559772] Call Trace:
[ 7136.559774] <TASK>
[ 7136.559777] __schedule+0x27c/0x6b0
[ 7136.559786] schedule+0x33/0x110
[ 7136.559793] schedule_timeout+0x157/0x170
[ 7136.559800] dma_fence_default_wait+0x1e1/0x220
[ 7136.559808] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.559815] dma_fence_wait_timeout+0x116/0x140
[ 7136.559824] amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
[ 7136.560565] process_one_work+0x184/0x3a0
[ 7136.560573] worker_thread+0x306/0x440
[ 7136.560579] ? _raw_spin_lock_irqsave+0xe/0x20
[ 7136.560586] ? __pfx_worker_thread+0x10/0x10
[ 7136.560592] kthread+0xf2/0x120
[ 7136.560600] ? __pfx_kthread+0x10/0x10
[ 7136.560609] ret_from_fork+0x47/0x70
[ 7136.560616] ? __pfx_kthread+0x10/0x10
[ 7136.560623] ret_from_fork_asm+0x1b/0x30
[ 7136.560632] </TASK>
[ 7136.560635] INFO: task kworker/u53:1:2671 blocked for more than 122 seconds.
[ 7136.560664] Tainted: G OE 6.8.0-71-generic #71-Ubuntu
[ 7136.560690] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.560716] task:kworker/u53:1 state:D stack:0 pid:2671 tgid:2671 ppid:2 flags:0x00004000
[ 7136.560724] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
[ 7136.560742] Call Trace:
[ 7136.560744] <TASK>
[ 7136.560747] __schedule+0x27c/0x6b0
[ 7136.560756] schedule+0x33/0x110
[ 7136.560762] schedule_timeout+0x157/0x170
[ 7136.560769] dma_fence_default_wait+0x1e1/0x220
[ 7136.560777] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.560784] dma_fence_wait_timeout+0x116/0x140
[ 7136.560793] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[ 7136.560804] ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
[ 7136.560820] process_one_work+0x184/0x3a0
[ 7136.560827] worker_thread+0x306/0x440
[ 7136.560833] ? _raw_spin_lock_irqsave+0xe/0x20
[ 7136.560840] ? __pfx_worker_thread+0x10/0x10
[ 7136.560846] kthread+0xf2/0x120
[ 7136.560853] ? __pfx_kthread+0x10/0x10
[ 7136.560861] ret_from_fork+0x47/0x70
[ 7136.560868] ? __pfx_kthread+0x10/0x10
[ 7136.560875] ret_from_fork_asm+0x1b/0x30
[ 7136.560883] </TASK>
[ 7136.560886] INFO: task kworker/u53:2:2672 blocked for more than 122 seconds.
[ 7136.560914] Tainted: G OE 6.8.0-71-generic #71-Ubuntu
[ 7136.560939] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.560965] task:kworker/u53:2 state:D stack:0 pid:2672 tgid:2672 ppid:2 flags:0x00004000
[ 7136.560972] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
[ 7136.560990] Call Trace:
[ 7136.560992] <TASK>
[ 7136.560995] __schedule+0x27c/0x6b0
[ 7136.561004] schedule+0x33/0x110
[ 7136.561010] schedule_timeout+0x157/0x170
[ 7136.561017] dma_fence_default_wait+0x1e1/0x220
[ 7136.561033] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.561041] dma_fence_wait_timeout+0x116/0x140
[ 7136.561050] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[ 7136.561061] ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
[ 7136.561078] process_one_work+0x184/0x3a0
[ 7136.561085] worker_thread+0x306/0x440
[ 7136.561092] ? __pfx_worker_thread+0x10/0x10
[ 7136.561098] kthread+0xf2/0x120
[ 7136.561106] ? __pfx_kthread+0x10/0x10
[ 7136.561114] ret_from_fork+0x47/0x70
[ 7136.561121] ? __pfx_kthread+0x10/0x10
[ 7136.561128] ret_from_fork_asm+0x1b/0x30
[ 7136.561136] </TASK>
[ 7136.561139] INFO: task kworker/u53:3:2673 blocked for more than 122 seconds.
[ 7136.561167] Tainted: G OE 6.8.0-71-generic #71-Ubuntu
[ 7136.561193] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.561219] task:kworker/u53:3 state:D stack:0 pid:2673 tgid:2673 ppid:2 flags:0x00004000
[ 7136.561226] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
[ 7136.561244] Call Trace:
[ 7136.561246] <TASK>
[ 7136.561249] __schedule+0x27c/0x6b0
[ 7136.561259] schedule+0x33/0x110
[ 7136.561265] schedule_timeout+0x157/0x170
[ 7136.561273] dma_fence_default_wait+0x1e1/0x220
[ 7136.561281] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.561288] dma_fence_wait_timeout+0x116/0x140
[ 7136.561297] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[ 7136.561307] ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
[ 7136.561323] process_one_work+0x184/0x3a0
[ 7136.561330] worker_thread+0x306/0x440
[ 7136.561336] ? __pfx_worker_thread+0x10/0x10
[ 7136.561342] kthread+0xf2/0x120
[ 7136.561349] ? __pfx_kthread+0x10/0x10
[ 7136.561357] ret_from_fork+0x47/0x70
[ 7136.561364] ? __pfx_kthread+0x10/0x10
[ 7136.561371] ret_from_fork_asm+0x1b/0x30
[ 7136.561379] </TASK>
[ 7136.561381] INFO: task kworker/u53:4:2674 blocked for more than 122 seconds.
[ 7136.561409] Tainted: G OE 6.8.0-71-generic #71-Ubuntu
[ 7136.561434] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.561460] task:kworker/u53:4 state:D stack:0 pid:2674 tgid:2674 ppid:2 flags:0x00004000
[ 7136.561467] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
[ 7136.561484] Call Trace:
[ 7136.561486] <TASK>
[ 7136.561489] __schedule+0x27c/0x6b0
[ 7136.561498] schedule+0x33/0x110
[ 7136.561504] schedule_timeout+0x157/0x170
[ 7136.561511] dma_fence_default_wait+0x1e1/0x220
[ 7136.561519] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.561526] dma_fence_wait_timeout+0x116/0x140
[ 7136.561535] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[ 7136.561545] ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
[ 7136.561561] process_one_work+0x184/0x3a0
[ 7136.561568] worker_thread+0x306/0x440
[ 7136.561574] ? __pfx_worker_thread+0x10/0x10
[ 7136.561580] kthread+0xf2/0x120
[ 7136.561588] ? __pfx_kthread+0x10/0x10
[ 7136.561596] ret_from_fork+0x47/0x70
[ 7136.561603] ? __pfx_kthread+0x10/0x10
[ 7136.561610] ret_from_fork_asm+0x1b/0x30
[ 7136.561618] </TASK>
[ 7136.561622] INFO: task kworker/11:3:2685 blocked for more than 122 seconds.
[ 7136.561650] Tainted: G OE 6.8.0-71-generic #71-Ubuntu
[ 7136.561676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.561702] task:kworker/11:3 state:D stack:0 pid:2685 tgid:2685 ppid:2 flags:0x00004000
[ 7136.561708] Workqueue: events amdgpu_tlb_fence_work [amdgpu]
[ 7136.562457] Call Trace:
[ 7136.562460] <TASK>
[ 7136.562463] __schedule+0x27c/0x6b0
[ 7136.562473] schedule+0x33/0x110
[ 7136.562480] schedule_timeout+0x157/0x170
[ 7136.562487] dma_fence_default_wait+0x1e1/0x220
[ 7136.562495] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.562503] dma_fence_wait_timeout+0x116/0x140
[ 7136.562512] amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
[ 7136.563254] process_one_work+0x184/0x3a0
[ 7136.563262] worker_thread+0x306/0x440
[ 7136.563269] ? __pfx_worker_thread+0x10/0x10
[ 7136.563274] kthread+0xf2/0x120
[ 7136.563282] ? __pfx_kthread+0x10/0x10
[ 7136.563290] ret_from_fork+0x47/0x70
[ 7136.563297] ? __pfx_kthread+0x10/0x10
[ 7136.563305] ret_from_fork_asm+0x1b/0x30
[ 7136.563313] </TASK>
[ 7136.563316] INFO: task kworker/11:4:2686 blocked for more than 122 seconds.
[ 7136.563346] Tainted: G OE 6.8.0-71-generic #71-Ubuntu
[ 7136.563372] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7136.563398] task:kworker/11:4 state:D stack:0 pid:2686 tgid:2686 ppid:2 flags:0x00004000
[ 7136.563405] Workqueue: events amdgpu_tlb_fence_work [amdgpu]
[ 7136.564150] Call Trace:
[ 7136.564153] <TASK>
[ 7136.564156] __schedule+0x27c/0x6b0
[ 7136.564165] schedule+0x33/0x110
[ 7136.564171] schedule_timeout+0x157/0x170
[ 7136.564179] dma_fence_default_wait+0x1e1/0x220
[ 7136.564187] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 7136.564195] dma_fence_wait_timeout+0x116/0x140
[ 7136.564204] amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
[ 7136.564937] process_one_work+0x184/0x3a0
[ 7136.564944] worker_thread+0x306/0x440
[ 7136.564950] ? __pfx_worker_thread+0x10/0x10
[ 7136.564956] kthread+0xf2/0x120
[ 7136.564964] ? __pfx_kthread+0x10/0x10
[ 7136.564971] ret_from_fork+0x47/0x70
[ 7136.564978] ? __pfx_kthread+0x10/0x10
[ 7136.564985] ret_from_fork_asm+0x1b/0x30
[ 7136.564993] </TASK>
[ 7136.564996] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
[10863.287212] amdgpu: Freeing queue vital buffer 0x7c9db0c00000, queue evicted
[10863.287227] amdgpu: Freeing queue vital buffer 0x7c9dcaa00000, queue evicted
[10863.287230] amdgpu: Freeing queue vital buffer 0x7c9de6800000, queue evicted
[10863.287234] amdgpu: Freeing queue vital buffer 0x7c9e2d400000, queue evicted
[10863.287236] amdgpu: Freeing queue vital buffer 0x7c9e2ea00000, queue evicted
[10863.369301] [drm:amddrm_sched_entity_push_job [amd_sched]] *ERROR* Trying to push to a killed entity
[14414.071332] amdgpu: Freeing queue vital buffer 0x7df6b7c00000, queue evicted
[14414.071347] amdgpu: Freeing queue vital buffer 0x7e02a0800000, queue evicted
[14414.071351] amdgpu: Freeing queue vital buffer 0x7e02baa00000, queue evicted
[14414.071353] amdgpu: Freeing queue vital buffer 0x7e02d6800000, queue evicted
[14414.071357] amdgpu: Freeing queue vital buffer 0x7e031ce00000, queue evicted
[14414.149353] [drm:amddrm_sched_entity_push_job [amd_sched]] *ERROR* Trying to push to a killed entity
[14896.781645] workqueue: iova_depot_work_func hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
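Judging from the message amdgpu: Freeing queue vital buffer 0x75acbea00000, queue evicted, it appears that the AMD GPU driver (amdgpu) ran into trouble while freeing the memory buffer of a GPU queue that had been evicted. The driver may also have reset the queue or the GPU. Possible causes:
GPU reset: the AMDGPU driver may have initiated a reset of a specific GPU queue (or even the whole GPU) to recover from a detected error.
Application crash: an application fault or a driver problem may have left a GPU queue unstable or unresponsive, prompting the driver to clear it.
Hardware problem: this is less common, but it could also indicate a fault in the GPU hardware itself.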
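To see at a glance which work functions are involved, the hung-task reports in the dmesg output can be tallied. This is a rough sketch that assumes the report layout shown in the log above (an `INFO: task ... blocked for more than N seconds` line followed by a `Workqueue:` line); the function names are illustrative only.

```python
import re
from collections import Counter

TASK_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
WQ_RE = re.compile(r"Workqueue:\s+(\S+)\s+(\S+)(?:\s+\[(\w+)\])?")


def hung_tasks(dmesg_text):
    """Parse hung-task reports into a list of dicts, one per report."""
    reports = []
    for line in dmesg_text.splitlines():
        m = TASK_RE.search(line)
        if m:
            reports.append({
                "comm": m.group(1), "pid": int(m.group(2)),
                "seconds": int(m.group(3)), "work_fn": None, "module": None,
            })
            continue
        m = WQ_RE.search(line)
        # Attach the Workqueue line to the report it follows.
        if m and reports and reports[-1]["work_fn"] is None:
            reports[-1]["work_fn"] = m.group(2)
            reports[-1]["module"] = m.group(3)
    return reports


def tally_by_work_fn(reports):
    """Count reports per work function."""
    return Counter(r["work_fn"] for r in reports)
```

Fed the log excerpt above, `tally_by_work_fn` counts five reports for `amdgpu_tlb_fence_work` and five for `ttm_bo_delayed_delete`. The kernel suppresses further reports after the tenth, so these numbers understate the real count of stuck threads.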
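Note
The amdgpu_tlb_fence_work shown in dmesg is a work function in the AMDGPU Linux kernel driver that deals with the Translation Lookaside Buffer (TLB). A TLB is a hardware cache that speeds up the translation of virtual addresses to physical addresses. It sits in the memory management unit (MMU) and stores recently used mappings from virtual page numbers to physical frame numbers, so that the slow page table does not have to be consulted on every memory access.
TLB as a page-table cache: when the CPU accesses a virtual address, it first looks for the corresponding physical address in the TLB. On a TLB miss, the CPU has to look up the page table in main memory and add the mapping to the TLB for the next access.
TLB invalidation/flush: when a memory mapping changes (for example, a page is freed, remapped, or has its permissions changed), the corresponding TLB entries can become stale or invalid. To avoid using stale entries, the system must invalidate or flush them from the TLB so that only correct, up-to-date mappings are used. The fence here is a synchronization primitive: it ensures that certain operations (such as memory writes or TLB flushes) complete before other operations begin.
amdgpu_tlb_fence_work is presumably responsible for keeping memory consistent and correct when GPU memory mappings are modified, by updating the TLB to reflect the latest memory state.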
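The hit, miss, and invalidation behaviour described in this note can be illustrated with a toy model. This is purely conceptual: it is not how amdgpu or any real MMU is implemented, it does not model fences, and the class `ToyTLB` and the 4 KiB page size are assumptions chosen for the example.

```python
PAGE_SIZE = 4096  # assumed page size for this toy model


class ToyTLB:
    """A dictionary-backed cache in front of a page table."""

    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.entries = {}             # cached subset of the page table
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        """Translate a virtual address, filling the TLB on a miss."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            # Stands in for the slow page table walk.
            self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset

    def remap(self, vpn, new_pfn):
        """Change a mapping and invalidate the now-stale TLB entry."""
        self.page_table[vpn] = new_pfn
        self.entries.pop(vpn, None)
```

If `remap` skipped the invalidation, later calls to `translate` would keep returning addresses in the old frame. That is the kind of stale mapping a TLB flush exists to prevent.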
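Troubleshooting result
After several attempts, I found the cause:
I had started two instances of Qwen2.5-coder at the same time. I had manually run
ollama run qwen2.5-coder:32b-instruct-q6_K
on the server while also using the VS Code extension Continue, connected to Ollama for AI-assisted coding (which calls Qwen2.5-coder as well). With two models in use, the model started later evicts the one started earlier, and at that moment a few lines like this appear in the system log: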
amdgpu: Freeing queue vital buffer 0x75acbea00000, queue evicted
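After that, a large number of kworker kernel threads stuck in D state appear. Every model switch leaves behind more than a hundred of these D-state threads, so the system load keeps climbing and the system becomes sluggish.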
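One way to line up model switches with the log is to group the "queue evicted" messages into bursts by timestamp. The sketch below assumes the dmesg line format shown earlier; the one-second `gap` threshold and the function name `eviction_bursts` are assumptions made for illustration.

```python
import re

EVICT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+amdgpu: Freeing queue vital buffer (0x[0-9a-f]+), queue evicted"
)


def eviction_bursts(dmesg_text, gap=1.0):
    """Group eviction messages whose timestamps are within `gap` seconds.

    Returns a list of dicts with the start and end timestamp of each
    burst and the number of buffers freed in it.
    """
    bursts = []
    for line in dmesg_text.splitlines():
        m = EVICT_RE.search(line)
        if not m:
            continue
        ts = float(m.group(1))
        if bursts and ts - bursts[-1]["end"] <= gap:
            bursts[-1]["end"] = ts
            bursts[-1]["buffers"] += 1
        else:
            bursts.append({"start": ts, "end": ts, "buffers": 1})
    return bursts
```

Applied to the dmesg excerpt in this article, this yields three bursts, at roughly 6896 s, 10863 s, and 14414 s of uptime, with 4, 5, and 5 buffers respectively. Comparing the number of D-state threads before and after each burst would show how much each switch adds.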
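This may be a problem in the AMD driver, or a compatibility mismatch between the ROCm bundled with Ollama and the system-wide amdgpu driver I installed. I plan to try running the model with vLLM to see whether that resolves it.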