[Need help] CPU soft lockup when running LLMs

I have installed the latest CUDA Toolkit (12.3) and the corresponding NVIDIA driver, following NVIDIA’s instructions. The GPUs seem to work fine on some other tasks, but they crash when I try to run an LLM, causing a soft lockup on one CPU.
I’m new to CUDA and the NVIDIA driver, so I can’t work out what’s going on from the attached bug report. I don’t know what else I should provide here, so please leave a message if you think other files are needed to solve this problem.
The crash appears in the kernel log in the attached NVIDIA bug report, starting at line 21437.
nvidia-bug-report.log (3.1 MB)
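
For anyone who wants to find the same spot in their own report, a minimal sketch follows. It assumes the report is an uncompressed text file and that the kernel watchdog message contains the phrase "soft lockup"; the function name find_soft_lockups is made up for illustration.

```shell
# Sketch: list soft-lockup reports in a log file, with line numbers.
# find_soft_lockups is a hypothetical helper name, not part of any NVIDIA tool.
find_soft_lockups() {
    # Default to the file name attached in this thread (an assumption).
    log_file="${1:-nvidia-bug-report.log}"
    # -n prints line numbers, -i ignores case, -a treats the file as text.
    grep -n -i -a "soft lockup" "$log_file"
}

# Usage: find_soft_lockups nvidia-bug-report.log
```

Each match is printed with its line number, so the surrounding call trace can be opened at that position in an editor.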

Looks like a crash in the kernel while doing NUMA-related work, triggered by cuda-evthandlr. The NVIDIA driver doesn’t seem to be involved. I’ve never seen anything like that before; maybe check whether updating the kernel helps.

I used the open kernel module flavour here. Do I have to remove the legacy one?

Not the NVIDIA driver, the Linux kernel, i.e. please update your system: sudo apt update && sudo apt upgrade
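
A small sketch for checking whether the upgrade actually changed the running kernel, assuming an apt-based system as in the command above. The helper name running_kernel is made up for illustration, and the reboot step is an addition: a newly installed kernel is only used after a restart.

```shell
# Sketch: compare the running kernel before and after a system upgrade.
# running_kernel is a hypothetical helper name.
running_kernel() {
    # uname -r prints the release of the kernel that is currently running.
    uname -r
}

echo "Running kernel: $(running_kernel)"

# Then upgrade and reboot so that a newly installed kernel is loaded:
#   sudo apt update && sudo apt upgrade
#   sudo reboot
# After the reboot, run this script again and compare the two versions.
```

If the version is unchanged after the reboot, no newer kernel package was installed by the upgrade.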

Got it, thanks! I’ll try it and see if it works.