I installed the latest CUDA Toolkit (12.3) and the corresponding NVIDIA driver following NVIDIA's instructions. The GPUs seem to work fine on some other tasks, but they crash when I try to run an LLM, causing a soft lockup on one CPU.
I’m new to CUDA and the NVIDIA driver, so I really can’t tell what’s going on from the attached bug report. I don’t know what else I should provide here, so please leave a message if you think other files are needed to solve this problem.
The crash appears in the NVIDIA kernel log files, starting at line 21437.
nvidia-bug-report.log (3.1 MB)
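For anyone hitting the same symptom, one way to locate the relevant part of a large report is to search it for the kernel watchdog's soft-lockup message. This is only a rough sketch: the helper name find_soft_lockups is made up for illustration (it is not part of any NVIDIA tool), and it assumes the report contains the usual "soft lockup" wording.

```shell
# Hypothetical helper: print every line in a log file that mentions a
# soft lockup, prefixed with its line number.
find_soft_lockups() {
    # $1: path to a text log, e.g. nvidia-bug-report.log
    # -a treats the file as text even if it contains stray binary bytes
    # -n prints line numbers, -i ignores case
    grep -a -n -i 'soft lockup' "$1"
}

# Example call (assumes the report is in the current directory):
#   find_soft_lockups nvidia-bug-report.log
```

The line numbers it prints can then be used to jump to the surrounding call trace in an editor.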
Looks like a crash in the kernel while doing NUMA-related work, triggered by cuda-evthandlr. The NVIDIA driver doesn’t seem to be involved. I’ve never seen anything like that before; maybe check whether updating the kernel helps.
Not the NVIDIA driver, the Linux kernel, i.e. please update your system: sudo apt update && sudo apt upgrade
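An upgraded kernel only takes effect after a reboot, so it can be worth comparing the running kernel with the newest one installed. A minimal sketch, assuming a Debian/Ubuntu-style layout where kernel images live at /boot/vmlinuz-<version> and GNU sort is available; newest_version is a made-up helper name.

```shell
# Hypothetical helper: read version strings on stdin (one per line) and
# print the highest one, using GNU sort's version ordering (-V).
newest_version() {
    sort -V | tail -n 1
}

# Example use (assumes kernel images are named /boot/vmlinuz-<version>):
#   uname -r                                   # kernel currently running
#   ls /boot/vmlinuz-* | sed 's|^/boot/vmlinuz-||' | newest_version
# If the two differ, the newer kernel is installed but not yet booted.
```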
Got it, thanks! I’ll try it and see if it works.