Our system running RHEL 7 sometimes crashes when a user runs a job. Red Hat support said to contact you. I have tried several driver updates, but they have not fixed the issue. The RHEL kernel should never crash because of a user job. My theory is that there is a bug in the driver that is being triggered by something in the user program. All the jobs that exercise the GPU run fine. Red Hat support wants us to use the nouveau driver, because that is what they support. You support the NVIDIA driver, so we are contacting you. We can send the sosreport or vmcores if you need them.
Linux bhg0044 3.10.0-1160.6.1.el7.x86_64
GeForce RTX 2080 Ti
Driver Version: 460.67 CUDA Version: 11.2
Thanks,
Carl
What kind of “job”? Please run nvidia-bug-report.sh as root after a crash has happened and attach the resulting nvidia-bug-report.log.gz file to your post.
The machine crashed again. Here is the nvidia-bug-report.log.gz.
nvidia-bug-report.log.gz (4.2 MB)
Please create /etc/X11/xorg.conf
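# Bind the X server to the onboard ASPEED graphics via the modesetting driver,
# so X does not touch the NVIDIA GPUs (BusID taken from this system's ASPEED device)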
Section "Device"
Identifier "ASPEED"
Driver "modesetting"
BusID "PCI:4:0:0"
EndSection
then configure nvidia-persistenced to start on boot, make sure it is continuously running, and check whether that resolves the issue.
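A minimal sketch of how to do that, assuming the driver package installed the usual nvidia-persistenced systemd unit (the exact unit name may vary with how the driver was packaged):
# Enable the persistence daemon at boot and start it now
sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced
# Verify it is running and that persistence mode is reported as enabled
systemctl status nvidia-persistenced
nvidia-smi -q | grep "Persistence Mode"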
I implemented the changes you suggested. I added the lines to xorg.conf and set nvidia-persistenced to run at boot. The daemon was running, and the nvidia-smi command seemed to run faster, but the machine still crashed when running a job. The output of nvidia-bug-report.sh is attached. nvidia-bug-report.log.gz (1.9 MB)
7 out of 8 GPUs look fine now, but the last one has SW Power Cap active; that shouldn't happen while idle.
Besides that, no NVIDIA-related errors are visible in the logs. What kind of user jobs are crashing, and how, i.e. with what error messages?
I have not seen any error messages, but the system crashes or reboots. Red Hat support says that the problem is inside the NVIDIA driver and that we should switch to the RHEL-supplied open-source driver if we want them to fix it. So I am submitting this to NVIDIA, since you wrote the driver. I have another sosreport from today, after the latest crash, if you want to see it. A user job should NEVER crash the kernel and cause the machine to reboot.
Neither the kernel nor the driver ever causes a reboot on its own. The only reason for that is the PSU shutting down due to overload. Try limiting clocks using nvidia-smi -lgc to keep the GPUs from going into boost, thus preventing power spikes.
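As an illustration only (the clock values below are placeholders; check the range your GPUs actually support first):
# List the clock range supported by the GPUs
nvidia-smi -q -d SUPPORTED_CLOCKS
# Lock the GPU clocks to a fixed range, e.g. 300-1350 MHz (example values)
sudo nvidia-smi -lgc 300,1350
# Revert to default clock behaviour if needed
sudo nvidia-smi -rgc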
I have used ‘nvidia-smi -pl 100’ to limit the power consumption of the GPUs, mostly because I could see the results of this. It has helped, but I noticed the power cap cannot go below 100 W. Is this a limit in the hardware, or can a different driver or driver setting allow it to go lower? I was hoping to try 50 W or 75 W. I also noticed that a reboot sets the power cap back to its default. Can this be changed?
Thanks,
Carl
It’s a hardware limitation, which is also displayed by nvidia-smi -q.
As said, rather use -lgc to prevent power spikes; -pl is only effective for average power draw, in order to limit temperatures.
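For reference, the enforced and hardware power limits can be read directly from the driver, for example:
# Show default, enforced, min and max power limits per GPU
nvidia-smi -q -d POWER
# The "Min Power Limit" / "Max Power Limit" lines show the range accepted by -pl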