RTX 5000 GPU crash when running inference on Ubuntu 16.04

Hi, Nvidia team,
nvidia-bug-report.log.gz (258.5 KB)
When we run inference on an RTX 5000 under Ubuntu 16.04, the GPU crashes. Attached is the file generated by nvidia-bug-report.sh. Please help check it.

Thanks
Harry

The main causes may be:

A driver problem. Solution: update the driver.
The GPU is overheating. Solution: change the system fan speed to manual control and increase the speed.
The power supply is insufficient. This can only be solved by replacing the power supply with a more powerful one.
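
To help tell these three causes apart, it may be useful to snapshot the driver version, temperature, and power draw in a single query. The following is only a minimal Python sketch, not something posted in this thread: it assumes nvidia-smi is on the PATH, and the helper names (parse_query_line, power_headroom_watts, query_gpus) are hypothetical. The four query fields are standard nvidia-smi --query-gpu fields.

```python
import subprocess

# Standard nvidia-smi --query-gpu field names.
FIELDS = ["driver_version", "temperature.gpu", "power.draw", "power.limit"]


def parse_query_line(line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError("expected {} fields, got {}: {!r}".format(
            len(FIELDS), len(values), line))
    return dict(zip(FIELDS, values))


def power_headroom_watts(sample):
    """Return power.limit minus power.draw, or None if either is not numeric."""
    try:
        return float(sample["power.limit"]) - float(sample["power.draw"])
    except ValueError:
        # Unsupported fields are printed as text such as "[N/A]".
        return None


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU (needs the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    ).stdout
    return [parse_query_line(l) for l in out.splitlines() if l.strip()]
```

Calling query_gpus() in a loop while the inference job runs would show whether the temperature or the power headroom changes shortly before a crash. It does not measure whether the power supply itself is adequate; it only reports what the GPU sees.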

Haonan,

Thanks for your prompt reply.

You mentioned some solutions:
update the driver
increase the fan’s speed
use a more powerful power supply

I want to confirm: if we only update to the latest driver (460.32.03), will that resolve the issue? Or do we also need to increase the fan speed or use a more powerful power supply?

Thanks
Harry

Haonan,

When I used the nvidia-smi daemon to monitor the devices, the results were written to a file in binary format. How can I read the data from that binary file?

Thanks
Harry

Just found that “nvidia-smi replay” can replay the daemon file.

Hi Harry,

  1. Update to the newer GPU driver version: https://www.nvidia.com/Download/driverResults.aspx/168347/en-us
  2. Disable the nouveau module
  3. Enable persistence mode and configure it to start automatically
  4. After a GPU failure, use the ipmitool power reset command to restart the server, and observe whether the failure disappears or recurs
  5. Use nvidia-smi dmon >1.txt to record GPU frequency and temperature information
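
Once step 5 has produced a log, the text file can be scanned for the hottest sample. The following is a rough Python sketch under some assumptions: the column names are read from the "#" header line that dmon prints, the temperature column is assumed to be named either gtemp or temp depending on the driver version, and the function names parse_dmon and hottest_sample are hypothetical.

```python
def parse_dmon(lines):
    """Yield one dict per sample row, keyed by the column names in the '#' header."""
    columns = None
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0].startswith("#"):
            names = line.strip().lstrip("#").split()
            # The units row ("Idx  W  C ...") has no "gpu" token, so it is skipped.
            if "gpu" in names:
                columns = names
            continue
        if columns is None or len(tokens) != len(columns):
            continue  # skip rows that cannot be aligned with the header
        yield dict(zip(columns, tokens))


def hottest_sample(samples):
    """Return the sample with the highest GPU temperature, or None if none is readable."""
    best, best_temp = None, None
    for sample in samples:
        # Assumption: the temperature column is "gtemp" or, on older drivers, "temp".
        raw = sample.get("gtemp", sample.get("temp", "-"))
        if not raw.isdigit():
            continue  # "-" marks a value the GPU does not report
        if best_temp is None or int(raw) > best_temp:
            best, best_temp = sample, int(raw)
    return best
```

With the file from step 5, this could be used as hottest_sample(parse_dmon(open("1.txt"))). The returned row also carries the power and clock columns, so the values recorded around the peak can be read from the same dict.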

Thanks, Haonan.

Is there any side effect of enabling persistence mode?

And at what temperature will the GPU shut itself down?

Harry

Hi Harry

Is there any side effect of enabling persistence mode?
A: Persistence mode does not affect functionality.
And at what temperature will the GPU shut itself down?
A:
RTX 5000 GPU target temperature: 83°C
RTX 5000 GPU slowdown temperature: 93°C
RTX 5000 GPU shutdown temperature: 96°C
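
As an illustration, the three thresholds above can be turned into a small check for monitored temperatures. This is only a sketch: the function name classify_temperature is hypothetical, and treating a reading exactly at the slowdown or shutdown value as already in that state is an assumption. The thresholds the driver actually applies can be read with nvidia-smi -q -d TEMPERATURE.

```python
# Thresholds quoted in this reply for the RTX 5000, in degrees Celsius.
TARGET_C = 83
SLOWDOWN_C = 93
SHUTDOWN_C = 96


def classify_temperature(temp_c):
    """Map a GPU temperature reading to the threshold band it falls in."""
    # Assumption: the slowdown and shutdown thresholds are inclusive.
    if temp_c >= SHUTDOWN_C:
        return "shutdown"
    if temp_c >= SLOWDOWN_C:
        return "slowdown"
    if temp_c > TARGET_C:
        return "above target"
    return "normal"
```

For example, classify_temperature(70) returns "normal", since 70°C is below the 83°C target.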

Haonan,

We installed driver version 460.32.03 and ran the test again, and reproduced the issue. We kept monitoring the temperature, and it was only about 70°C, so overheating should not be the problem.

Attached is the file generated by nvidia-bug-report.sh: nvidia-bug-report_011120.log.gz (358.4 KB)

Thanks
Harry