NVIDIA driver lost when running with nvidia-driver 410 + CUDA 10 + TensorRT 5.0.2

Hi,
When I run inference in a Docker container based on nvcr.io/nvidia/tensorrt:19.02-py2, the driver is lost after several minutes of normal operation. After the driver is lost, running nvidia-smi gives:

Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU

The GPU is an RTX 2080 Ti.

The dmesg output is:

[ 98.639487] docker0: port 1(veth2d6c917) entered blocking state
[ 98.639492] docker0: port 1(veth2d6c917) entered forwarding state
[ 98.639568] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
[ 99.530962] eth0: renamed from veth4e784b8
[ 99.559491] IPv6: ADDRCONF(NETDEV_CHANGE): veth7768fff: link becomes ready
[ 99.559586] docker0: port 2(veth7768fff) entered blocking state
[ 99.559591] docker0: port 2(veth7768fff) entered forwarding state
[ 100.193772] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 239
[ 146.982594] docker0: port 3(veth45f1ef9) entered blocking state
[ 146.982595] docker0: port 3(veth45f1ef9) entered disabled state
[ 146.982625] device veth45f1ef9 entered promiscuous mode
[ 146.982684] IPv6: ADDRCONF(NETDEV_UP): veth45f1ef9: link is not ready
[ 150.361999] eth0: renamed from veth46afce8
[ 150.374412] IPv6: ADDRCONF(NETDEV_CHANGE): veth45f1ef9: link becomes ready
[ 150.374504] docker0: port 3(veth45f1ef9) entered blocking state
[ 150.374508] docker0: port 3(veth45f1ef9) entered forwarding state
[ 363.263945] ata1: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
[ 363.263948] ata1: irq_stat 0x00400040, connection status changed
[ 363.263950] ata1: SError: { HostInt PHYRdyChg 10B8B DevExch }
[ 363.263952] ata1: hard resetting link
[ 363.714451] NVRM: GPU at PCI:0000:01:00: GPU-fdd57c3d-31d6-31f9-9313-21b3ab2a8112
[ 363.714456] NVRM: GPU Board Serial Number:
[ 363.714458] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 363.714461] NVRM: GPU at 00000000:01:00.0 has fallen off the bus.
[ 363.714461] NVRM: GPU is on Board .
[ 363.714477] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ 366.530746] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 366.542075] ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[ 366.542076] ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
[ 366.542078] ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
[ 366.636809] ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[ 366.636811] ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
[ 366.636812] ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
[ 366.637330] ata1.00: configured for UDMA/100
[ 366.637332] ata1: EH complete

And the output of nvidia-bug-report.sh is attached below:

nvidia-bug-report.log.gz (1.01 MB)

It’s truncated, please delete the wall of text and attach it as a file. Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/
You’re running into Xid 79, which means overheating or an insufficient power supply.
You can log temperatures using

nvidia-smi -q -d TEMPERATURE -l 2 -f temp.log
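
If you want power draw in the same log, adding POWER to the query list should also work (this is just a variant of the command above):

nvidia-smi -q -d TEMPERATURE,POWER -l 2 -f temp.log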

I have uploaded the full report.
Here is more info. I have tested the 2080 Ti with gpu_burn (GitHub - wilicc/gpu-burn: Multi-GPU CUDA stress test), running it for 30 minutes, and no issue happened.
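
For reference, I invoked it roughly like this (gpu_burn takes the run duration in seconds, so 30 minutes is 1800):

./gpu_burn 1800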

With the 2080 Ti running my inference workload, the issue happens just several minutes after starting. The temperature stays under 80 °C according to nvidia-smi.

And I have swapped the 2080 Ti for another card of the same model but a different brand; the same issue happened.

nvidia-docker complains about too little shmem.
That aside, my guess would be a flawed/insufficient PSU. What brand/model is it?

It’s a GreatWall 600 W; I don’t remember the exact part number.
But since it can pass a gpu_burn + stress -c test, shouldn’t a PSU issue be unlikely?
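
The stress part was just CPU load run alongside gpu_burn, something like this (the worker count here is illustrative; I matched it to my core count):

stress -c 8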

And I have updated the docker create command with --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864, but it doesn’t solve the issue.
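
For reference, the full run command is now something like this (the image tag is the 19.02 one mentioned above; --runtime=nvidia assumes nvidia-docker2 is installed):

docker run --runtime=nvidia --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it --rm nvcr.io/nvidia/tensorrt:19.02-py2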

Running gpu_burn, you’ll only get a prolonged power draw of about 280 W, but the 2080 Ti will peak at about 400 W during boost. A good, flawless 600 W PSU should be sufficient, but a degraded/low-quality one is not.
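
Those boost peaks are short, so a 2-second query interval can miss them; logging power draw once per second in CSV form should make transient spikes easier to spot, e.g. something like:

nvidia-smi --query-gpu=timestamp,power.draw,temperature.gpu --format=csv -l 1 -f power.log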

You should also update your BIOS; it seems to be a very old version.