NVIDIA driver lost when running with nvidia-driver 410 + CUDA 10 + TensorRT 5.0.2

Hi,
When I run inference in a Docker container based on nvcr.io/nvidia/tensorrt:19.02-py2, the driver is lost after several minutes of normal operation. After the driver is lost, running nvidia-smi gives:

Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU

The GPU is an RTX 2080 Ti.

The dmesg output is:

[ 98.639487] docker0: port 1(veth2d6c917) entered blocking state
[ 98.639492] docker0: port 1(veth2d6c917) entered forwarding state
[ 98.639568] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
[ 99.530962] eth0: renamed from veth4e784b8
[ 99.559491] IPv6: ADDRCONF(NETDEV_CHANGE): veth7768fff: link becomes ready
[ 99.559586] docker0: port 2(veth7768fff) entered blocking state
[ 99.559591] docker0: port 2(veth7768fff) entered forwarding state
[ 100.193772] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 239
[ 146.982594] docker0: port 3(veth45f1ef9) entered blocking state
[ 146.982595] docker0: port 3(veth45f1ef9) entered disabled state
[ 146.982625] device veth45f1ef9 entered promiscuous mode
[ 146.982684] IPv6: ADDRCONF(NETDEV_UP): veth45f1ef9: link is not ready
[ 150.361999] eth0: renamed from veth46afce8
[ 150.374412] IPv6: ADDRCONF(NETDEV_CHANGE): veth45f1ef9: link becomes ready
[ 150.374504] docker0: port 3(veth45f1ef9) entered blocking state
[ 150.374508] docker0: port 3(veth45f1ef9) entered forwarding state
[ 363.263945] ata1: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
[ 363.263948] ata1: irq_stat 0x00400040, connection status changed
[ 363.263950] ata1: SError: { HostInt PHYRdyChg 10B8B DevExch }
[ 363.263952] ata1: hard resetting link
[ 363.714451] NVRM: GPU at PCI:0000:01:00: GPU-fdd57c3d-31d6-31f9-9313-21b3ab2a8112
[ 363.714456] NVRM: GPU Board Serial Number:
[ 363.714458] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 363.714461] NVRM: GPU at 00000000:01:00.0 has fallen off the bus.
[ 363.714461] NVRM: GPU is on Board .
[ 363.714477] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ 366.530746] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 366.542075] ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[ 366.542076] ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
[ 366.542078] ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
[ 366.636809] ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[ 366.636811] ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
[ 366.636812] ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
[ 366.637330] ata1.00: configured for UDMA/100
[ 366.637332] ata1: EH complete

And the output of nvidia-bug-report.sh is attached below:

nvidia-bug-report.log.gz (1.01 MB)

It’s truncated, please delete the wall of text and attach it as a file. Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/
You’re running into Xid 79, which means overheating or an insufficient power supply.
You can log temperatures using

nvidia-smi -q -d TEMPERATURE -l 2 -f temp.log
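
If you want power draw in the same log, adding POWER to the query list should also work (this is just a variant of the command above):

nvidia-smi -q -d TEMPERATURE,POWER -l 2 -f temp.log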

I have uploaded the full report.
Here is more info. I have tested the 2080 Ti with gpu_burn (GitHub - wilicc/gpu-burn: Multi-GPU CUDA stress test), running it for 30 minutes, and no issue happened.
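
For reference, I invoked it roughly like this (gpu_burn takes the run duration in seconds, so 30 minutes is 1800):

./gpu_burn 1800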

With the 2080 Ti running my inference workload, the issue happens just several minutes after starting. The temperature stays under 80 °C according to nvidia-smi.

And I have swapped the 2080 Ti for another card of the same model but a different brand; the same issue happened.

nvidia-docker complains about too little shmem.
That aside, my guess would be a flawed/insufficient PSU. What brand/model is it?

It’s a GreatWall 600 W; I don’t remember the exact part number.
But since it can pass a gpu_burn + stress -c test, shouldn’t a PSU issue be unlikely?
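
The stress part was just CPU load run alongside gpu_burn, something like this (the worker count here is illustrative; I matched it to my core count):

stress -c 8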

And I have updated the docker create command with --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864, but it doesn’t solve the issue.
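
For reference, the full run command is now something like this (the image tag is the 19.02 one mentioned above; --runtime=nvidia assumes nvidia-docker2 is installed):

docker run --runtime=nvidia --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it --rm nvcr.io/nvidia/tensorrt:19.02-py2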

Running gpu_burn, you’ll only get a prolonged power draw of about 280 W, but the 2080 Ti will peak at about 400 W during boost. A good, flawless 600 W PSU should be sufficient, but a degraded/low-quality one is not.
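
Those boost peaks are short, so a 2-second query interval can miss them; logging power draw once per second in CSV form should make transient spikes easier to spot, e.g. something like:

nvidia-smi --query-gpu=timestamp,power.draw,temperature.gpu --format=csv -l 1 -f power.log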

You should also update your BIOS; it seems to be a very old version.