Hi,
When I run inference in a Docker container based on nvcr.io/nvidia/tensorrt:19.02-py2, the driver loses the GPU after several minutes of normal operation. After that, running nvidia-smi prints:
Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU
The GPU is an NVIDIA RTX 2080 Ti.
The dmesg output is:
[ 98.639487] docker0: port 1(veth2d6c917) entered blocking state
[ 98.639492] docker0: port 1(veth2d6c917) entered forwarding state
[ 98.639568] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
[ 99.530962] eth0: renamed from veth4e784b8
[ 99.559491] IPv6: ADDRCONF(NETDEV_CHANGE): veth7768fff: link becomes ready
[ 99.559586] docker0: port 2(veth7768fff) entered blocking state
[ 99.559591] docker0: port 2(veth7768fff) entered forwarding state
[ 100.193772] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 239
[ 146.982594] docker0: port 3(veth45f1ef9) entered blocking state
[ 146.982595] docker0: port 3(veth45f1ef9) entered disabled state
[ 146.982625] device veth45f1ef9 entered promiscuous mode
[ 146.982684] IPv6: ADDRCONF(NETDEV_UP): veth45f1ef9: link is not ready
[ 150.361999] eth0: renamed from veth46afce8
[ 150.374412] IPv6: ADDRCONF(NETDEV_CHANGE): veth45f1ef9: link becomes ready
[ 150.374504] docker0: port 3(veth45f1ef9) entered blocking state
[ 150.374508] docker0: port 3(veth45f1ef9) entered forwarding state
[ 363.263945] ata1: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
[ 363.263948] ata1: irq_stat 0x00400040, connection status changed
[ 363.263950] ata1: SError: { HostInt PHYRdyChg 10B8B DevExch }
[ 363.263952] ata1: hard resetting link
[ 363.714451] NVRM: GPU at PCI:0000:01:00: GPU-fdd57c3d-31d6-31f9-9313-21b3ab2a8112
[ 363.714456] NVRM: GPU Board Serial Number:
[ 363.714458] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 363.714461] NVRM: GPU at 00000000:01:00.0 has fallen off the bus.
[ 363.714461] NVRM: GPU is on Board .
[ 363.714477] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ 366.530746] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 366.542075] ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[ 366.542076] ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
[ 366.542078] ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
[ 366.636809] ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[ 366.636811] ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
[ 366.636812] ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
[ 366.637330] ata1.00: configured for UDMA/100
[ 366.637332] ata1: EH complete
The output of nvidia-bug-report.sh is attached:
nvidia-bug-report.log.gz (1.01 MB)
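
In case it helps anyone trying to reproduce this, here is a minimal sketch of how the Xid events could be pulled out of dmesg text like the output above. It is not part of the original setup: `find_xid_events` and `SAMPLE` are hypothetical names, and the regular expression only assumes the line format seen in this log.

```python
import re

# Assumed line format, taken from the dmesg output in this post:
# [ 363.714458] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid "
    r"\((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<msg>.*)"
)

# Two lines copied from the log above, used as sample input.
SAMPLE = """\
[ 363.263952] ata1: hard resetting link
[ 363.714458] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
"""


def find_xid_events(dmesg_text):
    """Return (timestamp, pci_address, xid_code, message) for each Xid line."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (
                    float(match.group("ts")),
                    match.group("pci"),
                    int(match.group("xid")),
                    match.group("msg").strip(),
                )
            )
    return events


if __name__ == "__main__":
    for ts, pci, xid, msg in find_xid_events(SAMPLE):
        print(f"t={ts}s device={pci} xid={xid}: {msg}")
```

To run it on a live system, the saved output of `dmesg` could be passed in place of `SAMPLE`.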