Hi,
I have a long-running, GPU-enabled instance on AWS EC2. It works properly at first, but after a while it loses access to the GPU and I don’t know why.
The physical GPU card is still present on the instance. I verified this with sudo lshw -C display
on both 1) the problematic instance and 2) a new one created with the same configuration.
A reboot fixes the problem. However, I’d like to avoid having to do that and find the root cause instead.
I am attaching a report obtained with sudo nvidia-bug-report.sh
along with some additional information on the instance configuration. The log is very long, apologies in advance.
Information:
AWS Instance Image: Ubuntu 20.04 LTS, SSD Volume Type (ami-0261755bbcb8c4a84)
GPU Card: TU104GL [Tesla T4]
Driver Version: NVIDIA-Linux-x86_64-418.226.00.run.1
dkms status nvidia output:
nvidia, 515.48.07, 5.15.0-1039-aws, x86_64: installed
nvidia, 515.48.07, 5.15.0-1047-aws, x86_64: installed
sudo lshw -C display output:
*-display:1
description: 3D controller
product: TU104GL [Tesla T4]
vendor: NVIDIA Corporation
physical id: 1e
bus info: pci@0000:00:1e.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm pciexpress msix bus_master cap_list
configuration: driver=nvidia latency=0
resources: iomemory:80-7f iomemory:80-7f irq:10 memory:fd000000-fdffffff memory:840000000-84fffffff memory:850000000-851ffffff
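In case it helps anyone repeat the dkms check, here is a minimal Python sketch that parses dkms status lines in the format pasted above and reports whether the running kernel has an nvidia module built for it. The function names (parse_dkms_status, kernels_with_module) are made up for illustration. It is only my assumption that a kernel/module mismatch is worth ruling out. I have not confirmed that it is related to the GPU loss.

```python
import platform


def parse_dkms_status(text):
    """Parse lines like 'nvidia, 515.48.07, 5.15.0-1039-aws, x86_64: installed'.

    Assumes the comma-separated format pasted in this post; other dkms
    versions may print a different layout.
    Returns a list of (module, version, kernel, state) tuples.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        left, state = line.rsplit(":", 1)
        fields = [f.strip() for f in left.split(",")]
        if len(fields) < 3:
            continue
        entries.append((fields[0], fields[1], fields[2], state.strip()))
    return entries


def kernels_with_module(entries, module="nvidia"):
    """Return the set of kernel releases that have `module` installed."""
    return {
        kernel
        for mod, _version, kernel, state in entries
        if mod == module and state == "installed"
    }


# The dkms output from this post, copied verbatim.
DKMS_OUTPUT = """\
nvidia, 515.48.07, 5.15.0-1039-aws, x86_64: installed
nvidia, 515.48.07, 5.15.0-1047-aws, x86_64: installed
"""

if __name__ == "__main__":
    running = platform.release()  # same value as `uname -r`
    built = kernels_with_module(parse_dkms_status(DKMS_OUTPUT))
    if running in built:
        print(f"nvidia module is built for the running kernel {running}")
    else:
        print(f"no nvidia module listed for the running kernel {running}")
```

On the instance itself, the pasted string would be replaced by the live output of dkms status, for example captured with subprocess.run.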
nvidia-bug-report.log.gz (24.8 MB)
Thank you very much!
PS. I couldn’t find the appropriate place for this discussion, so I posted it in the closest category I could find. Please let me know if I should move it somewhere else.