Hi, I’ve acquired access to an AWS machine with a Tesla T4 GPU for machine learning. After installing the drivers needed for TensorFlow, I’ve run into the following issue when trying to run the TensorFlow GPU-ready Docker image:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: cuda error: no cuda-capable device is detected\\n\\"\"": unknown.
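For reference, the failing command is roughly the following (image tag from memory, so treat it as approximate):

$ docker run --rm --gpus all tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"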
The installed NVIDIA driver is version 418, and the Docker version on the server is 19.03.4.
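In case it matters, both versions can be checked with standard commands even while nvidia-smi is failing (modinfo reads the module on disk, so it works even when the device itself won’t initialize):

$ modinfo nvidia | grep ^version
$ docker --version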
Running the nvidia-smi command yields:
$ nvidia-smi
No devices were found
There was no error while downloading or installing the drivers, and the GPU does show up in the lspci output. I’ve tried many of the solutions suggested on these forums by people with similar problems, with no results.
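The lspci check looks roughly like this (the bus address matches the one in the kernel log below; the exact device string is from memory):

$ lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)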
I also ran nvidia-bug-report.sh (log attached to this post), and from what I can see the relevant error is:
Oct 22 17:35:21 kernel: NVRM: GPU 0000:00:1e.0: RmInitAdapter failed! (0x26:0xffff:1155)
which could point to a hardware issue? Is that even possible on a freshly provisioned AWS machine?
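The same NVRM error shows up directly in the kernel log, in case anyone wants to check it without digging through the full bug report (timestamp prefix abbreviated):

$ dmesg | grep -i nvrm
[ ... ] NVRM: GPU 0000:00:1e.0: RmInitAdapter failed! (0x26:0xffff:1155)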
nvidia-bug-report.log.gz (508 KB)