Docker TensorFlow GPU image can't find the device, and nvidia-smi reports "No devices were found"

Hi, I’ve been given access to an AWS machine with a Tesla T4 GPU for machine learning. After installing the drivers required by TensorFlow, I ran into the following issue when trying to launch the GPU-ready TensorFlow Docker image:

docker: Error response from daemon: OCI runtime create failed: 
container_linux.go:346: starting container process caused "process_linux.go:449: container init caused
\"process_linux.go:432: running prestart hook 0 caused 
\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: cuda error: no cuda-capable device is detected\\n\\"\"": unknown.

The installed NVIDIA driver is version 418, and the Docker version on the server is 19.03.4.
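For reference, the standard way to test GPU access from a container with Docker 19.03 is something along these lines (the CUDA image tag is only an example, not the exact image I used):

$ docker run --gpus all --rm nvidia/cuda:10.0-base nvidia-smi

On a healthy setup this prints the same table as nvidia-smi on the host.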

Running the nvidia-smi command yields

$ nvidia-smi
No devices were found

There were no errors while downloading or installing the drivers, and the GPU shows up in lspci. I’ve tried many of the solutions suggested on these forums by people with similar problems, with no results.
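By the lspci check I mean simply:

$ lspci | grep -i nvidia

and the Tesla T4 is listed in its output.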

I ran nvidia-bug-report.sh (log attached to this post), and from what I can see the relevant error is

Oct 22 17:35:21 kernel: NVRM: GPU 0000:00:1e.0: RmInitAdapter failed! (0x26:0xffff:1155)

Could this be a hardware issue? Is that even possible on a freshly provisioned AWS machine?

Thank you
nvidia-bug-report.log.gz (508 KB)

Solved the issue. Somehow the headers and the default driver configuration on the AWS machine were faulty or incompatible, so I had to do a complete, clean reinstall.

Here are the steps I took:

  1. Completely purge everything NVIDIA and CUDA related.

    To list all your NVIDIA packages:

    dpkg -l | grep -i nvidia

    To list all your CUDA packages:

    dpkg -l | grep -i cuda

    To purge all NVIDIA and CUDA packages:

    sudo apt-get remove --purge '^nvidia-.*'
    sudo apt-get remove --purge '^cuda.*'

    IMPORTANT: if you are in a desktop environment you also need to reinstall nvidia-common and ubuntu-desktop, and reset nouveau, since those are necessary to drive your monitor and log in through the UI. On a headless machine this is not required.
    For more details, see https://askubuntu.com/questions/206283/how-can-i-uninstall-a-nvidia-driver-completely

  2. Reinstall the NVIDIA drivers from the graphics-drivers PPA:

    sudo add-apt-repository ppa:graphics-drivers/ppa
    sudo apt update
    sudo apt upgrade
    sudo apt install nvidia-driver-<VERSION THAT YOU NEED>
    sudo reboot

    In my case, 418 was the version I needed for my GPU. That said, the drivers PPA generally knows which version fits your hardware, and the version number nvidia-smi reports after installation may differ from the one you asked for. (A quick sanity check after the reboot is shown below.)
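A quick sanity check after the reboot (not part of my original notes, just the standard commands) is to confirm the driver loads and the GPU is visible again:

$ nvidia-smi
$ dpkg -l | grep -i nvidia-driver

This time nvidia-smi should print the usual table listing the Tesla T4 instead of "No devices were found".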

After that, I reinstalled all the tools I needed for work and everything ran perfectly.
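For anyone hitting the same problem, a quick way to confirm the container side afterwards is something along these lines (the image tag and test snippet are just an example, not the exact command I used):

$ docker run --gpus all --rm tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"

It should print True once the driver and the NVIDIA container runtime are working.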