cuda:10.0-cudnn7-devel-ubuntu16.04 - Facing issues

HOST Environment:
Host OS: Ubuntu 16.04
Docker image: cuda:10.0-cudnn7-devel-ubuntu16.04
Total GPUs: 4 × Tesla V100 (16.2 GB GPU memory each)
CUDA: 10.2
Driver: 440.65
tensorflow-gpu: 1.15
keras: 2.1.3

I am running a container from the Docker image cuda:10.0-cudnn7-devel-ubuntu16.04 on a third-party cloud host.
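
For reference, the container is started roughly as sketched below (using the full Docker Hub image name nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04); which GPU flag applies depends on the Docker and NVIDIA runtime versions installed on the cloud host, so both variants are listed here as assumptions:

    # Docker 19.03+ with the NVIDIA Container Toolkit installed on the host:
    docker run --gpus all -it nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04 bash

    # Older Docker with the nvidia-docker2 runtime:
    docker run --runtime=nvidia -it nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04 bash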

When I invoke nvidia-smi inside the container, the output cycles through three states (a screenshot was attached for each):

First, utilization on all four GPUs is above 90% for a few seconds.

Then utilization on all GPUs drops to 0% for a few seconds.

Then the utilization percentage fluctuates randomly across the GPUs for a few seconds.

I have noticed that training performance is poor and takes much longer in this Docker container. The same code works fine with 2 × Tesla K80 GPUs and CUDA 10.0 on a dedicated server, as shown in the attached screenshot.


But in the Docker container, why does the GPU utilization keep changing so erratically?

Questions:

Question 1: nvidia-smi reports CUDA 10.2 with driver version 440.65.00 instead of CUDA 10.0. Why?

Question 2: No processes are listed even though GPU utilization is above 90% on all 4 GPUs. Why?

Question 3: Why does GPU utilization keep changing so erratically?

Question 4: My code only works with tensorflow-gpu==1.15.0 and keras==2.1.3, so I cannot change to a CUDA 10.2 based Ubuntu image. How can I solve this issue with a container based on the Docker image cuda:10.0-cudnn7-devel-ubuntu16.04?

Question 5: Can I explicitly install CUDA 10.0 inside the container running the Docker image cuda:10.0-cudnn7-devel-ubuntu16.04?

Question 6: Can I explicitly install a driver version in the range >= 384.111 and < 385.00 inside the container running the Docker image cuda:10.0-cudnn7-devel-ubuntu16.04?

Hi there,

I’m providing some answers here:

  1. The CUDA version printed in the nvidia-smi output is the maximum CUDA version supported by the driver. In this case, R440 was released alongside CUDA 10.2, and R440 is backward compatible with all lower CUDA releases. (The first sketch after this list shows how to confirm the toolkit version actually installed in your container.)
  2. What were you running when you observed the 90%+ utilization output in nvidia-smi? The utilization figure reported by nvidia-smi is not cycle accurate; sampling it over time (see the sketch after this list) gives a better picture than one-off snapshots.
  3. This is related to Q2.
  4. I’m not sure what you mean; can you please rephrase? As I mentioned, R440 will support all older CUDA versions, including CUDA 10.0. (The second sketch after this list is a quick way to confirm that your pinned TensorFlow 1.15 build sees the GPUs.)
  5. Yes, you can, but what is the use case and what do you want to achieve?
  6. No, please don’t install drivers inside containers. This makes the container non-portable and defeats the purpose of containerization in the first place. Why do you want to do this?
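
A minimal command sketch for checking the points in answers 1–3 from inside the container (standard nvidia-smi and CUDA toolkit commands; the version.txt path assumes the stock layout of the CUDA 10.0 image):

    # Driver-side view: the "CUDA Version" field here is the highest CUDA release the R440 driver
    # supports (10.2), not the toolkit installed in the container.
    nvidia-smi

    # Toolkit-side view inside the container: this is what your code compiles and links against
    # (it should report 10.0 for this image).
    nvcc --version
    cat /usr/local/cuda/version.txt

    # Sample utilization and memory once per second instead of taking one-off snapshots.
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1

    # Per-process view; processes started in other containers or PID namespaces may not show up here.
    nvidia-smi pmon -c 10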
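
For answer 4, a quick sanity check that the pinned tensorflow-gpu==1.15.0 build finds the GPUs under the R440 driver (a sketch using standard TF 1.x test helpers; run it inside the container):

    # Prints the TF version, whether it was built with CUDA, and the name of a visible GPU device
    # (an empty string means no GPU was found).
    python -c "import tensorflow as tf; print(tf.version.VERSION, tf.test.is_built_with_cuda()); print(tf.test.gpu_device_name())"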