I have benchmarked the following two Nvidia environments. In both environments, I used the GPU memory-growth setting to adapt GPU memory allocation to DNN training:
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
CUDA Driver 440.100/CUDA Toolkit 10.2/cuDNN 7.6.5
In this environment, the GPU performance state stays stuck at P8 for typical DNN sample models such as LeNet, AlexNet, and Inception v3, and training is very slow: a typical AlexNet run needs 270 minutes. Over such a long training run, the GPU temperature climbs from 33 to 60+ degrees Celsius (with air conditioning) or to around 80 degrees Celsius (without air conditioning).
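One way to watch whether the GPU is stuck at P8 during training is to poll nvidia-smi from Python. This is a minimal sketch under my own naming (the helper functions are mine, not from the original post); it assumes nvidia-smi is on the PATH:

```python
import subprocess

def parse_pstate_temp(csv_line):
    # Parse one line of `nvidia-smi --query-gpu=pstate,temperature.gpu
    # --format=csv,noheader` output, e.g. "P8, 33".
    pstate, temp = [field.strip() for field in csv_line.split(",")]
    return pstate, int(temp)

def query_gpu_state(gpu_index=0):
    # Query the live performance state and temperature of one GPU.
    out = subprocess.check_output(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=pstate,temperature.gpu",
         "--format=csv,noheader"],
        text=True,
    )
    return parse_pstate_temp(out.strip())
```

Calling `query_gpu_state()` periodically while `model.fit()` runs in another process makes the P8-versus-P2 difference between the two drivers directly visible.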
The older CUDA Driver 440.33 (originally released for Nvidia Tesla GPUs) behaves similarly: the training duration is the same. Sometimes, CUDA Driver 440.33 also shows a quite odd incompatibility with CUDA Toolkit 10.2.
CUDA Driver 450.57/CUDA Toolkit 10.2/cuDNN 8.0.1
In the newer environment, the same training takes 22 minutes, more than 10 times faster than with CUDA Driver 440.100. The GPU performance state flexibly shifts from P8 up to P2 while a DNN training task is running, and the temperature grows from 33 to 60+ degrees Celsius with air conditioning. It is just like a hungry cat meeting a poor mouse: an order-of-magnitude speedup. I have learned that CUDA 450.57 introduces MIG (Multi-Instance GPU). Given the much faster training speed, I guess that the latest driver version has a far more flexible capability.
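The 270-minute versus 22-minute comparison can be reproduced with a simple wall-clock wrapper around the training call. A minimal sketch (the helper name is mine, not from the post):

```python
import time

def timed_minutes(train_fn, *args, **kwargs):
    # Run a training callable and report its wall-clock duration in minutes,
    # so runs under different CUDA drivers can be compared directly.
    start = time.perf_counter()
    result = train_fn(*args, **kwargs)
    elapsed_min = (time.perf_counter() - start) / 60.0
    return result, elapsed_min
```

Usage would be, for example, `history, minutes = timed_minutes(model.fit, x_train, y_train, epochs=50)` on each driver version.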
However, the newer environment has a CUPTI permissions issue, so I need to run the sample AlexNet with the following command.
Stand-alone Start in the Ubuntu Terminal:
$ python dnn.py --cap-add=CAP_SYS_ADMIN
Start in Jupyter Notebook:
Insert the following lines of code into the last cell of the Jupyter Notebook to release the GPU process once training completes. Otherwise, it reliably triggers the NVRM Xid 61 / GPU FAN ERR & Pwr: Usage/CAP ERR issue.
from numba import cuda
cuda.select_device(0)
cuda.close()
I can observe the large difference between the two CUDA environments because I have several Nvidia GPUs, so I have migrated all my systems to the latest CUDA Driver 450.57. These are my observations; I hope they are useful for your own thinking.