I recently purchased a GTX 1660 Super graphics card and installed it in my Ubuntu 18.04 Linux system, but I am not able to use it for my deep learning programs. I am currently using Anaconda Jupyter Notebook with Python 3.6, Keras 2.3.1, tensorflow 2.0, tensorflow-gpu 2.0, cuDNN 7.6.4, cudatoolkit 10.0.130 and NVIDIA driver 410.
With the above drivers and packages I am not able to run my code. The error I get is:
“Failed to get convolution algorithm. This is probably because cuDNN failed to initialize” and sometimes also “out of memory”.
Please let me know how I can solve this issue, specifically with respect to Anaconda Navigator.
As a quick smoke test, can you check that the nvidia-smi command works in the terminal?
If that works with no errors, then perhaps you’re running out of memory during your application. You can run watch -n 0.1 nvidia-smi in a separate shell while your app is running to see whether the memory usage approaches the maximum before the error.
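If you would rather log the numbers than watch them scroll by, the same check can be sketched in Python. This is only an illustration: the helper names parse_memory_csv and query_gpu_memory are made up for this example, and it assumes nvidia-smi is on the PATH. The query flags themselves are standard nvidia-smi options.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' CSV lines (values in MiB) into (used, total) int pairs."""
    readings = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage; needs a working NVIDIA driver."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,
    )
    return parse_memory_csv(output)
```

Calling query_gpu_memory() in a loop with a short time.sleep() while training runs gives one (used, total) pair per GPU, which makes it easier to see exactly when the usage jumps.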
I ran the command in a separate shell. The memory fills up to 6 GB and the error appears at the same moment. It is a sudden jump from 65 MB to 6 GB. How can this happen? It seems that with only 100 MB of data the memory is full.
I can’t say for sure without knowing what code you’re running, but the sudden jump I would assume is loading some dataset or model into memory, which is larger than your available 6GB and hence the OOM error. I may be able to help more if you share your scripts, but in general this doesn’t seem like a bug, just seems like your GPU doesn’t have enough memory for the task you’re trying to accomplish.
The dataset is small, around 163 MB. Please find below the code I am trying to run with the package versions mentioned above. It works on the CPU but fails when run on the GPU.
import keras  # needed for keras.utils.to_categorical below
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D
import numpy as np
# batch, classes, epochs
batch_size = 32
num_classes = 10
epochs = 50
# The data, split between train and test sets:
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# Convert class vectors to binary class matrices.
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
# model architecture
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same',
                 input_shape=x_train.shape[1:]))
model.add(Conv2D(32, (3, 3)))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Conv2D(64, (3, 3)))
# compile the model
# convert to float, normalise the data
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
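For reference, here is a rough estimate of how much memory the image arrays in the script above take before and after the float32 conversion. The shapes are an assumption: they are the standard CIFAR-10 split (50,000 training and 10,000 test images of 32x32 pixels with 3 channels), not values read from the script's output.

```python
# Assumed shapes: the standard CIFAR-10 split, not taken from the script's output.
TRAIN_SHAPE = (50000, 32, 32, 3)
TEST_SHAPE = (10000, 32, 32, 3)


def array_bytes(shape, bytes_per_element):
    """Size in bytes of a dense array with the given shape and element width."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# uint8 is 1 byte per value (as loaded); float32 is 4 bytes (after astype).
uint8_total = array_bytes(TRAIN_SHAPE, 1) + array_bytes(TEST_SHAPE, 1)
float32_total = array_bytes(TRAIN_SHAPE, 4) + array_bytes(TEST_SHAPE, 4)

print("as uint8:   %d bytes" % uint8_total)
print("as float32: %d bytes" % float32_total)
```

Even fully converted to float32, the images come to well under 1 GB, so the dataset size alone does not account for 6 GB of GPU memory.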
I just ran your code and confirmed the model only uses about 1 GB of GPU memory. TensorFlow by default allocates almost all of the GPU memory right at the start. If other processes are already using some GPU memory, that might make it run out.
You can set the config to dynamically grow GPU memory as needed, and this way you shouldn’t run out unless the model actually requires more than you have.
Try adding this code snippet at the top of your script.
For TF1, using NGC container “nvcr.io/nvidia/tensorflow:19.10-py3”:
root@3efd20740a2a:/mnt# python -m pip freeze | grep -i -e tensorflow -e keras
import tensorflow as tf
gpu_options = tf.GPUOptions(allow_growth=True)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
Source: Allowing GPU memory growth command does not work · Issue #11584 · keras-team/keras · GitHub
For TF2, using NGC container “nvcr.io/nvidia/tensorflow:19.11-tf2-py3”:
root@c03f96d089ad:/mnt# python -m pip freeze | grep -i -e tensorflow -e keras
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession
config = ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.2
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
Source: Tensorflow v2 Limit GPU Memory usage · Issue #25138 · tensorflow/tensorflow · GitHub
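As an alternative to the compat.v1 session config, TF 2.x also exposes memory growth through tf.config.experimental. The sketch below is untested on a 1660 Super; the wrapper name enable_gpu_memory_growth is made up for this example, and the ImportError fallback is only there so the snippet also runs in an environment without TensorFlow.

```python
def enable_gpu_memory_growth():
    """Turn on memory growth for every visible GPU; return how many were set."""
    try:
        import tensorflow as tf
    except ImportError:
        return 0  # TensorFlow is not installed in this environment

    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must be called before any op initializes the GPU,
        # otherwise TensorFlow raises a RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


configured = enable_gpu_memory_growth()
print("GPUs with memory growth enabled:", configured)
```

If this prints 0 on a machine that has a GPU, TensorFlow is not seeing the device at all, which points to a driver/CUDA/cuDNN setup problem rather than a memory problem.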
Can you please let me know which NVIDIA driver, cuDNN and CUDA versions I should use for the 1660 Super with the TF1 or TF2 Docker containers? I currently have the latest NVIDIA driver 440.3, cuDNN 7.6.4 and cudatoolkit 10.0.130 installed, and I think these might be conflicting.
I don’t believe it will conflict, though I can’t say for sure. Can you share the commands you’re running and the corresponding full errors you’re getting?
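To make the driver question a bit more concrete, here is a small sketch that compares a driver version against the minimum Linux driver listed for each CUDA toolkit. The table values are as I recall them from NVIDIA's CUDA release notes, so please double-check them there; the names MIN_LINUX_DRIVER, version_tuple and driver_supports are made up for this example.

```python
# Minimum Linux driver per CUDA toolkit, as recalled from NVIDIA's CUDA
# release notes (an assumption to verify against the official table).
MIN_LINUX_DRIVER = {
    "10.0": "410.48",
    "10.1": "418.39",
    "10.2": "440.33",
}


def version_tuple(version):
    """Turn '440.33.01' into (440, 33, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(driver_version, cuda_version):
    """True if the driver is at least the minimum listed for that CUDA toolkit."""
    minimum = MIN_LINUX_DRIVER[cuda_version]
    return version_tuple(driver_version) >= version_tuple(minimum)


print(driver_supports("440.33.01", "10.0"))
print(driver_supports("410.48", "10.2"))
```

A driver newer than the listed minimum is generally fine for an older toolkit, so a 440-series driver with CUDA 10.0 should not be a conflict by itself.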
I did a fresh installation as described below:
I downloaded the available driver for the 1660 Super from the NVIDIA website and installed it as per the instructions.
Then I installed CUDA 10.2 following the instructions on the download page.
I copied the cuDNN 7.6.4 files into the lib folder as described on the NVIDIA cuDNN page.
After this installation, I created a new environment in Anaconda Navigator with Python 3.6 and tensorflow-gpu 2.0.
I tried running the code, importing Keras from TensorFlow as described in the tensorflow-gpu 2.0 user guide.
But I still hit the same issue when running model.fit: “Failed to get convolution algorithm. This is probably because cuDNN failed to initialize”
I also tried the suggested snippet at the start of my script, but got the same error.
Please let me know if I am doing something wrong here or should follow some other procedure. Also let me know whether the 1660 Super really supports CUDA/cuDNN/TensorFlow.
I used the NGC container nvcr.io/nvidia/tensorflow:19.11-tf2-py3.
I ran the same code and got exactly the same error:
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node sequential/conv2d/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_distributed_function_1075]
Function call stack:
root@980a8ddaa84e:/mnt# python -m pip freeze | grep -i -e tensorflow -e keras
Mon Dec 16 16:46:47 2019
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| 0 GeForce GTX 166… On | 00000000:01:00.0 On | N/A |
| 0% 45C P8 11W / 125W | 424MiB / 5941MiB | 1% Default |
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
Hi, thanks for the help.
It finally worked with the steps below:
- removed all the nvidia drivers : sudo apt-get remove --purge nvidia*
- removed all the cuda versions : sudo apt-get remove --purge cuda*
- manually deleted the cuda folders from : /usr/local
- pc reboot
- downloaded driver version 440.44 from NVIDIA : Linux x64 (AMD64/EM64T) Display Driver, Linux 64-bit
- installed the driver, did system restart.
- Tested driver with : nvidia-smi
- Added the repositories :
sudo dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo apt-get update
sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt-get update
- Then installed cuda with command :
sudo apt-get install --no-install-recommends cuda-10-0
- downloaded the cuDNN archive and samples : cudnn-10.0-linux-x64-v184.108.40.206.tgz, libcudnn7-doc_220.127.116.11-1+cuda10.0_amd64
- copied the files using the instructions provided : Installation Guide :: NVIDIA Deep Learning cuDNN Documentation
- added the following path in bashrc :
- tested cuDNN
- the test passed successfully
- downloaded the container : 19.11-tf1-py3
- ran the container
- Below are the packages installed :
- ran the same code.
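For anyone repeating these steps, one quick way to confirm which cuDNN build the loader actually picks up is to call cudnnGetVersion through ctypes. This is a minimal sketch, not part of the steps above: the function names are made up for this example, and the default library name libcudnn.so.7 assumes a cuDNN 7 install that is visible on the loader path.

```python
import ctypes


def decode_cudnn_version(number):
    """cuDNN 7 encodes its version as major*1000 + minor*100 + patch."""
    return (number // 1000, (number % 1000) // 100, number % 100)


def cudnn_version(library_name="libcudnn.so.7"):
    """Return (major, minor, patch) for the loadable cuDNN, or None if not found."""
    try:
        lib = ctypes.CDLL(library_name)
    except OSError:
        return None  # library is not on the loader path
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return decode_cudnn_version(lib.cudnnGetVersion())
```

A result of None usually means the cuDNN files were not copied to a directory the loader searches, or that LD_LIBRARY_PATH is not set in the shell that runs Python.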