After updating from CUDA 8.0 and cuDNN 6.0 to CUDA 9.0 and cuDNN 7.0, and updating the driver to the current version 384.111, I can no longer run scripts that use a convolution.
Scripts using RNNs or fully connected models work fine with both PyTorch (v0.3) and TensorFlow (v1.5); however, if the model contains a convolutional layer, the script fails in either framework.
Using PyTorch the error is:
With TensorFlow the error is:
2018-01-28 15:07:26.985395: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2018-01-28 15:07:26.985424: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2018-01-28 15:07:26.985432: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms)
Aborted (core dumped)
Furthermore, if the PyTorch script with convolutional layers is run as superuser via “sudo $(which python) script.py”, it works as expected. This workaround doesn’t work for TensorFlow because some links are broken under sudo; it presumably does work for PyTorch because PyTorch is distributed with its own CUDA and cuDNN binaries.
This leads me to believe the problem is with the driver installation. However, after several attempts at reinstalling the driver, and after downgrading back to version 384.90, it still doesn’t work.
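Since the scripts work under sudo but not as a normal user, one thing I checked is whether my user can actually access the NVIDIA device nodes, and whether an earlier sudo run left a root-owned compute cache behind. This is just a stdlib-only diagnostic sketch, not a fix; the paths checked are the usual defaults and may differ on other setups:

```python
# Diagnostic for the "works only under sudo" symptom: check that the current
# user has read/write access to the NVIDIA device nodes, and whether a
# root-owned ~/.nv compute cache is left over from a previous sudo run.
import glob
import os

nodes = glob.glob("/dev/nvidia*")
if not nodes:
    print("no /dev/nvidia* device nodes found")
for node in nodes:
    ok = os.access(node, os.R_OK | os.W_OK)
    print("%s: %s" % (node, "accessible" if ok else "NOT accessible as this user"))

cache = os.path.expanduser("~/.nv")
if os.path.isdir(cache):
    info = os.stat(cache)
    print("~/.nv owned by uid %d (current uid %d)" % (info.st_uid, os.getuid()))
else:
    print("no ~/.nv cache directory")
```

On my machine this all looks normal, but I am including it in case someone spots something I missed.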
nvidia-smi is also working fine and displays the correct information.
I am out of ideas. Is there anything I could try to fix this?
Note: I am using Python 3.5 with Anaconda. I have tested this in several environments, and tried both reinstalling and building both frameworks from source.
Both CUDA and the driver were installed using the runfiles.
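Because the runfile install replaced an existing CUDA 8.0 / cuDNN 6.0 setup, I also checked which libraries the dynamic loader resolves from a given environment, to rule out leftovers from the old install shadowing the new one. A quick stdlib-only sketch (run from each conda environment being tested):

```python
# Quick check: which CUDA runtime / cuDNN the dynamic loader would resolve
# for this environment, if any. Useful for spotting a stale CUDA 8.0 or
# cuDNN 6.0 library still visible to the linker after the upgrade.
import ctypes.util

results = {}
for name in ("cudart", "cudnn"):
    results[name] = ctypes.util.find_library(name)
    print("lib%s: %s" % (name, results[name] or "not found by the loader"))
```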
System: Ubuntu 16.04
GPU: GTX 1080