Conda environments: The pytorch and nvidia channels aren't playing nicely together and the nvidia channel is out of date

If I set up a conda pytorch environment like this:

conda activate pytorch-cuda
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

That works, at least to the extent that I can import torch in Python. If, however, I then add cuDNN:

conda install cudnn -c nvidia

Things are no longer warm and fuzzy:

(torch-cuda1) pgoetz@finglas ~$ python --version
Python 3.11.5
(torch-cuda1) pgoetz@finglas ~$ python
Python 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lusr/opt/miniconda/envs/torch-cuda1/lib/python3.11/site-packages/torch/__init__.py", line 229, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: /lusr/opt/miniconda/envs/torch-cuda1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so: undefined symbol: cudaMemPoolSetAttribute, version libcudart.so.11.0
>>> 
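
The undefined symbol in that traceback can be checked against a specific libcudart file without importing torch at all. Below is a minimal sketch using ctypes; the helper name has_symbol is my own, and the idea that the symbol is simply absent from older 11.x runtimes is an assumption on my part (as far as I know, cudaMemPoolSetAttribute belongs to the stream-ordered allocator API that arrived after CUDA 11.1).

```python
import ctypes


def has_symbol(lib_path, symbol):
    """Check whether a shared library exports a given symbol.

    Returns True or False if the library could be loaded, and None if
    loading failed (missing file, wrong architecture, unmet dependency).
    has_symbol is an illustrative helper, not part of any CUDA tooling.
    """
    try:
        lib = ctypes.CDLL(lib_path)
    except OSError:
        return None
    # ctypes resolves symbols lazily on attribute access and raises
    # AttributeError when the lookup fails, so hasattr() works here.
    return hasattr(lib, symbol)
```

Calling has_symbol(path, "cudaMemPoolSetAttribute") on each libcudart.so.* file in the environment's lib directory, ideally in a fresh interpreter per file, should show which of them can satisfy libc10_cuda.so.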

What’s happening is that the cuDNN conda package pulls in an older libcudart (11.1.74) and links libcudart.so.11.0 to it. Here is what is in /miniconda/envs/pytorch-cuda/lib before cuDNN is installed:

# ls -l libcudart*
-rwxr-xr-x 3 root root 695712 Sep 21  2022 libcudart.so.11.8.89

Here is what it looks like after the cudnn package is installed from the nvidia channel:

# ls -l libcudart*
lrwxrwxrwx 1 root root     20 Sep 25 13:12 libcudart.so -> libcudart.so.11.1.74
lrwxrwxrwx 1 root root     20 Sep 25 13:12 libcudart.so.11.0 -> libcudart.so.11.1.74
-rwxr-xr-x 2 root root 554032 Oct 14  2020 libcudart.so.11.1.74
-rwxr-xr-x 3 root root 695712 Sep 21  2022 libcudart.so.11.8.89

Something similar seems to be happening with libcusparse.so.11 and possibly other libraries; I didn’t try to track them all down.
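
To track down the other affected libraries without running ls by hand for each one, something like the following sketch could help. It lists every file matching a pattern in a lib directory together with its symlink target. The function name describe_libs and the default pattern are my own choices for illustration.

```python
import os
from pathlib import Path


def describe_libs(lib_dir, pattern="libcudart*"):
    """List (file name, symlink target) pairs for matching libraries.

    The target is None for regular files, so a soname link that points
    at an older versioned file is easy to spot in the output.
    """
    rows = []
    for path in sorted(Path(lib_dir).glob(pattern)):
        target = os.readlink(path) if path.is_symlink() else None
        rows.append((path.name, target))
    return rows


if __name__ == "__main__":
    import sys

    # Inside an activated conda environment on Linux, sys.prefix is the
    # environment root, so its lib directory is sys.prefix + "/lib".
    for name, target in describe_libs(os.path.join(sys.prefix, "lib")):
        print(name, "->", target if target else "(regular file)")
```

Passing pattern="libcusparse*" (or a broader pattern such as "libcu*") would cover the other CUDA libraries as well.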

The cuDNN packages in the nvidia conda channel appear to be extremely out of date:

(torch-cuda1) pgoetz@finglas ~$ conda search -c nvidia cudnn
Loading channels: done
# Name                       Version           Build  Channel             
cudnn                          7.0.5       cuda8.0_0  pkgs/main           
cudnn                          7.1.2       cuda9.0_0  pkgs/main           
cudnn                          7.1.3       cuda8.0_0  pkgs/main           
cudnn                          7.2.1       cuda9.2_0  pkgs/main           
cudnn                          7.3.1      cuda10.0_0  pkgs/main           
cudnn                          7.3.1       cuda9.0_0  pkgs/main           
cudnn                          7.3.1       cuda9.2_0  pkgs/main           
cudnn                          7.6.0      cuda10.0_0  nvidia              
cudnn                          7.6.0      cuda10.0_0  pkgs/main           
cudnn                          7.6.0      cuda10.1_0  nvidia              
cudnn                          7.6.0      cuda10.1_0  pkgs/main           
cudnn                          7.6.0       cuda9.0_0  pkgs/main           
cudnn                          7.6.0       cuda9.2_0  nvidia              
cudnn                          7.6.0       cuda9.2_0  pkgs/main           
cudnn                          7.6.4      cuda10.0_0  pkgs/main           
cudnn                          7.6.4      cuda10.1_0  pkgs/main           
cudnn                          7.6.4       cuda9.0_0  pkgs/main           
cudnn                          7.6.4       cuda9.2_0  pkgs/main           
cudnn                          7.6.5      cuda10.0_0  pkgs/main           
cudnn                          7.6.5      cuda10.1_0  pkgs/main           
cudnn                          7.6.5      cuda10.2_0  pkgs/main           
cudnn                          7.6.5       cuda9.0_0  pkgs/main           
cudnn                          7.6.5       cuda9.2_0  pkgs/main           
cudnn                          8.0.0      cuda10.2_0  nvidia              
cudnn                          8.0.0      cuda11.0_0  nvidia              
cudnn                          8.0.4      cuda10.1_0  nvidia              
cudnn                          8.0.4      cuda10.2_0  nvidia              
cudnn                          8.0.4      cuda11.0_0  nvidia              
cudnn                          8.0.4      cuda11.1_0  nvidia              
cudnn                          8.2.1      cuda11.3_0  pkgs/main           
cudnn                       8.9.2.26        cuda11_0  pkgs/main           

which is likely the source of the problem: the newest cuDNN in the nvidia channel is 8.0.4, built for CUDA 11.1 at the latest. If I install cudnn 8.9.2.26 from main instead, things seem to work; at least I can import torch without it crashing.

So, this is kind of a mess. I have to install CUDA from the nvidia channel (it’s not available elsewhere), but I should not use the nvidia channel for cuDNN, since the main channel has a much newer, if not the newest, version of these libraries. To make matters worse, the conda-forge channel also includes cudnn packages (through 8.8), and conda picks packages based on channel priority, so it’s pretty easy to get this wrong. Thoughts on the best strategy for dealing with this?
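
For anyone trying to audit an existing environment, here is a rough sketch that reports which channel each CUDA-related package was actually installed from. It assumes that `conda list --json` emits a list of records with "name", "version" and "channel" fields, which matches what I have seen but should be verified against your conda version. The function names and the package list are illustrative only.

```python
import json
import subprocess


def channels_by_package(records, names):
    """Map each requested package name to the channel it came from.

    `records` is the parsed output of `conda list --json` (assumed to be
    a list of dicts). Packages that are not installed are left out.
    """
    wanted = set(names)
    return {
        rec["name"]: rec.get("channel")
        for rec in records
        if rec.get("name") in wanted
    }


def load_conda_list():
    """Run `conda list --json` in the active environment and parse it."""
    result = subprocess.run(
        ["conda", "list", "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Package names taken from the install commands in this post.
    packages = ["pytorch", "pytorch-cuda", "cudnn"]
    for name, channel in sorted(channels_by_package(load_conda_list(), packages).items()):
        print(f"{name}: {channel}")
```

If cudnn shows up as coming from nvidia rather than pkgs/main, that would point to the stale package described above.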

Hi @pgoetz1 ,
Would you mind trying the PyTorch NGC container and letting us know if it works?

Thanks

Hi -

Yes, sure. Where can I find that?

Hi @pgoetz1 ,

Please find the link for the same

Thanks