If I set up a conda PyTorch environment like this:
conda create -n pytorch-cuda
conda activate pytorch-cuda
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
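As a quick sanity check that the CUDA build of torch actually loads, I run something like:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"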
That works, at least insofar as I can import torch in Python. If, however, I add cuDNN:
conda install cudnn -c nvidia
Things are no longer warm and fuzzy:
(torch-cuda1) pgoetz@finglas ~$ python --version
Python 3.11.5
(torch-cuda1) pgoetz@finglas ~$ python
Python 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/lusr/opt/miniconda/envs/torch-cuda1/lib/python3.11/site-packages/torch/__init__.py", line 229, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: /lusr/opt/miniconda/envs/torch-cuda1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so: undefined symbol: cudaMemPoolSetAttribute, version libcudart.so.11.0
>>>
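One way to confirm where the missing symbol comes from is to check which libcudart the torch extension actually resolves against, and whether that runtime exports the symbol (with the environment active, $CONDA_PREFIX points at its prefix):

# which libcudart does the torch extension resolve against?
ldd $CONDA_PREFIX/lib/python3.11/site-packages/torch/lib/libc10_cuda.so | grep cudart
# does that runtime export the symbol the error complains about? (no output = missing)
nm -D $CONDA_PREFIX/lib/libcudart.so.11.0 | grep cudaMemPoolSetAttribute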
What’s happening is that the cuDNN conda package pulls in an older CUDA 11.1 runtime and points the libcudart.so.11.0 symlink at it, so torch ends up resolving its runtime symbols against a library that (as far as I can tell) predates the memory-pool APIs such as cudaMemPoolSetAttribute, which were added in CUDA 11.2. Here is what is in /miniconda/envs/pytorch-cuda/lib before cuDNN is installed:
# ls -l libcudart*
-rwxr-xr-x 3 root root 695712 Sep 21 2022 libcudart.so.11.8.89
Here is what it looks like after the cudnn package is installed from the nvidia channel:
# ls -l libcudart*
lrwxrwxrwx 1 root root 20 Sep 25 13:12 libcudart.so -> libcudart.so.11.1.74
lrwxrwxrwx 1 root root 20 Sep 25 13:12 libcudart.so.11.0 -> libcudart.so.11.1.74
-rwxr-xr-x 2 root root 554032 Oct 14 2020 libcudart.so.11.1.74
-rwxr-xr-x 3 root root 695712 Sep 21 2022 libcudart.so.11.8.89
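Incidentally, conda's revision history makes it easy to see exactly what the cudnn install changed, and to undo it (the <N> below is whatever revision number predates the cudnn install):

conda list --revisions
conda install --revision <N>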
It looks like something similar is happening with libcusparse.so.11, and possibly other libraries; I didn’t try to track them all down.
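A quick way to look for other libraries in the same state is to scan the environment for multiple versions of the same CUDA library:

ls -l $CONDA_PREFIX/lib/libcu*.so.*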
It looks like the cuDNN packages in the nvidia conda channel are extremely out of date:
(torch-cuda1) pgoetz@finglas ~$ conda search -c nvidia cudnn
Loading channels: done
# Name     Version       Build         Channel
cudnn      7.0.5         cuda8.0_0     pkgs/main
cudnn      7.1.2         cuda9.0_0     pkgs/main
cudnn      7.1.3         cuda8.0_0     pkgs/main
cudnn      7.2.1         cuda9.2_0     pkgs/main
cudnn      7.3.1         cuda10.0_0    pkgs/main
cudnn      7.3.1         cuda9.0_0     pkgs/main
cudnn      7.3.1         cuda9.2_0     pkgs/main
cudnn      7.6.0         cuda10.0_0    nvidia
cudnn      7.6.0         cuda10.0_0    pkgs/main
cudnn      7.6.0         cuda10.1_0    nvidia
cudnn      7.6.0         cuda10.1_0    pkgs/main
cudnn      7.6.0         cuda9.0_0     pkgs/main
cudnn      7.6.0         cuda9.2_0     nvidia
cudnn      7.6.0         cuda9.2_0     pkgs/main
cudnn      7.6.4         cuda10.0_0    pkgs/main
cudnn      7.6.4         cuda10.1_0    pkgs/main
cudnn      7.6.4         cuda9.0_0     pkgs/main
cudnn      7.6.4         cuda9.2_0     pkgs/main
cudnn      7.6.5         cuda10.0_0    pkgs/main
cudnn      7.6.5         cuda10.1_0    pkgs/main
cudnn      7.6.5         cuda10.2_0    pkgs/main
cudnn      7.6.5         cuda9.0_0     pkgs/main
cudnn      7.6.5         cuda9.2_0     pkgs/main
cudnn      8.0.0         cuda10.2_0    nvidia
cudnn      8.0.0         cuda11.0_0    nvidia
cudnn      8.0.4         cuda10.1_0    nvidia
cudnn      8.0.4         cuda10.2_0    nvidia
cudnn      8.0.4         cuda11.0_0    nvidia
cudnn      8.0.4         cuda11.1_0    nvidia
cudnn      8.2.1         cuda11.3_0    pkgs/main
cudnn      8.9.2.26      cuda11_0      pkgs/main
This is likely the source of the problem. If I install cudnn 8.9.2.26 from main, things seem to work; well, at least I can import torch without crashing out.
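For reference, one way to force that install from defaults/main, ignoring whatever channels are configured, is something like:

conda install --override-channels -c defaults cudnn=8.9.2.26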
So, this is kind of a mess. I have to install CUDA from the nvidia channel (it’s not available elsewhere), but I should not use the nvidia channel for cudnn, since the main channel has a much newer (if not the newest) version of these libraries. To make matters worse, the conda-forge channel also includes cudnn packages (through 8.8), and conda resolves packages based on channel priority, so it’s pretty easy to get this wrong. Thoughts on the best strategy for dealing with this?
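One approach I’ve been experimenting with, and I’m not sure it’s the right one, is to pin the channel per package in the environment file (the channel::package spec syntax; I’m less sure the defaults:: spelling is the cleanest way to point at pkgs/main) and turn on strict channel priority for the env. Roughly:

name: pytorch-cuda
channels:
  - pytorch
  - nvidia
  - defaults
dependencies:
  - python=3.11
  - pytorch
  - torchvision
  - torchaudio
  - pytorch-cuda=11.8
  - defaults::cudnn=8.9.2.26   # pin cudnn to main/defaults so the nvidia channel's old builds can't win

plus, with the environment active:

conda config --env --set channel_priority strict

I don’t know yet whether an explicit defaults:: pin always wins under strict priority, which is part of why I’m asking.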