Hi all,
I am trying to create Docker images, with PyTorch and GPU support enabled, which I can move between native Ubuntu and Win 11/WSL2, and I am hitting some issues.
My setup involves two computers (z-books):
- Computer 1: native Ubuntu, with Docker, the NVIDIA drivers, and the NVIDIA Container Toolkit installed.
- Computer 2: Win 11 with WSL2 and Docker, installed following the instructions available here: CUDA on WSL :: CUDA Toolkit Documentation
I can successfully build Docker images on both machines and run them with GPU support enabled (docker run --gpus all …). “nvidia-smi” works on both, and after installing torch, “torch.cuda.is_available()” returns True on both.
What I want to do is create an image on computer 1 with GPU support and PyTorch installed, save it (docker save …), move the saved image to computer 2, load it (docker load …), and verify that it runs on computer 2 with the GPU working as above. Similarly from computer 2 to computer 1. But it doesn’t work without hacks.
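For reference, this is the exact transfer workflow I am using (the image name pytorch-gpu is just a placeholder for illustration):

```shell
# On computer 1: export the image to a tarball
docker save -o pytorch-gpu.tar pytorch-gpu:latest

# Copy pytorch-gpu.tar to computer 2 (scp, USB, ...), then load it there:
docker load -i pytorch-gpu.tar

# Run with GPU support and check whether PyTorch sees the GPU
docker run --rm --gpus all pytorch-gpu:latest \
    python -c "import torch; print(torch.cuda.is_available())"
```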
This is what happens from computer 1 to 2:
- create an image on computer 1 with GPU support and PyTorch installed, run it with “docker run --gpus all …”, and test with “torch.cuda.is_available()”: WORKS
- save the image on computer 1 (docker save …), load it on computer 2 (docker load …): WORKS
- run on computer 2 with “docker run --gpus all …”: FAILS
This is what happens from computer 2 to 1:
- create an image on computer 2 with GPU support and PyTorch installed, run it with “docker run --gpus all …”, and test with “torch.cuda.is_available()”: WORKS
- save the image on computer 2 (docker save …), load it on computer 1 (docker load …): WORKS
- run on computer 1 with the GPU flags (“docker run --gpus all …”): WORKS
- test that the GPU is working properly on computer 1 with “torch.cuda.is_available()”: FAILS
The problem is the symbolic links “/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1” and “/usr/lib/x86_64-linux-gnu/libcuda.so.1”. As far as I can tell, the NVIDIA runtime injects the host’s driver libraries into the container at run time, so links baked into a saved image can end up pointing at a driver version that does not exist on the other machine.
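The mismatch is visible by listing those driver libraries inside a running container on each machine (same paths as above):

```shell
# Inside the container: show which libcuda/libnvidia-ml files are present
# and where the .so.1 symlinks point (the -> column of ls -l)
ls -l /usr/lib/x86_64-linux-gnu/libcuda.so* \
      /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*
```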
Hack fix for 1 to 2:
- run on computer 2 without the GPU flags (docker run -it …): WORKS
- remove the symbolic links libnvidia-ml.so.1 and libcuda.so.1: “cd /usr/lib/x86_64-linux-gnu”, then “rm libnvidia-ml.so.1”, then “rm libcuda.so.1”: WORKS
- save as a new image (docker commit -m …): WORKS
- run the new image on computer 2 with the GPU flags (docker run --gpus all …): WORKS
- test that the GPU is working properly on computer 2 with “torch.cuda.is_available()”: WORKS
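Consolidated, the fix above boils down to this (written as a small helper so the library directory is explicit; /usr/lib/x86_64-linux-gnu is the path from the steps above):

```shell
# remove_driver_links: delete the stale libnvidia-ml/libcuda symlinks so the
# NVIDIA runtime on the target machine can inject its own driver libraries
remove_driver_links() {
    libdir="$1"
    rm -f "$libdir/libnvidia-ml.so.1" "$libdir/libcuda.so.1"
}

# Inside the container on computer 2 (started WITHOUT --gpus), run:
#   remove_driver_links /usr/lib/x86_64-linux-gnu
# then save the result from the host with docker commit, as above.
```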
Hack fix for 2 to 1:
- run the image on computer 1 with the GPU flags (docker run --gpus all …): WORKS
- create the symbolic links libnvidia-ml.so.1 and libcuda.so.1 (470.74 is the driver version on my machine): “cd /usr/lib/x86_64-linux-gnu”, then “ln -s libnvidia-ml.so.470.74 libnvidia-ml.so.1”, then “ln -s libcuda.so.470.74 libcuda.so.1”: WORKS
- test that the GPU is working properly on computer 1 with “torch.cuda.is_available()”: WORKS
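Again consolidated as a helper (470.74 is the driver version from my setup; substitute whatever libcuda.so.* version your container actually has):

```shell
# create_driver_links: recreate the .so.1 symlinks that the CUDA/NVML loaders
# look for, pointing them at a concrete driver-library version
create_driver_links() {
    libdir="$1"
    version="$2"
    ln -sf "libnvidia-ml.so.$version" "$libdir/libnvidia-ml.so.1"
    ln -sf "libcuda.so.$version" "$libdir/libcuda.so.1"
}

# Inside the container on computer 1, run:
#   create_driver_links /usr/lib/x86_64-linux-gnu 470.74
```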
If anyone understands why this is happening, or knows a better solution, please let me know.