Moving docker+gpu images between Ubuntu native and Win 11/WSL2

Hi all,

I am trying to create Docker images with PyTorch and GPU support enabled that I can move between Ubuntu native and Win 11/WSL2, and I am hitting some issues.

My setup involves two computers (ZBooks):

  1. Ubuntu native, with Docker, the NVIDIA drivers, and the NVIDIA Container Toolkit installed.
  2. Win11, WSL2, docker installed following the instructions available here: CUDA on WSL :: CUDA Toolkit Documentation

I can successfully build Docker images on both machines and run them with GPU support enabled (docker run --gpus all …). “nvidia-smi” works in both. After installing torch in both, “torch.cuda.is_available()” returns True in both.
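For reference, the checks I run on both machines look roughly like this (the image name my-pytorch-image is just a placeholder for my own image):

    # Driver visible inside the container?
    docker run --rm --gpus all my-pytorch-image nvidia-smi

    # PyTorch able to see the GPU?
    docker run --rm --gpus all my-pytorch-image \
        python -c "import torch; print(torch.cuda.is_available())"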

What I want to do is create an image in computer 1 with gpu and pytorch installed, save it (docker save …), move the saved image to computer 2, load it (docker load …), and check that it runs in computer 2 with the gpu working as above. Similarly from computer 2 to computer 1. But it doesn’t work without hacks.
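Concretely, the transfer looks roughly like this (image name, tag, and file paths are placeholders):

    # On computer 1: save the image to a tarball
    docker save -o pytorch-gpu.tar my-pytorch-image:latest

    # Copy the tarball across (scp here, but any transfer works)
    scp pytorch-gpu.tar user@computer2:/home/user/

    # On computer 2: load the image, then repeat the GPU checks above
    docker load -i pytorch-gpu.tar
    docker run --rm --gpus all my-pytorch-image:latest nvidia-smi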

This is what happens from computer 1 to 2:

  • create an image in computer 1 with gpu and pytorch installed, run with “docker run --gpus all …”, and test with “torch.cuda.is_available()”: WORKS
  • save image in computer 1 (docker save…), load in computer 2 (docker load…): WORKS
  • run in computer 2 with “docker run --gpus all …”: FAILS

This is what happens from computer 2 to 1:

  • create an image in computer 2 with gpu and pytorch installed, run with “docker run --gpus all …”, and test with “torch.cuda.is_available()”: WORKS
  • save image in computer 2 (docker save…), load in computer 1 (docker load…): WORKS
  • run in computer 1 with gpu flags (“docker run --gpus all …”): WORKS
  • test the gpu is working properly in computer 1 with “torch.cuda.is_available()”: FAILS

The problem is the symbolic links “/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1” and “/usr/lib/x86_64-linux-gnu/libcuda.so.1”, which end up either present when they shouldn’t be or missing after the image is moved to the other platform.
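You can see the state of those links by inspecting the paths inside a running container on each machine (image name is again a placeholder):

    docker run --rm -it my-pytorch-image bash
    # inside the container:
    ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 \
          /usr/lib/x86_64-linux-gnu/libcuda.so.1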

Hack fix for 1 to 2 (full command sequence sketched after this list):

  • run in computer 2 without gpu flags (docker run -it …): WORKS
  • remove the symbolic links libnvidia-ml.so.1 and libcuda.so.1: “cd /usr/lib/x86_64-linux-gnu” followed by “rm libnvidia-ml.so.1” followed by “rm libcuda.so.1”: WORKS
  • save as a new image (docker commit -m …): WORKS
  • run the new image in computer 2 with the gpu flags (docker run --gpus all …): WORKS
  • test the gpu is working properly in computer 2 with “torch.cuda.is_available()”: WORKS
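Put together, the 1-to-2 workaround is roughly this (image names and the container ID are placeholders):

    # On computer 2: start the loaded image WITHOUT the gpu flags
    docker run -it my-pytorch-image bash

    # Inside the container: remove the stale symlinks
    cd /usr/lib/x86_64-linux-gnu
    rm libnvidia-ml.so.1 libcuda.so.1
    exit

    # Back on the host: commit the change as a new image
    docker commit -m "removed nvidia symlinks" <container-id> my-pytorch-image:fixed

    # Run the new image with the gpu flags and re-test
    docker run --rm --gpus all my-pytorch-image:fixed \
        python -c "import torch; print(torch.cuda.is_available())"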

Hack fix for 2 to 1 (likewise sketched after this list):

  • run the image on computer 1 with gpu flags (docker run --gpus all …): WORKS
  • create the symbolic links libnvidia-ml.so.1 and libcuda.so.1: “cd /usr/lib/x86_64-linux-gnu” followed by “ln -s libnvidia-ml.so.470.74 libnvidia-ml.so.1” followed by “ln -s libcuda.so.470.74 libcuda.so.1”: WORKS
  • test the gpu is working properly in computer 1 with “torch.cuda.is_available()”: WORKS
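For completeness, the 2-to-1 workaround is roughly this (image name is a placeholder; 470.74 is simply the driver version installed on my hosts):

    # On computer 1: run the loaded image WITH the gpu flags
    docker run -it --gpus all my-pytorch-image bash

    # Inside the container: recreate the missing symlinks
    cd /usr/lib/x86_64-linux-gnu
    ln -s libnvidia-ml.so.470.74 libnvidia-ml.so.1
    ln -s libcuda.so.470.74 libcuda.so.1

    # Still inside the container: confirm PyTorch sees the GPU
    python -c "import torch; print(torch.cuda.is_available())"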

If anyone understands why this is happening, or knows a better solution, please let me know.