CUDA + PyTorch + Package Manager on GH200

What is the current best practice for running PyTorch on the GH200 with package management?

The ideal solution would be to have the PyTorch installation within a conda environment, but this is not yet available, as mentioned here.

Another alternative is the NGC container, but package installations aren’t persistent when working with the Singularity .sif format. I can convert the .sif into a writable sandbox, i.e. singularity build --sandbox pytorch pytorch.sif, but I don’t have write access to the site-packages directory for pip installations. In any case, I would prefer not to rebuild the .sif after every package installation, and maintaining the sandbox version eats into the file-count quota of my cluster directory.
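For what it’s worth, if the cluster’s Singularity (or Apptainer) is recent enough, a writable ext3 overlay makes pip installs persist without unpacking the .sif into a sandbox, and it is a single image file, so it doesn’t eat into a file-count quota. A sketch, assuming hypothetical file names and a 1 GiB overlay:

```shell
# Hypothetical file names; requires Singularity >= 3.8 (or Apptainer).
SIF=pytorch.sif
OVL=pytorch-overlay.img

if command -v singularity >/dev/null 2>&1; then
    # create the writable overlay once (size in MiB)
    singularity overlay create --size 1024 "$OVL"
    # pip installs now land in the overlay, not the read-only .sif
    singularity exec --nv --overlay "$OVL" "$SIF" pip install tensorboard
else
    echo "singularity not found; run this on the cluster"
fi
```

Subsequent `singularity exec --overlay` runs see everything installed into the overlay, so there is no rebuild step between installs.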

A third option would be to pip install the wheels found here locally, but these don’t include, e.g., torchvision.
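If the wheel route otherwise works, torchvision can usually be built from source against the installed torch — it is a much smaller build than PyTorch itself. A sketch, with hypothetical wheel and tag names (pick the torchvision tag that matches the installed torch version):

```shell
# hypothetical wheel name; use the aarch64 wheel actually downloaded
if command -v pip >/dev/null 2>&1 && [ -f torch-*-linux_aarch64.whl ]; then
    pip install --user torch-*-linux_aarch64.whl
    # --no-build-isolation makes torchvision compile against the
    # already-installed torch instead of pulling its own copy
    pip install --user --no-build-isolation \
        "git+https://github.com/pytorch/vision.git@v0.16.0"
else
    echo "aarch64 torch wheel not present; download it first"
fi
```

The build needs a working C++ toolchain on the login/compute node, but not the full PyTorch build machinery.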

Lastly, I tried building PyTorch from source in a new conda environment, but quickly ran into issues with the build process when using the compiler in the HPC SDK.
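On the source-build issue: PyTorch’s build system is exercised with the GNU toolchain (or clang) plus nvcc, not with the HPC SDK’s nvc/nvc++, so pointing the host compiler at gcc while keeping the SDK’s nvcc often gets further. A sketch, with a hypothetical CUDA path:

```shell
# assumption: a GNU gcc/g++ is available (e.g. via a module); PyTorch's
# build expects gcc/g++ + nvcc as the host/device compilers, not nvc/nvc++
export CC=gcc
export CXX=g++
export CUDACXX=/usr/local/cuda/bin/nvcc   # hypothetical; point at the SDK's nvcc

echo "building with CC=$CC CXX=$CXX"
# then, inside the PyTorch checkout:
#   python setup.py develop
```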

You might wish to ask on discuss.pytorch.org.

ptrblck there is an NVIDIA expert on PyTorch and is pretty active.