I installed the PyTorch nightly (beta) build for CUDA 12.8 with “pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128”.
The install succeeded and gave me the following packages:
torch 2.7.0.dev20250218+cu128
torchaudio 2.6.0.dev20250219+cu128
torchvision 0.22.0.dev20250219+cu128
The system NCCL packages are version 2.25.1+cuda12.8, as shown below:
dpkg-query --showformat='${Package} ${Version}\n' --show libnccl2 libnccl-dev
libnccl-dev 2.25.1-1+cuda12.8
libnccl2 2.25.1-1+cuda12.8
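To compare these system packages with what PyTorch itself reports, a check along the following lines may help. This is only a sketch: `report_versions` is a name I made up, and the pip package name `nvidia-nccl-cu12` is my assumption about where the wheel might get its NCCL from. If torch is not importable, it simply says so.

```python
import importlib.metadata as md


def report_versions():
    """Collect the versions PyTorch reports (hypothetical helper)."""
    info = {}
    try:
        import torch
        import torch.cuda.nccl
    except ImportError:
        info["torch"] = None  # torch is not installed in this environment
        return info

    info["torch"] = torch.__version__
    # CUDA version the wheel was built against (None for CPU-only builds)
    info["cuda_build"] = torch.version.cuda
    try:
        # NCCL version that PyTorch sees, e.g. a tuple like (2, 25, 1)
        info["nccl_seen_by_torch"] = torch.cuda.nccl.version()
    except Exception as exc:
        info["nccl_seen_by_torch"] = f"unavailable: {exc}"
    try:
        # Assumption: the wheel may depend on this pip package for NCCL
        info["nvidia-nccl-cu12"] = md.version("nvidia-nccl-cu12")
    except md.PackageNotFoundError:
        info["nvidia-nccl-cu12"] = None
    return info


if __name__ == "__main__":
    for key, value in report_versions().items():
        print(f"{key}: {value}")
```

If the NCCL reported here differs from the dpkg-query output, that would suggest PyTorch is not using the system libnccl2, but I have not confirmed this.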
However, when I run my code, I hit two problems.
First problem: the NCCL INFO log reports the NCCL version as 2.25.1+cuda12.2, not cuda12.8, as shown in the image below.
Second problem: the run fails with “Cuda failure 1 ‘invalid argument’”:
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3384, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.25.1
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 1 'invalid argument'
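Since the message suggests running with NCCL_DEBUG=INFO, a minimal single-process check like the sketch below could help separate an NCCL/CUDA setup problem from a problem in my own code. It is an assumption-laden sketch, not my actual training script: `nccl_smoke_test` is a made-up name, the address 127.0.0.1 and port 29500 are arbitrary choices, and it only uses GPU 0.

```python
import os


def nccl_smoke_test():
    """Run one all_reduce on a single GPU and return a status string."""
    # Ask NCCL for verbose logs, as the error message suggests
    os.environ.setdefault("NCCL_DEBUG", "INFO")
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        return "torch not installed"

    if not torch.cuda.is_available():
        return "CUDA not available"
    if not dist.is_available() or not dist.is_nccl_available():
        return "NCCL backend not available"

    # Rendezvous settings for a one-process group (arbitrary local values)
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    try:
        torch.cuda.set_device(0)
        dist.init_process_group("nccl", rank=0, world_size=1)
        x = torch.ones(4, device="cuda")
        dist.all_reduce(x)  # first collective triggers NCCL initialization
        torch.cuda.synchronize()
        return f"all_reduce ok: {x.tolist()}"
    except Exception as exc:
        return f"failed: {exc}"
    finally:
        if dist.is_initialized():
            dist.destroy_process_group()


if __name__ == "__main__":
    print(nccl_smoke_test())
```

If this small test already fails with the same “invalid argument” error, the problem is probably in the CUDA/NCCL/driver combination and not in my model code.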
Why does the NCCL INFO log show the NCCL version as 2.25.1+cuda12.2? Isn’t this the CUDA 12.8 build of PyTorch?
How can I fix the “Cuda failure 1 ‘invalid argument’” error?
Please help me! Thank you!