Are there any PyTorch versions supporting torch.distributed with the nccl backend on the Jetson Orin Nano?

I’ve tried the torch-1.11 wheel from the “PyTorch for Jetson” thread (Jetson & Embedded Systems / Announcements, NVIDIA Developer Forums) on a Jetson Orin Nano. Both torch.cuda.is_available() and torch.distributed.is_mpi_available() return True, but when I transmit a tensor that lives on the GPU through torch.distributed with the mpi backend, the following error occurs:

RuntimeError: CUDA tensor detected and the MPI used doesn’t have CUDA-aware MPI support

I’d like to ask: what is the easiest way to get distributed communication working for CUDA tensors? As mentioned earlier, do you mean the NCCL backend becomes available on JetPack 6.1 or 6.2?
And on my current system (JetPack 5.1.2 with CUDA 11.4 on the Jetson Orin Nano), is the only way to communicate CUDA tensors to move them to the CPU, transmit them to the other device, and then reload them onto the GPU?
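For reference, this is the CPU-staging pattern I mean. It is a minimal single-process sketch using the gloo backend (which is CPU-based and does not require CUDA-aware MPI); world_size=1 and the localhost rendezvous address are just placeholders so the snippet runs standalone:

```python
import os
import torch
import torch.distributed as dist

# Single-process rendezvous just so this sketch runs on its own.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Use the GPU when present (as on the Orin Nano), otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
gpu_tensor = torch.tensor([1.0, 2.0, 3.0], device=device)

# Stage the communication through host memory:
cpu_tensor = gpu_tensor.cpu()   # 1. copy device tensor to the host
dist.all_reduce(cpu_tensor)     # 2. communicate the CPU tensor via gloo
result = cpu_tensor.to(device)  # 3. copy the result back to the device

print(result.cpu().tolist())
dist.destroy_process_group()
```

This works, but every collective pays for two extra host/device copies, which is exactly the overhead I was hoping to avoid with NCCL or CUDA-aware MPI.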
Many thanks for your reply!