I am currently working on a PoC for a new project that requires HPC across multiple nodes.
One of the requirements is multi-GPU shared memory accessible from a CUDA kernel.
After reviewing the available technologies, we decided to focus on the following two:
- NVSHMEM + MPI
- NCCL + MPI
Each technology has an advantage and a disadvantage.
NVSHMEM offers symmetric memory across multiple nodes, which lets a CUDA kernel access another GPU's memory space directly.
Its major disadvantage, for our use case, is the lack of half-precision support.
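To illustrate what we mean by kernel-initiated access to a remote GPU, here is a minimal, hypothetical NVSHMEM sketch (buffer name and sizes are mine, not from the project): each PE allocates a symmetric buffer and writes one value into its peer's copy from inside a kernel.

```cuda
// Hypothetical sketch only: assumes NVSHMEM is installed and its
// bootstrap (e.g. via MPI) is configured; names are illustrative.
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void put_to_peer(float *sym_buf, int peer) {
    // nvshmem_float_p performs a remote write into the symmetric
    // allocation on the target PE, from inside the CUDA kernel.
    if (threadIdx.x == 0)
        nvshmem_float_p(sym_buf, 42.0f, peer);
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;   // ring neighbor, for illustration

    // Symmetric allocation: same size on every PE, remotely accessible.
    float *sym_buf = (float *) nvshmem_malloc(sizeof(float));

    put_to_peer<<<1, 1>>>(sym_buf, peer);
    nvshmem_barrier_all();          // complete remote writes before reuse

    nvshmem_free(sym_buf);
    nvshmem_finalize();
    return 0;
}
```

Note that `nvshmem_float_p` works on `float`; this is exactly where the missing half-precision path hurts us.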
By contrast, NCCL supports half precision but lacks the capability of accessing another GPU's memory from within a CUDA kernel.
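For comparison, the half-precision path in NCCL is host-initiated collectives such as `ncclAllReduce` with the `ncclHalf` datatype. A minimal single-process sketch (a real multi-node setup would use `ncclCommInitRank` with an MPI-distributed unique ID; sizes and names here are illustrative):

```cuda
// Hypothetical sketch: trivial one-GPU communicator, half-precision
// all-reduce launched from the host (no kernel-side remote access).
#include <nccl.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const size_t count = 1024;
    int dev = 0;
    cudaSetDevice(dev);

    ncclComm_t comm;
    ncclCommInitAll(&comm, 1, &dev);   // single-GPU communicator for the sketch

    __half *sendbuf, *recvbuf;
    cudaMalloc(&sendbuf, count * sizeof(__half));
    cudaMalloc(&recvbuf, count * sizeof(__half));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // ncclHalf (a.k.a. ncclFloat16) selects the half-precision datatype.
    ncclAllReduce(sendbuf, recvbuf, count, ncclHalf, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);

    cudaFree(sendbuf);
    cudaFree(recvbuf);
    ncclCommDestroy(comm);
    return 0;
}
```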
From my point of view, NVSHMEM would be the technology of choice, but only if half precision is supported.
Otherwise I would have to use NCCL and give up accessing another GPU node's memory from a CUDA kernel.
In order to decide which technology to choose, I would like to know whether there is any plan to add half-precision support to NVSHMEM.