Originally published at: https://developer.nvidia.com/blog/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async/
Today’s leading-edge high performance computing (HPC) systems contain tens of thousands of GPUs. In NVIDIA systems, GPUs are connected on nodes through the NVLink scale-up interconnect, and across nodes through a scale-out network like InfiniBand. The software libraries that GPUs use to communicate, share work, and efficiently operate in parallel are collectively called NVIDIA Magnum…
Thanks Jim, Seth, Pak, and Sreeram for this thoughtful piece addressing situations where applications use smaller message sizes as the workload scales to larger numbers of GPUs. It's nice to see that Magnum IO (GPUDirect Async, NVSHMEM) helps NICs sustain high throughput on NVIDIA InfiniBand networks. Hint: GPU-initiated communications bypassing the CPU bottleneck.
Thank you for the interesting article.
Do you intend to publish the lower-level RDMA API for GPUs (i.e., InfiniBand GPUDirect Async) for users who wish to use RDMA without the shared-memory abstraction?
All I found is libgdsync (GitHub - gpudirect/libgdsync: GPUDirect Async support for IB Verbs), which doesn't seem to have seen any development in the last few years.
Best,
Lasse
As mentioned above: "This was set with one thread per thread block and one QP (NIC queue pair, containing the WQ and CQ) per thread block." How can different QPs be used in different thread blocks?
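For context, here is a minimal sketch of that launch pattern, not the authors' actual benchmark code: one thread per block issues non-blocking puts through the standard NVSHMEM host and device APIs, and the GPU-initiated (IBGDA) transport can then map each block (CTA) to its own queue pair. The buffer name, block count, and message size below are illustrative assumptions, and how QPs are assigned to blocks is controlled by the transport's runtime configuration (for example, the NVSHMEM_IBGDA_* environment variables, if I recall the docs correctly) rather than by anything in the kernel itself.

```
// Sketch only: one communicating thread per thread block, so per-block traffic
// can be spread across distinct QPs by the GPU-initiated transport.
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void one_thread_per_block_put(char *sym_buf, size_t msg_size, int peer)
{
    // Only thread 0 of each block communicates; the rest of the block stays idle.
    if (threadIdx.x == 0) {
        size_t offset = (size_t)blockIdx.x * msg_size;
        // Non-blocking put from this block's slice of the symmetric buffer to the
        // same slice on the peer PE.
        nvshmem_putmem_nbi(sym_buf + offset, sym_buf + offset, msg_size, peer);
        nvshmem_quiet();  // wait for this block's puts to complete
    }
}

int main()
{
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;

    const int nblocks = 64;          // illustrative block count
    const size_t msg_size = 1024;    // illustrative small-message size in bytes

    // Symmetric allocation: one msg_size slice per thread block.
    char *sym_buf = (char *)nvshmem_malloc((size_t)nblocks * msg_size);

    // One thread per thread block, matching the configuration quoted above.
    one_thread_per_block_put<<<nblocks, 1>>>(sym_buf, msg_size, peer);
    cudaDeviceSynchronize();

    nvshmem_free(sym_buf);
    nvshmem_finalize();
    return 0;
}
```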