Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async

Originally published at: https://developer.nvidia.com/blog/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async/

Today’s leading-edge high performance computing (HPC) systems contain tens of thousands of GPUs. In NVIDIA systems, GPUs are connected on nodes through the NVLink scale-up interconnect, and across nodes through a scale-out network like InfiniBand. The software libraries that GPUs use to communicate, share work, and efficiently operate in parallel are collectively called NVIDIA Magnum…

Thanks Jim, Seth, Pak, and Sreeram for this thoughtful piece addressing situations where applications use smaller message sizes as the workload scales to larger numbers of GPUs. It's nice to see that Magnum IO (GPUDirect Async, NVSHMEM) helps NICs sustain high throughput on NVIDIA InfiniBand networks. Hint: GPU-initiated communication bypasses the CPU bottleneck.
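To make "GPU-initiated communication" concrete, here is a minimal sketch of an NVSHMEM program in which a CUDA kernel itself issues the network put, so the CPU is not on the critical communication path. This is an illustrative example, not code from the article; the kernel name `ring_put` and the launch configuration are assumptions, and running it requires the NVSHMEM runtime and a launcher such as `nvshmrun` across at least two PEs.

```cuda
// Illustrative sketch (not from the article): a device-initiated ring
// exchange with NVSHMEM. Each PE pushes its buffer to the next PE
// directly from the GPU kernel -- no CPU proxy on the data path.
#include <cuda_runtime.h>
#include <nvshmem.h>

__global__ void ring_put(int *dst, const int *src, int nelems) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // The GPU thread issues the put itself; with GPUDirect Async
        // (kernel-initiated), the NIC doorbell is rung from the GPU.
        nvshmem_int_put(dst, src, nelems, peer);
    }
}

int main(void) {
    nvshmem_init();
    const int nelems = 1024;
    // Symmetric-heap allocations, addressable by remote PEs.
    int *dst = (int *) nvshmem_malloc(nelems * sizeof(int));
    int *src = (int *) nvshmem_malloc(nelems * sizeof(int));

    ring_put<<<1, 32>>>(dst, src, nelems);
    cudaDeviceSynchronize();   // wait for the kernel's puts to be issued
    nvshmem_barrier_all();     // ensure all PEs' transfers have completed

    nvshmem_free(dst);
    nvshmem_free(src);
    nvshmem_finalize();
    return 0;
}
```

The design point the comment hints at: because the put is issued inside the kernel, many small messages can be generated at GPU thread rates instead of being funneled through a CPU proxy thread, which is exactly where small-message throughput otherwise collapses at scale.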