NVSHMEM applicability in a setup with two PCIe-connected GPUs

Hi all,

I am currently working on a project related to NVSHMEM. My current hardware setup is two Titan Xp GPUs without SLI, installed in the same host via PCIe 3.0 x16. I am not sure whether this is a valid setup for using NVSHMEM for single-machine, cross-GPU communication.

Thanks!

NVSHMEM is supported only on data-center-class GPUs of the Volta architecture and later, primarily due to its reliance on GPU features like independent thread scheduling and communication technologies like P2P and RDMA: https://docs.nvidia.com/hpc-sdk/nvshmem/pdf/NVSHMEM-Installation-Guide.pdf.

Titan Xp, which is based on Pascal, does support CUDA P2P over PCIe if the two GPUs are attached to the same CPU socket (you can test this using p2pBandwidthLatencyTest from the CUDA samples). With that in place, you can run simple NVSHMEM examples (like shmem_p_bw under perftest) that use the p/g/put/get APIs (including the on_stream and device variants). But performance will likely be quite poor, and more complex applications can end up in deadlocks.
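If you want to sanity-check that prerequisite from your own code first, the short host-side sketch below does the same peer-access query that p2pBandwidthLatencyTest performs (error handling omitted for brevity; device ordinals 0 and 1 are assumed):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: verify CUDA P2P is available between GPU 0 and GPU 1
// before attempting NVSHMEM on PCIe-connected cards.
int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);  // can device 0 access device 1?
    cudaDeviceCanAccessPeer(&can10, 1, 0);  // and the reverse direction?
    printf("P2P 0->1: %d, 1->0: %d\n", can01, can10);

    if (can01 && can10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // enable 0 -> 1
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);   // enable 1 -> 0
        printf("Peer access enabled in both directions\n");
    }
    return 0;
}
```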

Thanks for your reply.

When I run my NVSHMEM application on a DGX (4 Tesla V100 16GB), it shows me a message like

src/comm/transports/ibrc/ibrc.cpp: NULL value get_device_list failed

From the official Q&A, it says:

A: This occurs when ibverbs library is present on the system but the library is not able to detect any InfiniBand devices on the system. Make sure that the InfiniBand devices are available and/or are in a working state.

But the program seems to run successfully even though it shows this message. I am only using the 4 GPUs on the same CPU host.

Will this impact the performance?

Thanks

If the GPUs are P2P/PCIe connected, you don’t need IB, and the warning won’t impact performance.
To get rid of the warning, you can set the runtime environment variable NVSHMEM_REMOTE_TRANSPORT="none" if you are using NVSHMEM 2.1.2.
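If it is more convenient than exporting the variable in the shell, it can also be set from the program itself. This is just a minimal sketch on my part (not from the docs), relying only on the fact that NVSHMEM reads its environment variables when it initializes:

```cpp
#include <cstdlib>
#include <nvshmem.h>

int main() {
    // Equivalent to: export NVSHMEM_REMOTE_TRANSPORT=none
    // Must be set before nvshmem_init so the runtime picks it up.
    setenv("NVSHMEM_REMOTE_TRANSPORT", "none", 1);
    nvshmem_init();
    // ... application work on the P2P-connected GPUs ...
    nvshmem_finalize();
    return 0;
}
```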

Thanks for your helpful suggestion.

I have a question about the overhead of the blocking GET versus the non-blocking GET. A blocking GET basically has two phases: one to send the request and a second to wait for the results to come back. With a non-blocking GET, we still need to use quiet or barrier for synchronization if we need those non-blocking GETs to finish.
Are there any statistics on the latency in cycles for each of these blocking/non-blocking GET APIs?

If I transfer a batch of data (like a matrix) each time instead of a single vector, will the transfer performance of NVSHMEM improve? I think there might be a trade-off between transfer performance (throughput) and the overlap of computation and communication (i.e., coarse-grained bulk transfers would benefit throughput, while fine-grained transfers would improve the overlap of computation and communication).
Does this claim make sense?

Also, comparing GET with PUT, does PUT come with lower overhead than GET?

Thanks a lot for your great help!

Hi Wong,

  1. The answer depends on the platform you are running on. If you are running only on NVLink/PCIe-connected GPUs, blocking and non-blocking gets are currently implemented in exactly the same way: they translate to load instructions. To get completion of non-blocking gets, you call nvshmem_quiet, which is basically a __threadfence_system() on the GPU (in practice, with the current implementation, this call is redundant if you are only using non-blocking gets on P2P-connected GPUs).
    When the GPUs are connected via IB, non-blocking gets will certainly be faster, since they can be pipelined and then tracked for completion all at once using nvshmem_quiet (see the first sketch after this list).

  2. Again, it depends on how the GPUs are connected. When GPUs are connected via NVLink, fine-grained transfers are efficient (especially for accesses to contiguous addresses across a warp, which the hardware can coalesce) and can be used to overlap computation with communication. But when the GPUs are connected via IB, fine-grained transfers are not efficient, and bulk transfers are recommended (see the second sketch after this list).

  3. GET on IB should have higher latency because of the round trip. Bandwidth-wise on IB, get and put should give you the same bandwidth. For NVLink/PCIe, put and get translate to store and load instructions. You can compare their performance using the CUDA samples or the NVSHMEM perftests.
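To make point 1 concrete, here is a minimal device-side sketch (names and sizes are mine, purely illustrative): it issues a series of non-blocking gets and completes them all with a single nvshmem_quiet. On P2P-connected GPUs both the blocking and non-blocking variants compile down to loads; over IB, this pipelined shape is where the non-blocking variant pays off.

```cpp
#include <nvshmem.h>

// Pull nchunks chunks from a peer with non-blocking gets, then wait for
// all of them at once. local and remote are symmetric-heap pointers.
__global__ void pull_chunks(double *local, const double *remote,
                            size_t chunk, int nchunks, int peer) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int i = 0; i < nchunks; ++i)
            nvshmem_double_get_nbi(local + i * chunk,    // issue, don't wait
                                   remote + i * chunk, chunk, peer);
        nvshmem_quiet();  // one completion point for all outstanding gets
    }
}
```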
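For points 2 and 3, here are the two transfer shapes side by side, again only a sketch with illustrative names. Swapping the gets for nvshmem_double_put / nvshmem_double_p gives the store-side analogue for the PUT-vs-GET comparison.

```cpp
#include <nvshmem.h>

// Bulk shape: one contiguous transfer; recommended when PEs talk over IB.
__global__ void bulk_get(double *local, const double *remote,
                         size_t n, int peer) {
    if (blockIdx.x == 0 && threadIdx.x == 0)
        nvshmem_double_get(local, remote, n, peer);
}

// Fine-grained shape: one element per thread. Over NVLink/PCIe these become
// loads; consecutive addresses across a warp can be coalesced by hardware,
// and the loop is easy to interleave with computation. Over IB this pattern
// performs poorly.
__global__ void fine_get(double *local, const double *remote,
                         size_t n, int peer) {
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        local[i] = nvshmem_double_g(&remote[i], peer);
}
```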