NVSHMEM applicability in a setup with two PCIe-connected GPUs

Hi all,

I am currently working on a project related to NVSHMEM. My current hardware setup is two Titan Xp GPUs without SLI, installed in the same host via PCIe 3.0 x16. I am not sure whether this is a valid setup for using NVSHMEM for single-machine, cross-GPU communication.

Thanks!

NVSHMEM is supported only on data-center-class GPUs of the Volta architecture and later, primarily due to its reliance on GPU features like independent thread scheduling and communication technologies like P2P and RDMA: https://docs.nvidia.com/hpc-sdk/nvshmem/pdf/NVSHMEM-Installation-Guide.pdf.

Titan Xp, which is based on Pascal, does support CUDA P2P over PCIe if the two GPUs are attached to the same CPU socket (you can test this using p2pBandwidthLatencyTest from the CUDA samples). With that in place, you can run simple NVSHMEM examples (like shmem_p_bw under perftest) that use the p/g/put/get APIs (including the on_stream and device variants). But performance will likely be quite poor, and more complex applications can end up in deadlocks.
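If you want to sanity-check that prerequisite from your own code first, the short host-side sketch below does the same peer-access query that p2pBandwidthLatencyTest performs (error handling omitted for brevity; device ordinals 0 and 1 are assumed):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: verify CUDA P2P is available between GPU 0 and GPU 1
// before attempting NVSHMEM on PCIe-connected cards.
int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);  // can device 0 access device 1?
    cudaDeviceCanAccessPeer(&can10, 1, 0);  // and the reverse direction?
    printf("P2P 0->1: %d, 1->0: %d\n", can01, can10);

    if (can01 && can10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // enable 0 -> 1
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);   // enable 1 -> 0
        printf("Peer access enabled in both directions\n");
    }
    return 0;
}
```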

Thanks for your reply.

When I run my NVSHMEM application on a DGX (4 Tesla V100 16GB), it shows me a message like

src/comm/transports/ibrc/ibrc.cpp: NULL value get_device_list failed

From the official Q&A, it says:

A: This occurs when ibverbs library is present on the system but the library is not able to detect any InfiniBand devices on the system. Make sure that the InfiniBand devices are available and/or are in a working state.

But the program seems to run successfully even though it shows this message. I am only using the 4 GPUs on the same CPU host.

Will this impact the performance?

Thanks

If the GPUs are P2P/PCIe connected, you don’t need IB, and the warning won’t impact performance.
To get rid of the warning, you can set the runtime environment variable NVSHMEM_REMOTE_TRANSPORT="none" if you are using NVSHMEM 2.1.2.
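If it is more convenient than exporting the variable in the shell, it can also be set from the program itself. This is just a minimal sketch on my part (not from the docs), relying only on the fact that NVSHMEM reads its environment variables when it initializes:

```cpp
#include <cstdlib>
#include <nvshmem.h>

int main() {
    // Equivalent to: export NVSHMEM_REMOTE_TRANSPORT=none
    // Must be set before nvshmem_init so the runtime picks it up.
    setenv("NVSHMEM_REMOTE_TRANSPORT", "none", 1);
    nvshmem_init();
    // ... application work on the P2P-connected GPUs ...
    nvshmem_finalize();
    return 0;
}
```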

Thanks for your helpful suggestion.

I have a question about the overhead of the blocking GET versus the non-blocking GET. A blocking GET basically has two phases: one to send the request and a second to wait for the results to come back. With a non-blocking GET, we still need to use quiet or barrier for synchronization if we need those non-blocking GETs to finish.
Are there any statistics on the latency in cycles for each of these blocking/non-blocking GET APIs?

If I transfer a batch of data (like a matrix) each time instead of a single vector, will the transfer performance of NVSHMEM improve? I think there might be a trade-off between transfer performance (throughput) and the overlap of computation and communication (i.e., coarse-grained bulk transfers would benefit throughput, while fine-grained transfers would improve the overlap of computation and communication).
Does this claim make sense?

Also, comparing GET with PUT, does PUT come with lower overhead than GET?

Thanks a lot for your great help!

Hi Wong,

  1. The answer depends on the platform you are running on. If you are running only on NVLink/PCIe-connected GPUs, blocking and non-blocking gets are currently implemented in exactly the same way: they translate to load instructions. To get completion of non-blocking gets, you call nvshmem_quiet, which is basically a __threadfence_system() on the GPU (in practice, with the current implementation, this call is redundant if you are only using non-blocking gets on P2P-connected GPUs).
    When the GPUs are connected via IB, non-blocking gets will certainly be faster, since they can be pipelined and then tracked for completion all at once using nvshmem_quiet (see the first sketch after this list).

  2. Again, it depends on how the GPUs are connected. When GPUs are connected via NVLink, fine-grained transfers are efficient (especially for accesses to contiguous addresses across a warp, which the hardware can coalesce) and can be used to overlap computation with communication. But when the GPUs are connected via IB, fine-grained transfers are not efficient, and bulk transfers are recommended (see the second sketch after this list).

  3. GET on IB should have higher latency because of the round trip. Bandwidth-wise on IB, get and put should give you the same bandwidth. For NVLink/PCIe, put and get translate to store and load instructions. You can compare their performance using the CUDA samples or the NVSHMEM perftests.
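To make point 1 concrete, here is a minimal device-side sketch (names and sizes are mine, purely illustrative): it issues a series of non-blocking gets and completes them all with a single nvshmem_quiet. On P2P-connected GPUs both the blocking and non-blocking variants compile down to loads; over IB, this pipelined shape is where the non-blocking variant pays off.

```cpp
#include <nvshmem.h>

// Pull nchunks chunks from a peer with non-blocking gets, then wait for
// all of them at once. local and remote are symmetric-heap pointers.
__global__ void pull_chunks(double *local, const double *remote,
                            size_t chunk, int nchunks, int peer) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int i = 0; i < nchunks; ++i)
            nvshmem_double_get_nbi(local + i * chunk,    // issue, don't wait
                                   remote + i * chunk, chunk, peer);
        nvshmem_quiet();  // one completion point for all outstanding gets
    }
}
```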
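For points 2 and 3, here are the two transfer shapes side by side, again only a sketch with illustrative names. Swapping the gets for nvshmem_double_put / nvshmem_double_p gives the store-side analogue for the PUT-vs-GET comparison.

```cpp
#include <nvshmem.h>

// Bulk shape: one contiguous transfer; recommended when PEs talk over IB.
__global__ void bulk_get(double *local, const double *remote,
                         size_t n, int peer) {
    if (blockIdx.x == 0 && threadIdx.x == 0)
        nvshmem_double_get(local, remote, n, peer);
}

// Fine-grained shape: one element per thread. Over NVLink/PCIe these become
// loads; consecutive addresses across a warp can be coalesced by hardware,
// and the loop is easy to interleave with computation. Over IB this pattern
// performs poorly.
__global__ void fine_get(double *local, const double *remote,
                         size_t n, int peer) {
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        local[i] = nvshmem_double_g(&remote[i], peer);
}
```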