Hi all,
I am currently working on a project related to NVSHMEM. My current hardware setup is two Titan Xp GPUs without SLI, installed on the same host over PCIe 3.0 x16. I am not sure whether this is a valid setup for using NVSHMEM for single-machine, cross-GPU communication.
Thanks!
NVSHMEM is supported only on data center class GPUs of Volta architecture and later, primarily due to its reliance on GPU features like independent thread scheduling and communication technologies like P2P and RDMA: https://docs.nvidia.com/hpc-sdk/nvshmem/pdf/NVSHMEM-Installation-Guide.pdf.
Titan Xp, which is based on Pascal, does support CUDA P2P over PCIe if the two GPUs are attached to the same CPU socket (you can verify this with p2pBandwidthLatencyTest from the CUDA samples). With that you can try simple NVSHMEM examples (like shmem_p_bw under perftest) that use the p/g/put/get APIs (on_stream and device variants). But performance will likely be quite poor, and more complex apps can end up in deadlocks.
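If you want a quick sanity check beyond the perftest binaries, here is a minimal sketch along the lines of the simple shift/ring example from the NVSHMEM docs. It assumes a single node with one process per GPU (so the PE id can be used as the device id) and would be launched with something like nvshmemrun -np 2 ./a.out:

```cpp
#include <cstdio>
#include <nvshmem.h>
#include <nvshmemx.h>

// Each PE writes its own PE id into the symmetric buffer of the next PE.
__global__ void simple_shift(int *destination) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(destination, mype, peer);  // device-side fine-grained put
}

int main() {
    nvshmem_init();
    // Assumption: single node, one process per GPU, so PE id == device id.
    cudaSetDevice(nvshmem_my_pe());

    int *destination = (int *)nvshmem_malloc(sizeof(int));  // symmetric allocation

    simple_shift<<<1, 1>>>(destination);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();  // ensure the peer's put into our buffer has completed

    int msg = -1;
    cudaMemcpy(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", nvshmem_my_pe(), msg);

    nvshmem_free(destination);
    nvshmem_finalize();
    return 0;
}
```

Build it with nvcc and link against the NVSHMEM library; whether it actually runs on your Pascal setup is not guaranteed, given the support matrix above.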
Thanks for your reply.
When I run my NVSHMEM application on a DGX (4 Tesla V100 16GB), it shows me a message like
src/comm/transports/ibrc/ibrc.cpp: NULL value get_device_list failed
The official Q&A says:
A: This occurs when ibverbs library is present on the system but the library is not able to detect any InfiniBand devices on the system. Make sure that the InfiniBand devices are available and/or are in a working state.
But the program seems to run successfully even though it prints this message. I am only using 4 GPUs on the same CPU host.
Will this impact the performance?
Thanks
Yes, if the GPUs are P2P/PCIe connected, you don't need IB, and it won't impact performance.
To get rid of the warning, you can set the runtime environment variable NVSHMEM_REMOTE_TRANSPORT="none" if you are using NVSHMEM 2.1.2.
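Since NVSHMEM reads its NVSHMEM_* environment variables during initialization, you can either export NVSHMEM_REMOTE_TRANSPORT=none in the launch environment or set it programmatically before nvshmem_init(). A minimal sketch of the latter:

```cpp
#include <stdlib.h>
#include <nvshmem.h>

int main() {
    // Must be set before nvshmem_init(), which is when NVSHMEM parses its
    // environment variables. Equivalent to exporting it in the shell.
    setenv("NVSHMEM_REMOTE_TRANSPORT", "none", 1);
    nvshmem_init();

    /* ... application ... */

    nvshmem_finalize();
    return 0;
}
```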
Thanks for your helpful suggestion.
I have a question about the overhead of blocking GET versus non-blocking GET. A blocking GET basically has two phases: one to send the request and a second to wait for the result to come back. With non-blocking GET, we still need to use quiet or barrier for synchronization if we need the outstanding GETs to finish.
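To make sure I understand the semantics, here is a small device-side sketch of what I mean (local_a, local_b, remote, peer, and N are just placeholders; I am using the typed float variants, but I assume the untyped getmem forms behave the same way):

```cpp
#include <nvshmem.h>
#include <nvshmemx.h>

#define N 1024  // placeholder element count

__global__ void get_example(float *local_a, float *local_b,
                            const float *remote, int peer) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Blocking get: returns only after all N elements have arrived in local_a.
        nvshmem_float_get(local_a, remote, N, peer);

        // Non-blocking get: returns immediately; local_b is NOT yet valid here.
        nvshmem_float_get_nbi(local_b, remote, N, peer);

        // ... independent work can overlap with the outstanding transfer ...

        // quiet waits for completion of all outstanding non-blocking operations
        // issued by this PE; a barrier would also imply this, plus synchronizing
        // with the other PEs.
        nvshmem_quiet();
    }
}
```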
Are there any statistics on the latency in cycles for each of these blocking/non-blocking GET APIs?
If I transfer a batch of data at a time (like a matrix) instead of a single vector, will NVSHMEM's transfer performance improve? I think there might be a tradeoff between transfer performance (throughput) and the overlap of computation and communication: coarse-grained, bulky transfers should benefit throughput, while fine-grained transfers should improve the overlap of computation and communication.
Does this claim make sense?
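To make the tradeoff concrete, here is roughly what I mean by the two patterns (host-side sketch; matrix, local, chunk, and compute_rows are placeholder names, not NVSHMEM APIs):

```cpp
#include <nvshmem.h>

// Placeholder compute kernel: processes `rows` rows starting at `first_row`.
__global__ void compute_rows(float *data, size_t first_row, size_t rows, size_t n) {
    // ... per-row computation ...
}

// Coarse-grained: one big get. Best throughput per byte, but computation
// cannot start until the whole matrix has arrived.
void bulk_version(float *local, const float *matrix, size_t n, int peer,
                  cudaStream_t stream) {
    nvshmem_float_get(local, matrix, n * n, peer);  // blocks until all data is local
    compute_rows<<<128, 128, 0, stream>>>(local, 0, n, n);
    cudaStreamSynchronize(stream);
}

// Fine-grained: smaller gets with more per-message overhead, but the compute
// on chunk i (async on the stream) can overlap with fetching chunk i+1 on the host.
void chunked_version(float *local, const float *matrix, size_t n, size_t chunk,
                     int peer, cudaStream_t stream) {
    for (size_t row = 0; row < n; row += chunk) {
        nvshmem_float_get(local + row * n, matrix + row * n, chunk * n, peer);
        compute_rows<<<128, 128, 0, stream>>>(local, row, chunk, n);
    }
    cudaStreamSynchronize(stream);
}
```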
Also, comparing GET with PUT, does PUT come with lower overhead than GET?
Thanks a lot for your great help!