Benchmarking GPUDirect RDMA on Modern Server Platforms

Originally published at: https://developer.nvidia.com/blog/benchmarking-gpudirect-rdma-on-modern-server-platforms/

NVIDIA GPUDirect RDMA is a technology that enables a direct path for data exchange between the GPU and third-party peer devices using standard features of PCI Express. Examples of third-party devices include network interfaces, video acquisition devices, storage adapters, and medical equipment. Enabled on Tesla and Quadro-class GPUs, GPUDirect RDMA relies on the ability of NVIDIA…

It's very interesting how poor the bandwidth is across the QPI. Do you see similarly poor performance with a CPU-to-CPU transfer that crosses the QPI to memory on the remote socket?

That is a well-tested data path: reading from a peripheral, across the QPI, into the host memory of the other socket. I would be surprised if there were a severe problem there.

Anyway, there is a visible NUMA effect on the bandwidth below 4KB, irrespective of whether the remote node destination is GPU or host memory.
On the PLX architecture (N.1), at 1KB the bandwidth drops from 8.9 GB/s (numactl --cpunodebind=0 --localalloc) to 5.6 GB/s (numactl --cpunodebind=1 --localalloc).
On architecture N.2, it drops from 8.2 GB/s to 5.9 GB/s.

Beyond 4KB there is no visible effect on bandwidth.
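
For anyone who wants to reproduce the comparison, a minimal sketch of the pinned runs (assuming a perftest-style ib_write_bw, discussed further below; -s sets the message size in bytes, and the hostname is a placeholder):

# server side, pinned to the socket local to the NIC
numactl --cpunodebind=0 --localalloc ib_write_bw -s 1024
# client side on the other host; repeat with --cpunodebind=1 to expose the NUMA effect
numactl --cpunodebind=0 --localalloc ib_write_bw -s 1024 server_host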

Does this mean that if we want to do GPU-to-GPU transfers, and we can't guarantee that they won't cross the QPI, then we are better off staging the transfer via the CPUs on either end?

I'm thinking of big MPI applications running on typical cluster compute nodes: dual-socket, with 1 or 2 GPUs attached to each socket and a single IB NIC attached to one of them.

Do you have any idea why the PLX switch performs much better than the Ivy Bridge's built-in PCIe switch?

Did you mean: --cpunodebind=1 ?

Correct! I edited my comment above. Thank you.

We briefly discussed this topic off-line. For the record, it is the MPI implementation's responsibility to use the best-performing data path, possibly taking all architectural constraints into account.
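
For example, with Open MPI one can at least check whether the GPUDirect RDMA path is available to the library at all (the same query that appears further down in this thread):

ompi_info --all | grep btl_openib_have_driver_gdr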

Sorry if this is a stupid question: if I use RDMA and CUDA, do I see remote GPUs locally in my code?

I am trying to reproduce your test but I can't set it up correctly. Can you post your test code, or explain in detail how to set up your test environment?

Really helpful benchmark!

A quick question: I have two servers with RDMA networking, but server1 has no GPU installed. Can I enable GPUDirect from server1's host memory to server2's GPU? If yes, what drivers should be installed on server1? Could you share your modified ibv_ud_pingpong and ib_write_bw so I can make sure it is GPU memory rather than host memory? Thanks a lot.

Hi, can you share the source code of the modified test programs? I want to benchmark our clusters.

Siyuan,
in the meantime, CUDA support has been added to OpenFabrics perftest.
You can try the following sequence:
git clone git://git.openfabrics.org/~...
cd perftest
./autogen.sh
export CUDA_H_PATH=/usr/local/cuda-8.0/include/cuda.h
./configure
make
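
To confirm that CUDA support actually made it into the build, a quick sanity check is to look for the option in the help output (a sketch; the exact usage text may differ between perftest versions):

./ib_write_bw --help | grep -i cuda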

If you experience problems with the upstream version, you might also try https://github.com/drossett....

With a CUDA-enabled build, you can use the --use_cuda option of ib_write_bw to pick the GPU with device id 0. You can also use the CUDA_VISIBLE_DEVICES environment variable to pick a different GPU.
For example:
# on host_a
ib_write_bw -n 1000 -a --use_cuda
# on host_b
ib_write_bw -n 1000 -a --use_cuda host_a
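
To target a different GPU, for example the second one, you can combine this with CUDA_VISIBLE_DEVICES (a standard CUDA environment variable; the device id here is illustrative):

# on host_b, run against the GPU with device id 1
CUDA_VISIBLE_DEVICES=1 ib_write_bw -n 1000 -a --use_cuda host_a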

Please note that the perftest latency test does not work with GPU memory. In my case, I modified the ibv_ud_pingpong test that comes with libibverbs. I apologize, but I do not have an easy way to share that code.

Zhao,
Please see my reply to Siyuan.

Once compiled with CUDA support, ib_write_bw requires libcuda.so at run time. If server1 has no GPU, you might not be able to install the NVIDIA driver package at all. That package contains libcuda.so, so on server1 you might need to run a non-CUDA enabled version of ib_write_bw.
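
A sketch of what that mixed setup might look like, assuming --use_cuda only changes where the local buffer is allocated, so the two sides do not need matching builds (hostnames are placeholders):

# on server1 (no GPU): host-memory buffer, non-CUDA build
ib_write_bw -n 1000 -a
# on server2 (GPU installed): GPU-memory buffer, CUDA-enabled build
ib_write_bw -n 1000 -a --use_cuda server1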

Charlie,
Please see my reply to Siyuan.

Hi Mrez,
Unfortunately, it is not that simple.
You might want to have a look at the rCUDA project (http://www.rcuda.net/index.....

Thanks for your quick reply. I will try the perftest. If there are problems I cannot fix, I will come back to you. :)

Hi,
I tried the proposed sequence (git clone git://git.openfabrics.org/~..., cd perftest, ./autogen.sh ...) but at the end I got the following error:

./ib_read_bw -s 1024 --use_cuda:

initializing CUDA
There are 3 devices supporting CUDA, picking first...
[pid = 52821, dev = 0] device name = [Tesla K20Xm]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 65536 bytes GPU buffer
allocated GPU buffer address at 0000002305440000 pointer=0x2305440000
Couldn't allocate MR
failed to create mr
Failed to create MR

Without CUDA the test runs correctly.
Do you have any idea what could be wrong?

(Cuda 8.0.61, nvidia 375.66, kernel 2.6.32-696.3.2.el6.x86_64)

Hi balazs, have you solved this problem? I have come up against the same problem and have no idea.

I have a 'solution' for this issue. I got false for this query:
ompi_info --all | grep btl_openib_have_driver_gdr
I asked for a 'GPU RDMA driver' update, and then I was able to use the test (the query result was true). But the result was unexpected: I got very poor send performance, 0.8 GB/s (versus 6 GB/s for host-to-host).
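
For anyone hitting the same 'Couldn't allocate MR' failure, one quick check is whether the peer-memory kernel module is loaded at all (assuming Mellanox OFED's nv_peer_mem module is what provides GPUDirect RDMA support on your system):

lsmod | grep nv_peer_mem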