Benchmarking GPUDirect RDMA on Modern Server Platforms

Originally published at: https://developer.nvidia.com/blog/benchmarking-gpudirect-rdma-on-modern-server-platforms/

NVIDIA GPUDirect RDMA is a technology that enables a direct path for data exchange between the GPU and third-party peer devices using standard features of PCI Express. Examples of third-party devices include network interfaces, video acquisition devices, storage adapters, and medical equipment. Enabled on Tesla and Quadro-class GPUs, GPUDirect RDMA relies on the ability of NVIDIA…

It's very interesting how poor the bandwidth is across the QPI. Do you see similarly poor performance with a CPU-to-CPU transfer that crosses the QPI to memory on the remote socket?

That is a well-tested data path: reading from a peripheral, across the QPI, into the host memory of the other socket. I would be surprised if there were a severe problem there.

Anyway, there is a visible NUMA effect on the bandwidth below 4KB, irrespective of whether the remote node destination is GPU or host memory.
On the PLX architecture (N.1), at 1KB the bandwidth drops from 8.9 GB/s (numactl --cpunodebind=0 --localalloc) to 5.6 GB/s (numactl --cpunodebind=1 --localalloc).
On architecture N.2, it drops from 8.2 GB/s to 5.9 GB/s.

Beyond 4KB there is no visible effect on bandwidth.
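
For anyone who wants to reproduce the comparison, a minimal sketch of the pinned runs (assuming a perftest-style ib_write_bw, discussed further below; -s sets the message size in bytes, and the hostname is a placeholder):

# server side, pinned to the socket local to the NIC
numactl --cpunodebind=0 --localalloc ib_write_bw -s 1024
# client side on the other host; repeat with --cpunodebind=1 to expose the NUMA effect
numactl --cpunodebind=0 --localalloc ib_write_bw -s 1024 server_host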

Does this mean that if we want to do GPU-to-GPU transfers, and we can't guarantee that they won't cross the QPI, then we are better off staging the transfer via the CPUs on either end?

I'm thinking of big MPI applications running on typical cluster compute nodes: dual-socket, with 1 or 2 GPUs attached to each socket and a single IB NIC attached to one of them.

Do you have any idea why the PLX switch performs much better than the Ivy Bridge's built-in PCIe switch?

Did you mean: --cpunodebind=1 ?

Correct! I edited my comment above. Thank you.

We briefly discussed this topic off-line. For the record, it is the MPI implementation's responsibility to use the best-performing data path, possibly taking all architectural constraints into account.
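
For example, with Open MPI one can at least check whether the GPUDirect RDMA path is available to the library at all (the same query that appears further down in this thread):

ompi_info --all | grep btl_openib_have_driver_gdr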

Sorry if this is a stupid question: if I use RDMA and CUDA, do I see remote GPUs locally in my code?

I am trying to reproduce your test but I can't set it up correctly. Can you post your test code, or explain in detail how to set up your test environment?

Really helpful benchmark!

A quick question: I have two servers with RDMA networking, but server1 has no GPU installed. Can I enable GPUDirect from server1's host memory to server2's GPU? If yes, what drivers should be installed on server1? Could you share your modified ibv_ud_pingpong and ib_write_bw so I can make sure it is GPU memory rather than host memory? Thanks a lot.

Hi, can you share the source code of the modified test programs? I want to benchmark our clusters.

Siyuan,
in the meantime, CUDA support has been added to OpenFabrics perftest.
You can try the following sequence:
git clone git://git.openfabrics.org/~...
cd perftest
./autogen.sh
export CUDA_H_PATH=/usr/local/cuda-8.0/include/cuda.h
./configure
make
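
To confirm that CUDA support actually made it into the build, a quick sanity check is to look for the option in the help output (a sketch; the exact usage text may differ between perftest versions):

./ib_write_bw --help | grep -i cuda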

If you experience problems with the upstream version, you might also try https://github.com/drossett....

With a CUDA-enabled build, you can use the --use_cuda option of ib_write_bw to pick the GPU with device id 0. You can also use the CUDA_VISIBLE_DEVICES environment variable to pick a different GPU.
For example:
# on host_a
ib_write_bw -n 1000 -a --use_cuda
# on host_b
ib_write_bw -n 1000 -a --use_cuda host_a
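
To target a different GPU, for example the second one, you can combine this with CUDA_VISIBLE_DEVICES (a standard CUDA environment variable; the device id here is illustrative):

# on host_b, run against the GPU with device id 1
CUDA_VISIBLE_DEVICES=1 ib_write_bw -n 1000 -a --use_cuda host_a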

Please note that the perftest latency test does not work with GPU memory. In my case, I modified the ibv_ud_pingpong test that comes with libibverbs. I apologize, but I do not have an easy way to share that code.

Zhao,
Please see my reply to Siyuan.

Once compiled with CUDA support, ib_write_bw requires libcuda.so at run time. If server1 has no GPU, you might not be able to install the NVIDIA driver package at all. That package contains libcuda.so, so on server1 you might need to run a non-CUDA enabled version of ib_write_bw.
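
A sketch of what that mixed setup might look like, assuming --use_cuda only changes where the local buffer is allocated, so the two sides do not need matching builds (hostnames are placeholders):

# on server1 (no GPU): host-memory buffer, non-CUDA build
ib_write_bw -n 1000 -a
# on server2 (GPU installed): GPU-memory buffer, CUDA-enabled build
ib_write_bw -n 1000 -a --use_cuda server1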

Charlie,
Please see my reply to Siyuan.

Hi Mrez,
Unfortunately, it is not that simple.
You might want to have a look at the rCUDA project (http://www.rcuda.net/index.....

Thanks for your quick reply. I will try the perftest. If there are problems I cannot fix, I will come back to you. :)

Hi,
I tried the proposed sequence (git clone git://git.openfabrics.org/~..., cd perftest, ./autogen.sh ...) but at the end I got the following error:

./ib_read_bw -s 1024 --use_cuda:

initializing CUDA
There are 3 devices supporting CUDA, picking first...
[pid = 52821, dev = 0] device name = [Tesla K20Xm]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 65536 bytes GPU buffer
allocated GPU buffer address at 0000002305440000 pointer=0x2305440000
Couldn't allocate MR
failed to create mr
Failed to create MR

Without CUDA the test runs correctly.
Do you have any idea what could be wrong?

(Cuda 8.0.61, nvidia 375.66, kernel 2.6.32-696.3.2.el6.x86_64)

Hi balazs, have you solved this problem? I have come up against the same problem and have no idea.

I have a 'solution' for this issue. I got false for this query:
ompi_info --all | grep btl_openib_have_driver_gdr
I asked for a 'GPU RDMA driver' update, and then I was able to use the test (the query result was true). But the result was unexpected: I got very poor send performance, 0.8 GB/s (versus 6 GB/s for host-to-host).
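
For anyone hitting the same 'Couldn't allocate MR' failure, one quick check is whether the peer-memory kernel module is loaded at all (assuming Mellanox OFED's nv_peer_mem module is what provides GPUDirect RDMA support on your system):

lsmod | grep nv_peer_mem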