cuda 4.0rc2 cudaMemcpyPeer(Async) performance issues

We have been investigating performance issues using GPU direct from latest release of cuda4.0rc2 since last Friday. We have four C2050 installed on a Linux box running Fedoda 13 (x86_64) with kernel The machine has 4 PCI-E 2.0 slots with dual quad-core Intel E5630 running 2.53GHz, and it is equipped with 48 GByte 1066 MHz host memory. The NVIDIA driver is 270.40. What we have been doing is to measure data transfer bandwidth using GPU direct from one GPU to another and total data transfer bandwidth for 4 GPUs in a 1-dimensional ring. The test code is rather simple. We use cudaMemcpyPeer or cudaMemcpyPeerAsync to transfer data from one GPU to another. The timing values are obtained using cudaEventRecord and cudaEventEventElapsedTime after cudaEventSynchronize. The test code runs well, but we are puzzled by the results. The GPU direct data transfer bandwidth from one GPU to another is 3.5 GB/s which is very similar to the data transfer bandwidth between two GPUs using MPI. Unfortunately, the total GPU direct data transfer bandwidth for 1-D ring (GPU 1 --> GPU 2 --> GPU3 --> GPU4 -->GPU1) is around 5 GB/s which is much smaller than the aggregated data transfer bandwidth 8 GB/s using host memory (via MPI).

In addition, we also tried using GPU kernels to access remote GPU memory address directly. According to cuda4.0 document “CUDA_C_Programming_Guide” section “Peer-to-Peer Memory Access”, one is able to access device memory on remote GPUs from a GPU kernel running on a different GPU. However, when we tried to use this feature by accessing the remote memory address of a float array on another GPU from a GPU kernel, we had a kernel launch failure error.

Has anyone observed the similar poor performance number using GPU direct? Has anyone tried to launch a GPU kernel to access remote GPU memory address with success? Thank you.

Peer to peer works well for me, with good performance.

I’ve used it on Teslas and GTX580s (thank you guys for enabling this feature even on geforce).

I’ve not tried cudaMemcpyPeer, instead I’ve used either the regular, now generic cudaMemcpy, and I’ve also used pointers inside kernels.

Make sure you explicitly enable the peer to peer transfers between the GPUs that need to communicate, by default it’s disabled.

IIRC pulling data from another GPU is slightly faster than pushing data into another GPU.

With 2 GPUs I got ~5 GB/s one direction and ~10 GB/s both directions (each GPU reading data from the other one).

The doc is a bit confusing because it does not always assume that you have UVA.

Assuming your GPUs support UVA, just enable P2P and go to town !

We did check UVA is on for all GPUs by doing cudaGetDeviceProperties. Also we use cudaCanAccessPeer call to make sure a GPU can access other GPUs before we do anything. But we did not use cudaEnablePeerAccess since cudaCanAccessPeer all return 1. I just added cudaEnablePeerAccess and my linux box is crashed. What Linux distribution and kernel version are you using? Thank you.

You definitely need to call cudaEnablePeerAccess, cudaCanAccessPeer just tells you if it’s supported, that’s how I understand it.

Regarding your crash, I can reproduce the problem.

I have one machine with 2xGTX580 running OpenSuse 11.3 and it works just fine.

I tried on 3 machines, each has 4xGTX580 running RH5.5, and they all crashed immediately.

I can run on those 4-GPU machines if I only use 2 GPUs.

I’ve not tried 3, too many kernel panics for today :-)

So there seems to be an issue with peer2peer and 4 GPUs. Maybe it’s linked to the distribution, redhat and fedora are pretty similar.

I can’t put 4 GPUs in my OpenSuse box though.

The problem is related to the PCI-e topology. The GPUs need to be under the same PCI-e root domain ( otherwise you can expect slowdowns or crashes).
You need to be very careful with dual IOH motherboards.

You can check it with “/sbin/lspci -tv”.

This configuration will work:
/sbin/lspci -tv
| ±01.0-[0000:82]–
| ±03.0-[0000:83]–
| ±07.0-[0000:84]–
-[0000:00]-±00.0 Intel Corporation 5520 I/O Hub to ESI Port
±03.0-[0000:02]–±00.0 nVidia Corporation Unknown device 06c0
| -00.1 nVidia Corporation Unknown device 0be5
±07.0-[0000:03]–±00.0 nVidia Corporation Unknown device 06d1
| -00.1 nVidia Corporation Unknown device 0be5

This one will fail.
/sbin/lspci -tv
| ±01.0-[0000:82]–
| ±03.0-[0000:83]–
| ±07.0-[0000:84]–±00.0 nVidia Corporation Unknown device 06c0
| | -00.1 nVidia Corporation Unknown device 0be5
-[0000:00]-±00.0 Intel Corporation 5520 I/O Hub to ESI Port
±07.0-[0000:03]–±00.0 nVidia Corporation Unknown device 06d1
| -00.1 nVidia Corporation Unknown device 0be5

P2P across IOHs will be disabled for 4.0 final.

Thank you for the clarification. What about peer access across IOHs beyond version 4?

Indeed our configuration has 4 GPUs with 2 IOHs. Thank you very much.

You could play with CUDA_VISIBLE_DEVICES to pick up two cards that are on the same IOH.

Something similar to:

Probably not.

I am with a group that has been doing medical imaging and radiation oncology research with GPUs for a couple years now. We want to buy a new Mult-GPU workstation (and perhaps grow it into a cluster) because some of our codes are still too slow to be practical in a clinical environment when running on only 1 GPU. Peer-to-peer communication is vital to these algorithms as relatively large amounts of data must be exchanged during run time.

Now that you know I’m serious, I have a few questions:

When tmurray says:

What exactly will be disabled? P2P memory copies or P2P memory access within kernels, or both?

Will P2P still be supported between GPUs connected to the same IOH on a machine with dual IOHs?

If So:

Will a process or thread be able to tell which pair of GPUs are on the same IOH at runtime? Is it important that the thread or process be running on the CPU that is connected directly to the IOH said GPUs are connected to? Is there any way to assure this will happen?

I also have another, somewhat related, question regarding this quote in the press release:

How does one go about doing this (code example)? Does this data transfer happen without any host staging (meaning a GPU on one node sends data over Infiniband directly to GPU memory on another node)? -> Found answer here:

I looked at the code and was wondering why this is only supported for Tesla cards? It seems as though the only deference is now the Infiniband card can read directly from the pinned memory allocated by CUDA; how does the GPU get involved with the MPI send/receive? Will Tesla restriction be lifted (or has it already with rc2)?

Will P2P support for GeForce Fermi cards (recently added for 4.0rc2) be dropped with 4.0 final?

These are very important questions I need answered before we spend big bucks. I cant find this information anywhere else so THANK YOU NVIDIA PEOPLE in advance! :thanks:


Some additional info on your questions about GPUDirect P2P support…

GPUDirect P2P will continue to work on Fermi-based GeForce, Tesla and Quadro GPUs in the CUDA 4.0 production release.

NVIDIA GPUs are designed to take full advantage of the PCIe P2P specification. Unfortunately, current Intel I/O Hub (IOH) chipsets do not fully support PCIe P2P communictation so we need to disable P2P communication bewteen GPUs that are connected to different Intel IOH chipsets. This limtation affects both P2P transfers (e.g. memory copies initiated by cudaMemcpy in your host code) and P2P memory accesses within kernels.

P2P communication will continue to be supported where possible, for example:

  • between GPUs connected to the same PCIe root complex (e.g Intel IOH or AMD chipset)

  • between GPUs connected to different AMD chipsets that communicate via HyperTransport

Additional details on this limitation in current Intel IOH chipsets is documented in Table 7-4 of the IOH Datasheet at:

The good news is that the code path for GPUDirect P2P Transfers (e.g. using cudaMemcpy) is always supported. If the P2P path between two particular GPUs is not supported because they are connected to two different Intel IOHs, cudaMemcpy automatically uses a Device-to-Host-to-Device fallpath path for you.

We’re adding more details on P2P communication to the CUDA Programming Guide and SDK code samples for the final release, so this should all be much easier for developers to understand and use.

BTW, if you haven’t already, please take a few minutes to fill out the CUDA 4.0 Feedback Survey and share your experience so far.