cuda 4.0rc2 cudaMemcpyPeer(Async) performance issues

jiechenva · April 12, 2011, 5:52pm

We have been investigating performance issues using GPU direct from latest release of cuda4.0rc2 since last Friday. We have four C2050 installed on a Linux box running Fedoda 13 (x86_64) with kernel 2.6.38.2. The machine has 4 PCI-E 2.0 slots with dual quad-core Intel E5630 running 2.53GHz, and it is equipped with 48 GByte 1066 MHz host memory. The NVIDIA driver is 270.40. What we have been doing is to measure data transfer bandwidth using GPU direct from one GPU to another and total data transfer bandwidth for 4 GPUs in a 1-dimensional ring. The test code is rather simple. We use cudaMemcpyPeer or cudaMemcpyPeerAsync to transfer data from one GPU to another. The timing values are obtained using cudaEventRecord and cudaEventEventElapsedTime after cudaEventSynchronize. The test code runs well, but we are puzzled by the results. The GPU direct data transfer bandwidth from one GPU to another is 3.5 GB/s which is very similar to the data transfer bandwidth between two GPUs using MPI. Unfortunately, the total GPU direct data transfer bandwidth for 1-D ring (GPU 1 → GPU 2 → GPU3 → GPU4 -->GPU1) is around 5 GB/s which is much smaller than the aggregated data transfer bandwidth 8 GB/s using host memory (via MPI).

In addition, we also tried using GPU kernels to access remote GPU memory address directly. According to cuda4.0 document “CUDA_C_Programming_Guide” section 3.2.6.4 “Peer-to-Peer Memory Access”, one is able to access device memory on remote GPUs from a GPU kernel running on a different GPU. However, when we tried to use this feature by accessing the remote memory address of a float array on another GPU from a GPU kernel, we had a kernel launch failure error.

Has anyone observed the similar poor performance number using GPU direct? Has anyone tried to launch a GPU kernel to access remote GPU memory address with success? Thank you.

thomasco · April 13, 2011, 4:45am

We have been investigating performance issues using GPU direct from latest release of cuda4.0rc2 since last Friday. We have four C2050 installed on a Linux box running Fedoda 13 (x86_64) with kernel 2.6.38.2. The machine has 4 PCI-E 2.0 slots with dual quad-core Intel E5630 running 2.53GHz, and it is equipped with 48 GByte 1066 MHz host memory. The NVIDIA driver is 270.40. What we have been doing is to measure data transfer bandwidth using GPU direct from one GPU to another and total data transfer bandwidth for 4 GPUs in a 1-dimensional ring. The test code is rather simple. We use cudaMemcpyPeer or cudaMemcpyPeerAsync to transfer data from one GPU to another. The timing values are obtained using cudaEventRecord and cudaEventEventElapsedTime after cudaEventSynchronize. The test code runs well, but we are puzzled by the results. The GPU direct data transfer bandwidth from one GPU to another is 3.5 GB/s which is very similar to the data transfer bandwidth between two GPUs using MPI. Unfortunately, the total GPU direct data transfer bandwidth for 1-D ring (GPU 1 → GPU 2 → GPU3 → GPU4 -->GPU1) is around 5 GB/s which is much smaller than the aggregated data transfer bandwidth 8 GB/s using host memory (via MPI).

In addition, we also tried using GPU kernels to access remote GPU memory address directly. According to cuda4.0 document “CUDA_C_Programming_Guide” section 3.2.6.4 “Peer-to-Peer Memory Access”, one is able to access device memory on remote GPUs from a GPU kernel running on a different GPU. However, when we tried to use this feature by accessing the remote memory address of a float array on another GPU from a GPU kernel, we had a kernel launch failure error.

Has anyone observed the similar poor performance number using GPU direct? Has anyone tried to launch a GPU kernel to access remote GPU memory address with success? Thank you.

Peer to peer works well for me, with good performance.

I’ve used it on Teslas and GTX580s (thank you guys for enabling this feature even on geforce).

I’ve not tried cudaMemcpyPeer, instead I’ve used either the regular, now generic cudaMemcpy, and I’ve also used pointers inside kernels.

Make sure you explicitly enable the peer to peer transfers between the GPUs that need to communicate, by default it’s disabled.

IIRC pulling data from another GPU is slightly faster than pushing data into another GPU.

With 2 GPUs I got ~5 GB/s one direction and ~10 GB/s both directions (each GPU reading data from the other one).

The doc is a bit confusing because it does not always assume that you have UVA.

Assuming your GPUs support UVA, just enable P2P and go to town !

jiechenva · April 13, 2011, 1:27pm

We did check UVA is on for all GPUs by doing cudaGetDeviceProperties. Also we use cudaCanAccessPeer call to make sure a GPU can access other GPUs before we do anything. But we did not use cudaEnablePeerAccess since cudaCanAccessPeer all return 1. I just added cudaEnablePeerAccess and my linux box is crashed. What Linux distribution and kernel version are you using? Thank you.

thomasco · April 13, 2011, 8:30pm

You definitely need to call cudaEnablePeerAccess, cudaCanAccessPeer just tells you if it’s supported, that’s how I understand it.

Regarding your crash, I can reproduce the problem.

I have one machine with 2xGTX580 running OpenSuse 11.3 and it works just fine.

I tried on 3 machines, each has 4xGTX580 running RH5.5, and they all crashed immediately.

I can run on those 4-GPU machines if I only use 2 GPUs.

I’ve not tried 3, too many kernel panics for today :-)

So there seems to be an issue with peer2peer and 4 GPUs. Maybe it’s linked to the distribution, redhat and fedora are pretty similar.

I can’t put 4 GPUs in my OpenSuse box though.

mfatica · April 13, 2011, 9:07pm

The problem is related to the PCI-e topology. The GPUs need to be under the same PCI-e root domain ( otherwise you can expect slowdowns or crashes).
You need to be very careful with dual IOH motherboards.

You can check it with “/sbin/lspci -tv”.

This configuration will work:
/sbin/lspci -tv
-±[0000:80]-±00.0-[0000:81]–
| ±01.0-[0000:82]–
| ±03.0-[0000:83]–
| ±07.0-[0000:84]–
-[0000:00]-±00.0 Intel Corporation 5520 I/O Hub to ESI Port
±01.0-[0000:01]–
±03.0-[0000:02]–±00.0 nVidia Corporation Unknown device 06c0
| -00.1 nVidia Corporation Unknown device 0be5
±07.0-[0000:03]–±00.0 nVidia Corporation Unknown device 06d1
| -00.1 nVidia Corporation Unknown device 0be5

This one will fail.
/sbin/lspci -tv
-±[0000:80]-±00.0-[0000:81]–
| ±01.0-[0000:82]–
| ±03.0-[0000:83]–
| ±07.0-[0000:84]–±00.0 nVidia Corporation Unknown device 06c0
| | -00.1 nVidia Corporation Unknown device 0be5
|
-[0000:00]-±00.0 Intel Corporation 5520 I/O Hub to ESI Port
±01.0-[0000:01]–
±03.0-[0000:02]–
±07.0-[0000:03]–±00.0 nVidia Corporation Unknown device 06d1
| -00.1 nVidia Corporation Unknown device 0be5

tmurray · April 13, 2011, 9:25pm

P2P across IOHs will be disabled for 4.0 final.

jiechenva · April 14, 2011, 5:20pm

Thank you for the clarification. What about peer access across IOHs beyond version 4?

jiechenva · April 14, 2011, 5:21pm

The problem is related to the PCI-e topology. The GPUs need to be under the same PCI-e root domain ( otherwise you can expect slowdowns or crashes).

You need to be very careful with dual IOH motherboards.

You can check it with “/sbin/lspci -tv”.

This configuration will work:

/sbin/lspci -tv

-±[0000:80]-±00.0-[0000:81]–

| ±01.0-[0000:82]–

| ±03.0-[0000:83]–

| ±07.0-[0000:84]–

-[0000:00]-±00.0 Intel Corporation 5520 I/O Hub to ESI Port
         +-01.0-[0000:01]--

         +-03.0-[0000:02]--+-00.0  nVidia Corporation Unknown device 06c0

         |                 \-00.1  nVidia Corporation Unknown device 0be5

         +-07.0-[0000:03]--+-00.0  nVidia Corporation Unknown device 06d1

         |                 \-00.1  nVidia Corporation Unknown device 0be5
This one will fail.

/sbin/lspci -tv

-±[0000:80]-±00.0-[0000:81]–

| ±01.0-[0000:82]–

| ±03.0-[0000:83]–

| ±07.0-[0000:84]–±00.0 nVidia Corporation Unknown device 06c0

| | -00.1 nVidia Corporation Unknown device 0be5

|

-[0000:00]-±00.0 Intel Corporation 5520 I/O Hub to ESI Port
        +-01.0-[0000:01]--

        +-03.0-[0000:02]--

        +-07.0-[0000:03]--+-00.0  nVidia Corporation Unknown device 06d1

        |                 \-00.1  nVidia Corporation Unknown device 0be5

Indeed our configuration has 4 GPUs with 2 IOHs. Thank you very much.

mfatica · April 14, 2011, 5:25pm

You could play with CUDA_VISIBLE_DEVICES to pick up two cards that are on the same IOH.

Something similar to:
export CUDA_VISIBLE_DEVICES=2,3

tmurray · April 14, 2011, 7:10pm

Probably not.

RadOncMike · April 20, 2011, 8:44am

I am with a group that has been doing medical imaging and radiation oncology research with GPUs for a couple years now. We want to buy a new Mult-GPU workstation (and perhaps grow it into a cluster) because some of our codes are still too slow to be practical in a clinical environment when running on only 1 GPU. Peer-to-peer communication is vital to these algorithms as relatively large amounts of data must be exchanged during run time.

Now that you know I’m serious, I have a few questions:

When tmurray says:

What exactly will be disabled? P2P memory copies or P2P memory access within kernels, or both?

Will P2P still be supported between GPUs connected to the same IOH on a machine with dual IOHs?

If So:

Will a process or thread be able to tell which pair of GPUs are on the same IOH at runtime? Is it important that the thread or process be running on the CPU that is connected directly to the IOH said GPUs are connected to? Is there any way to assure this will happen?

I also have another, somewhat related, question regarding this quote in the press release:

~~How does one go about doing this (code example)? Does this data transfer happen without any host staging (meaning a GPU on one node sends data over Infiniband directly to GPU memory on another node)?~~ → Found answer here: Accelerated Applications Catalog | NVIDIA

I looked at the code and was wondering why this is only supported for Tesla cards? It seems as though the only deference is now the Infiniband card can read directly from the pinned memory allocated by CUDA; how does the GPU get involved with the MPI send/receive? Will Tesla restriction be lifted (or has it already with rc2)?

Will P2P support for GeForce Fermi cards (recently added for 4.0rc2) be dropped with 4.0 final?

These are very important questions I need answered before we spend big bucks. I cant find this information anywhere else so THANK YOU NVIDIA PEOPLE in advance! External Image

Will_Ramey · May 3, 2011, 8:25pm

@RadOncMike:

Some additional info on your questions about GPUDirect P2P support…

GPUDirect P2P will continue to work on Fermi-based GeForce, Tesla and Quadro GPUs in the CUDA 4.0 production release.

NVIDIA GPUs are designed to take full advantage of the PCIe P2P specification. Unfortunately, current Intel I/O Hub (IOH) chipsets do not fully support PCIe P2P communictation so we need to disable P2P communication bewteen GPUs that are connected to different Intel IOH chipsets. This limtation affects both P2P transfers (e.g. memory copies initiated by cudaMemcpy in your host code) and P2P memory accesses within kernels.

P2P communication will continue to be supported where possible, for example:

between GPUs connected to the same PCIe root complex (e.g Intel IOH or AMD chipset)
between GPUs connected to different AMD chipsets that communicate via HyperTransport

Additional details on this limitation in current Intel IOH chipsets is documented in Table 7-4 of the IOH Datasheet at: http://www.intel.com/Assets/PDF/datasheet/321328.pdf

The good news is that the code path for GPUDirect P2P Transfers (e.g. using cudaMemcpy) is always supported. If the P2P path between two particular GPUs is not supported because they are connected to two different Intel IOHs, cudaMemcpy automatically uses a Device-to-Host-to-Device fallpath path for you.

We’re adding more details on P2P communication to the CUDA Programming Guide and SDK code samples for the final release, so this should all be much easier for developers to understand and use.

BTW, if you haven’t already, please take a few minutes to fill out the CUDA 4.0 Feedback Survey and share your experience so far.

Thanks,

Will.

Topic		Replies	Views
P2P: How do I know if cudaMemcpy falls back to non-P2P? CUDA Programming and Performance	8	2355	October 12, 2021
multi-GPU Peer to Peer access CUDA SDK example not working, why? CUDA Programming and Performance	13	5154	February 26, 2015
Confused about GTX Titan Z Peer-To-Peer (P2) capability CUDA Programming and Performance	19	5077	February 23, 2015
One GPU NOT capable of Peer-to-Peer (P2P) CUDA Programming and Performance	22	5061	November 27, 2018
P2P access not enabled, is this a software or a hardware issue? CUDA Setup and Installation	7	9608	November 10, 2015
Using GTX 590 cards for CUDA SLI cards under CUDA? CUDA Programming and Performance	37	14219	April 2, 2012
Real-time GPU processing Peer 2 peer data copy, Linux kernel memory, kernels in kernel, CUDA Programming and Performance	35	8100	June 30, 2010
CUDA 4.0 CUDA Programming and Performance	63	507397	March 28, 2013
Not all cuda devices detected in cuda fortran Windows 10 Legacy PGI Compilers	11	21991	November 19, 2018
Benchmarking GPUDirect RDMA on Modern Server Platforms Technical Blog	40	2737	April 11, 2019

cuda 4.0rc2 cudaMemcpyPeer(Async) performance issues

Related topics