We have been investigating performance issues with GPUDirect peer-to-peer transfers in the latest release, CUDA 4.0 RC2, since last Friday. We have four C2050 cards installed in a Linux box running Fedora 13 (x86_64) with a 2.6.33 kernel. The machine has four PCI-E 2.0 slots and dual quad-core Intel Xeon E5630 processors running at 2.53 GHz, and it is equipped with 48 GB of 1066 MHz host memory. The NVIDIA driver is 270.40.

What we have been doing is measuring the peer-to-peer transfer bandwidth from one GPU to another, and the total transfer bandwidth for all four GPUs in a one-dimensional ring. The test code is rather simple: we use cudaMemcpyPeer or cudaMemcpyPeerAsync to transfer data from one GPU to another, and the timing values are obtained with cudaEventRecord and cudaEventElapsedTime after cudaEventSynchronize. The test code runs fine, but we are puzzled by the results. The peer-to-peer bandwidth from one GPU to another is 3.5 GB/s, which is very similar to the bandwidth we measure between two GPUs when staging through host memory with MPI. Worse, the total peer-to-peer bandwidth for the 1-D ring (GPU 1 --> GPU 2 --> GPU 3 --> GPU 4 --> GPU 1) is around 5 GB/s, which is much lower than the 8 GB/s aggregate bandwidth we get for the same ring through host memory (via MPI).
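For reference, here is a stripped-down version of our timing loop (the buffer size, repetition count, and device IDs below are illustrative, not our exact settings):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;          // 64 MB per transfer (illustrative size)
    const int src = 0, dst = 1, reps = 10;  // illustrative device IDs and repeat count

    // Allocate a buffer on each GPU involved in the transfer.
    float *srcBuf, *dstBuf;
    cudaSetDevice(src);
    cudaMalloc((void**)&srcBuf, bytes);
    cudaSetDevice(dst);
    cudaMalloc((void**)&dstBuf, bytes);

    // Time the copies with events recorded on the source device's default stream.
    cudaSetDevice(src);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeerAsync(dstBuf, dst, srcBuf, src, bytes, 0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // block until the last copy has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("peer-to-peer bandwidth: %.2f GB/s\n",
           (double)bytes * reps / (ms * 1.0e6));  // bytes / ms, scaled to GB/s

    cudaSetDevice(src); cudaFree(srcBuf);
    cudaSetDevice(dst); cudaFree(dstBuf);
    return 0;
}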
In addition, we also tried using GPU kernels to access remote GPU memory addresses directly. According to the "Peer-to-Peer Memory Access" section of the CUDA 4.0 document "CUDA_C_Programming_Guide", one is able to access device memory on a remote GPU from a kernel running on a different GPU. However, when we tried this feature by accessing the remote address of a float array residing on another GPU from a kernel, we got a kernel launch failure error.
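Here is roughly what that experiment looks like (the kernel name, array size, and device IDs are ours for illustration; the cudaDeviceCanAccessPeer / cudaDeviceEnablePeerAccess sequence follows the programming guide). If anyone spots a mistake in this sequence, please point it out:

#include <cstdio>
#include <cuda_runtime.h>

// Kernel running on device 0 that dereferences memory resident on device 1.
__global__ void scaleRemote(float *remote, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) remote[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;  // illustrative array size

    // Allocate the float array on device 1.
    float *buf;
    cudaSetDevice(1);
    cudaMalloc((void**)&buf, n * sizeof(float));
    cudaMemset(buf, 0, n * sizeof(float));

    // Device 0 must explicitly be granted access to device 1's memory
    // before a kernel on device 0 may dereference the device-1 pointer.
    cudaSetDevice(0);
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) { printf("peer access 0 -> 1 not supported\n"); return 1; }
    cudaDeviceEnablePeerAccess(1, 0);  // second argument is flags, must be 0

    // Launch on device 0, passing the device-1 pointer straight in.
    scaleRemote<<<(n + 255) / 256, 256>>>(buf, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));
    return 0;
}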
Has anyone observed similarly poor performance numbers using GPUDirect? Has anyone successfully launched a GPU kernel that accesses remote GPU memory addresses directly? Thank you.