Using GPUDirect with a NIC

Hi all,

I’m running CUDA 4.0 (device driver 270.41.19) on RHEL 5.6 (kernel 2.6.18). The card I’m using is a Tesla M2070.

I did some tests on the impact of page-locked memory on my application. One was an offline test in which I transferred a large traffic file to the GPU in blocks of 100 MB, trying both non-page-locked and page-locked host memory. The other test used streaming traffic captured on a Myricom NIC, transferring packets to the GPU in 100 MB buffers.

My results for the data transfer rates from the host to GPU are as follows:

Offline test, page-locked memory: 5.6 GB/s
Offline test, non-page-locked memory: 2.24 GB/s
Streaming test: 2.26 GB/s
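For reference, the pageable-vs-pinned comparison above can be sketched roughly like this (the iteration count is my own arbitrary choice; error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Time repeated host-to-device copies of one buffer and return GB/s. */
static float h2d_rate(void *host, void *dev, size_t bytes, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    return (float)bytes * iters / (ms * 1e6f);   /* bytes/ms -> GB/s */
}

int main(void)
{
    const size_t bytes = 100u << 20;   /* 100 MB blocks, as in the test above */
    const int iters = 10;
    void *dev, *pageable, *pinned;
    cudaMalloc(&dev, bytes);
    pageable = malloc(bytes);          /* ordinary, pageable host memory */
    cudaMallocHost(&pinned, bytes);    /* page-locked host memory */
    printf("pageable: %.2f GB/s\n", h2d_rate(pageable, dev, bytes, iters));
    printf("pinned:   %.2f GB/s\n", h2d_rate(pinned, dev, bytes, iters));
    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}
```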

Myricom tells me that their driver does use page-locked memory, but the problem is that the GPU doesn't know that and treats that memory as pageable. The CUDA driver therefore makes an extra staging copy, which explains the low streaming rate. They recommended I use GPUDirect so that both the GPU and the NIC agree the memory is page-locked.

After reading about GPUDirect, I have a couple of questions:

  1. I keep reading that GPUDirect is supported by QLogic's and Mellanox's InfiniBand technology. Does it also apply to other PCIe NICs, such as my Myricom card?

  2. I read that I have to set the environment variable CUDA_NIC_INTEROP=1, which I did with “export CUDA_NIC_INTEROP=1”. However, I don’t see a big improvement in the host-to-GPU transfer rate. Do I need to do anything else to enable GPUDirect?

Thank you.

Have you called cudaHostRegister() on the memory used for the transfer?

You need to page-lock your host memory, either by allocating it with cudaMallocHost() or by registering existing memory with cudaHostRegister() (not very flexible at the moment, given its page-alignment and size restrictions), and then use that memory for the transfers.
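A minimal sketch of the cudaHostRegister() path, assuming `nic_buf` is the pointer handed back by the NIC driver (hypothetical name). In CUDA 4.0 both the pointer and the size must be page-aligned, so the helper rounds the base down and the length up before registering:

```cuda
#include <cuda_runtime.h>
#include <unistd.h>    /* sysconf */
#include <stdint.h>

/* Register an existing host buffer so the CUDA runtime treats it as
 * page-locked. Rounds [nic_buf, nic_buf + len) out to page boundaries,
 * since cudaHostRegister() requires page-aligned pointer and size. */
cudaError_t register_nic_buffer(void *nic_buf, size_t len,
                                void **reg_base, size_t *reg_len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    uintptr_t base = (uintptr_t)nic_buf & ~(uintptr_t)(page - 1);   /* round down */
    uintptr_t end  = ((uintptr_t)nic_buf + len + page - 1)
                     & ~(uintptr_t)(page - 1);                      /* round up */
    *reg_base = (void *)base;
    *reg_len  = (size_t)(end - base);
    return cudaHostRegister(*reg_base, *reg_len, cudaHostRegisterDefault);
}

/* When done with the buffer: cudaHostUnregister(*reg_base); */
```

Always check the return value of cudaHostRegister(): on a misaligned pointer or size it fails, and an unregistered buffer will silently fall back to the slow pageable path.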

You could read more about the issues at

Thanks for your answers.

I’ve added cudaHostRegister() to my code, but I wasn’t able to compile it. (BTW, I couldn’t allocate the memory for the transfer using valloc() or mmap(), since that memory is allocated by the NIC driver.)

I have another array (declared as static int *) that holds the results; it was allocated on the device with cudaMalloc. At the line where this array is cudaMemset to zero, I get the error “invalid argument”. I have checked that the count passed to cudaMemset is smaller than the size of the array, and I didn’t think this array should be affected.
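One way to pin down which call is actually failing is to check the return code of every CUDA call, including cudaHostRegister() itself; an earlier failure (e.g. a misaligned pointer passed to cudaHostRegister) can otherwise go unnoticed until a later call reports an error. A common macro sketch (names are my own):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Abort with a descriptive message if a CUDA runtime call fails. */
#define CUDA_CHECK(call) do {                                         \
    cudaError_t e = (call);                                           \
    if (e != cudaSuccess) {                                           \
        fprintf(stderr, "%s failed at %s:%d: %s\n",                   \
                #call, __FILE__, __LINE__, cudaGetErrorString(e));    \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

/* Usage (hypothetical variable names):
 * CUDA_CHECK(cudaHostRegister(buf, bytes, cudaHostRegisterDefault));
 * CUDA_CHECK(cudaMalloc((void **)&results, n * sizeof(int)));
 * CUDA_CHECK(cudaMemset(results, 0, n * sizeof(int)));
 */
```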

If I remove the cudaHostRegister(), then the program works fine.

I’m puzzled as to why this other array would be affected.

Yes, that’s the point of cudaHostRegister(). If you didn’t need to use the memory allocated by the NIC’s driver, you could just as well use cudaMallocHost().

I am, too. How can cudaHostRegister() change the behavior of your code if you aren’t even able to compile it?