I’m running CUDA 4.0 (device driver 270.41.19) on RHEL 5.6 (kernel 2.6.18). The card I’m using is a Tesla M2070.
I did some tests on the impact of page-locked memory on my application. One was an offline test where I transferred a large traffic file to the GPU in blocks of 100MB, trying both non-page-locked and page-locked host memory. The other test used streaming traffic captured on a Myricom NIC, transferring the packets to the GPU in 100MB buffers.
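For reference, the pageable-vs-pinned comparison I ran looks roughly like this (a minimal sketch, not my actual test harness; buffer size matches the 100MB blocks, error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_BYTES (100u * 1024u * 1024u)  /* 100 MB blocks, as in the test */

/* Time a single host-to-device copy with CUDA events. */
static float time_h2d(void *host, void *dev, size_t bytes)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void)
{
    void *dev, *pageable, *pinned;
    cudaMalloc(&dev, BUF_BYTES);

    pageable = malloc(BUF_BYTES);        /* ordinary (non-page-locked) memory */
    cudaMallocHost(&pinned, BUF_BYTES);  /* page-locked host memory */

    printf("pageable: %.1f ms\n", time_h2d(pageable, dev, BUF_BYTES));
    printf("pinned:   %.1f ms\n", time_h2d(pinned, dev, BUF_BYTES));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}
```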
My results for the data transfer rates from the host to GPU are as follows:
Myricom tells me that they do use page-locked memory, but the problem is that the GPU does not know that and treats that memory space as non-page-locked. As a result, it performs an extra copy, which explains the low transfer rate. They recommended I use GPUDirect so that both the GPU and the NIC agree that the memory is page-locked.
After reading about GPUDirect, I have a couple of questions:
I keep reading that GPUDirect is supported by QLogic and Mellanox InfiniBand hardware. Does it also apply to other NICs on PCIe, such as the Myricom card in my case?
I read that I have to set the environment variable CUDA_NIC_INTEROP=1, and I have done so using “export CUDA_NIC_INTEROP=1”. However, I don’t see a big improvement in the host-to-GPU data transfer rate. Do I need to do anything else to implement GPUDirect?
You need to page-lock your host memory (cudaMallocHost) or register it (cudaHostRegister, which is not very flexible at the moment because of its page-alignment and size restrictions), and then perform the transfer.
I’ve added cudaHostRegister() to my code, but I wasn’t able to compile it. (BTW, I couldn’t allocate the memory for the transfer using valloc() or mmap() since that memory is allocated by the NIC.)
I have another array (declared as static int *) to hold the results, and it was allocated on the device using cudaMalloc. At the line where this array’s values are cudaMemset to zero, I get the error “invalid argument”. I have checked that the number of array values to cudaMemset is smaller than the size of the array, and I didn’t think this array should be affected.
If I remove the cudaHostRegister(), then the program works fine.
I’m puzzled as to why this other array would be affected.
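One way to narrow this down is to check the return status of every CUDA call at the call site, rather than via a later cudaGetLastError(), which also reports errors left over from earlier calls (such as a failed cudaHostRegister) and can make an unrelated cudaMemset look like the culprit. A hedged sketch of that pattern (the buffer here is page-aligned via posix_memalign so the registration can succeed; your NIC buffer would need the rounding shown earlier):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Report any failing CUDA call with its file, line, and error string. */
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s:%d: %s failed: %s\n",            \
                    __FILE__, __LINE__, #call,                   \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

int main(void)
{
    int *results_dev;
    void *host_buf;
    size_t page = 4096;  /* or sysconf(_SC_PAGESIZE) */

    CHECK(cudaMalloc((void **)&results_dev, 1024 * sizeof(int)));

    /* Page-aligned host allocation, so cudaHostRegister's alignment
     * requirement is satisfied. */
    posix_memalign(&host_buf, page, page);

    /* If registration fails (e.g. unaligned pointer or size), this reports
     * it at the right line instead of letting it surface later. */
    CHECK(cudaHostRegister(host_buf, page, 0));

    CHECK(cudaMemset(results_dev, 0, 1024 * sizeof(int)));

    CHECK(cudaHostUnregister(host_buf));
    CHECK(cudaFree(results_dev));
    free(host_buf);
    return 0;
}
```

If the "invalid argument" then shows up at the cudaHostRegister line rather than at the cudaMemset, the memset and the results array were never actually at fault.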