Why is cudaMallocHost() so slow?

While working on a CUDA implementation of a well-known image feature extractor called SIFT, I’ve been wondering whether to use malloc() or cudaMallocHost() for host memory allocation. In quite a number of places I have functions like the one below. Small pieces of memory are transferred back and forth between device and host, for manipulation on the host. In these cases I need some temporary host memory. For some reason cudaMallocHost() is quite costly. Why is that? In the case below the allocation cost is far higher than the cost of the communication itself. Is the only benefit of cudaMallocHost() faster communication (at the expense of slower allocation)? What about stability? Are there any differences?

The section below executes in 1.3 ms with cudaMallocHost(), but in only 0.1 ms using malloc(). In my case I have around 10 such places in a piece of code that takes about 15 ms when the temporary allocation is done with malloc(). Thus the computational cost is almost doubled when I use cudaMallocHost(), i.e. 15 ms + 10 x 1.3 ms.

#include <stdio.h>
#include <cuda_runtime.h>
#include <cutil.h>

int main(int argc, char **argv)
{
  const int numPts = 300;
  int sz = sizeof(float)*numPts;
  float *d_data = 0, *h_data = 0;

  cudaMalloc((void **)&d_data, sz);

  unsigned int hTimer;
  CUT_SAFE_CALL(cutCreateTimer(&hTimer));
  CUT_SAFE_CALL(cutResetTimer(hTimer));
  CUT_SAFE_CALL(cutStartTimer(hTimer));

  /* Temporary pinned host buffer, used only for this round trip */
  cudaMallocHost((void **)&h_data, sz);
  CUDA_SAFE_CALL(cudaMemcpy(h_data, d_data, sz, cudaMemcpyDeviceToHost));

  /* Do some manipulation on the host */

  CUDA_SAFE_CALL(cudaMemcpy(d_data, h_data, sz, cudaMemcpyHostToDevice));
  cudaFreeHost(h_data);

  CUT_SAFE_CALL(cutStopTimer(hTimer));
  double gpuTime = cutGetTimerValue(hTimer);
  CUT_SAFE_CALL(cutDeleteTimer(hTimer));
  printf("time = %.2f msec\n", gpuTime);

  cudaFree(d_data);
  return 0;
}

Logically, the reason the pinned memory allocators are likely slower is that the system’s kernel has to manipulate the virtual memory mappings for the range of affected pages. The idea of pinned memory allocations is that you pay that cost up front, rather than during the DMA operations to/from the GPU. Ideally, you’d reuse the same memory buffer for many host/GPU transfers and thus amortize the cost of the allocation, giving a significant speedup, as sketched below. Pinned allocations are great for people with iterative or multi-pass algorithms where data gets bounced back and forth between the host and the GPU as a necessity of the computation.

Cheers,
John Stone
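
For illustration, here is a minimal sketch of the reuse pattern described above: the pinned buffer is allocated once up front and reused for every round trip, so the allocation cost is amortized over many transfers. The names maxSz and numPasses and the loop body are illustrative assumptions, not code from this thread.

/* Allocate the pinned staging buffer once, outside the hot path. */
float *h_staging = 0;
cudaMallocHost((void **)&h_staging, maxSz);

for (int i = 0; i < numPasses; i++) {
  cudaMemcpy(h_staging, d_data, sz, cudaMemcpyDeviceToHost);
  /* Do some manipulation on the host */
  cudaMemcpy(d_data, h_staging, sz, cudaMemcpyHostToDevice);
}

/* Release the pinned buffer once, at shutdown. */
cudaFreeHost(h_staging);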

Thank you, John. That makes sense. I’ll try to decide whether allocating a buffer early on, for temporary use at various points in the code, is an alternative for me. A problem in my case is that you rarely know in advance how large the buffers need to be, due to the nature of the algorithm.
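
One possible workaround when sizes are not known in advance is a grow-only pinned scratch buffer that reallocates only when a request exceeds the current capacity. The sketch below is an assumption about how that could look (the helper name is made up), not code from the thread.

/* Reuse one pinned buffer; reallocate only when a larger size is needed,
   so the expensive cudaMallocHost() call happens rarely. */
static float *h_scratch = 0;
static size_t h_scratchBytes = 0;

float *getPinnedScratch(size_t bytes)
{
  if (bytes > h_scratchBytes) {
    if (h_scratch)
      cudaFreeHost(h_scratch);
    cudaMallocHost((void **)&h_scratch, bytes);
    h_scratchBytes = bytes;
  }
  return h_scratch;
}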

cudaMallocHost/cuMemAllocHost not only allocate pinned memory on the CPU side, but also map that memory into the GPU’s address space. That operation is currently expensive, so it’s best to amortize the allocations if possible.

If it is mapped into the GPU’s address space, is there any way to get the address directly for use in a kernel? That’d be so great.

Sorry, I should have said the memory has to be mapped into the scatter/gather data structures needed by the GPU to perform DMA. My point was that both the CPU and the GPU have work to do when you call cudaMallocHost.

As for mapping host memory into the kernel’s address space: agreed, that would be a nice feature (though it does introduce some interesting synchronization and coherency issues), and we’re certainly thinking about it.

Didn’t beta 0.8 or one of the other very early revs have the ability to map host memory? I seem to recall you had some API for that and decided to remove it because it made very inefficient use of the host bus, or something like that. I had previously pondered whether having a host memory buffer mapped on the GPU would be helpful, and I was thinking it might be quite nice for kernels that make very infrequent accesses to host memory relative to the amount of work they do internal to the GPU. Maybe if this feature were only allowed for pinned host memory buffers, you could add something like it again.

Cheers,

John Stone

Well, you can: use cudaHostAlloc() with the cudaHostAllocMapped flag.

With that, you can get a device pointer using cudaHostGetDevicePointer().

However, in this case the memory accesses are carried out during kernel execution, which can cause the SMs to starve for data when a lot of data is involved. This approach is mainly useful for simplifying code, and should be reserved for very small data accesses so as not to hurt performance.
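
For reference, here is a minimal sketch of that mapped (zero-copy) path, assuming a device that supports host memory mapping; myKernel, grid, block, sz and numPts are placeholders, not from this thread.

/* Must be called before the CUDA context is created. */
cudaSetDeviceFlags(cudaDeviceMapHost);

float *h_buf = 0, *d_alias = 0;
cudaHostAlloc((void **)&h_buf, sz, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_alias, h_buf, 0);

/* The kernel reads/writes the pinned host memory directly over the bus. */
myKernel<<<grid, block>>>(d_alias, numPts);
cudaThreadSynchronize();  /* make the kernel's writes visible on the host */

cudaFreeHost(h_buf);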