Why is cudaMallocHost() so slow?

While working on a CUDA implementation of a well-known image feature extractor called SIFT, I’ve been wondering whether to use malloc() or cudaMallocHost() for host memory allocation. In quite a number of places I have functions like the one below. Small pieces of memory are transferred back and forth between device and host, for manipulation on the host. In these cases I need some temporary host memory. For some reason cudaMallocHost() is quite costly. Why is that? In the case below the allocation cost is far higher than the cost of the communication itself. Is the only benefit of cudaMallocHost() faster communication (at the expense of slower allocation)? What about stability? Are there any differences?

The section below executes in 1.3 ms with cudaMallocHost(), but in only 0.1 ms using malloc(). In my case I have around 10 such places in a piece of code that takes about 15 ms when the temporary allocation is done with malloc(). Thus the computational cost is almost doubled when I use cudaMallocHost(), i.e. 15 ms + 10 x 1.3 ms.

#include <stdio.h>
#include <cuda_runtime.h>
#include <cutil.h>

int main(int argc, char **argv)
{
  const int numPts = 300;
  int sz = sizeof(float)*numPts;
  float *d_data = 0, *h_data = 0;

  cudaMalloc((void **)&d_data, sz);

  unsigned int hTimer;
  CUT_SAFE_CALL(cutCreateTimer(&hTimer));
  CUT_SAFE_CALL(cutResetTimer(hTimer));
  CUT_SAFE_CALL(cutStartTimer(hTimer));

  /* Temporary pinned host buffer, used only for this round trip */
  cudaMallocHost((void **)&h_data, sz);
  CUDA_SAFE_CALL(cudaMemcpy(h_data, d_data, sz, cudaMemcpyDeviceToHost));

  /* Do some manipulation on the host */

  CUDA_SAFE_CALL(cudaMemcpy(d_data, h_data, sz, cudaMemcpyHostToDevice));
  cudaFreeHost(h_data);

  CUT_SAFE_CALL(cutStopTimer(hTimer));
  double gpuTime = cutGetTimerValue(hTimer);
  CUT_SAFE_CALL(cutDeleteTimer(hTimer));
  printf("time = %.2f msec\n", gpuTime);

  cudaFree(d_data);
  return 0;
}

Logically, the reason the pinned memory allocators are likely slower is that the system’s kernel has to manipulate the virtual memory mappings for the range of affected pages. The idea of pinned memory allocations is that you pay that cost up front, rather than during the DMA operations to/from the GPU. Ideally, you’d reuse the same memory buffer for many host/GPU transfers and thus amortize the cost of the allocation, giving a significant speedup, as sketched below. Pinned allocations are great for people with iterative or multi-pass algorithms where data gets bounced back and forth between the host and the GPU as a necessity of the computation.

Cheers,
John Stone
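
For illustration, here is a minimal sketch of the reuse pattern described above: the pinned buffer is allocated once up front and reused for every round trip, so the allocation cost is amortized over many transfers. The names maxSz and numPasses and the loop body are illustrative assumptions, not code from this thread.

/* Allocate the pinned staging buffer once, outside the hot path. */
float *h_staging = 0;
cudaMallocHost((void **)&h_staging, maxSz);

for (int i = 0; i < numPasses; i++) {
  cudaMemcpy(h_staging, d_data, sz, cudaMemcpyDeviceToHost);
  /* Do some manipulation on the host */
  cudaMemcpy(d_data, h_staging, sz, cudaMemcpyHostToDevice);
}

/* Release the pinned buffer once, at shutdown. */
cudaFreeHost(h_staging);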

Thank you, John. That makes sense. I’ll try to decide whether allocating a buffer early on, for temporary use at various points in the code, is an alternative for me. A problem in my case is that you rarely know in advance how large the buffers need to be, due to the nature of the algorithm.
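
One possible workaround when sizes are not known in advance is a grow-only pinned scratch buffer that reallocates only when a request exceeds the current capacity. The sketch below is an assumption about how that could look (the helper name is made up), not code from the thread.

/* Reuse one pinned buffer; reallocate only when a larger size is needed,
   so the expensive cudaMallocHost() call happens rarely. */
static float *h_scratch = 0;
static size_t h_scratchBytes = 0;

float *getPinnedScratch(size_t bytes)
{
  if (bytes > h_scratchBytes) {
    if (h_scratch)
      cudaFreeHost(h_scratch);
    cudaMallocHost((void **)&h_scratch, bytes);
    h_scratchBytes = bytes;
  }
  return h_scratch;
}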

cudaMallocHost/cuMemAllocHost not only allocate pinned memory on the CPU side, but also map that memory into the GPU’s address space. That operation is currently expensive, so it’s best to amortize the allocations if possible.

If it is mapped into the GPU’s address space, is there any way to get the address directly for use in a kernel? That’d be so great.

Sorry, I should have said the memory has to be mapped into the scatter/gather data structures needed by the GPU to perform DMA. My point was that both the CPU and the GPU have work to do when you call cudaMallocHost.

As for mapping host memory into the kernel’s address space: agreed, that would be a nice feature (though it does introduce some interesting synchronization and coherency issues), and we’re certainly thinking about it.

Didn’t beta 0.8 or one of the other very early revs have the ability to map host memory? I seem to recall you had some API for that and decided to remove it because it made very inefficient use of the host bus, or something like that. I had previously pondered whether having a host memory buffer mapped on the GPU would be helpful, and I was thinking it might be quite nice for kernels that make very infrequent accesses to host memory relative to the amount of work they do internal to the GPU. Maybe if this feature were only allowed for pinned host memory buffers, you could add something like it again.

Cheers,

John Stone

Well, you can: use cudaHostAlloc() with the cudaHostAllocMapped flag.

With that, you can get a device pointer using cudaHostGetDevicePointer().

However, in this case the memory accesses are carried out during kernel execution, which can cause the SMs to starve for data when a lot of data is involved. This approach is mainly useful for simplifying code, and should be reserved for very small data accesses so as not to hurt performance.
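
For reference, here is a minimal sketch of that mapped (zero-copy) path, assuming a device that supports host memory mapping; myKernel, grid, block, sz and numPts are placeholders, not from this thread.

/* Must be called before the CUDA context is created. */
cudaSetDeviceFlags(cudaDeviceMapHost);

float *h_buf = 0, *d_alias = 0;
cudaHostAlloc((void **)&h_buf, sz, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_alias, h_buf, 0);

/* The kernel reads/writes the pinned host memory directly over the bus. */
myKernel<<<grid, block>>>(d_alias, numPts);
cudaThreadSynchronize();  /* make the kernel's writes visible on the host */

cudaFreeHost(h_buf);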