Calling cudaHostAlloc (or cudaMallocHost) on a non-default stream

Hi all,

I want to overlap host-to-device memory copy with the kernel executions on the device.
Here is what I’ve done:

  1. I use two separate non-default streams; one for the memory copy, and another one for the kernels.
  2. I use cudaMemcpyAsync.
  3. I use cudaHostAlloc for the memory allocation on the host to have the memory pinned.

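For reference, here is a minimal sketch of that setup. The stream, buffer, and kernel names are mine, not from the original post, and `myKernel` is just a placeholder:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real work.
__global__ void myKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;

    // 1. Two separate non-default streams.
    cudaStream_t copyStream, kernelStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&kernelStream);

    // 3. Pinned host memory, required for cudaMemcpyAsync to truly overlap.
    float *h_buf, *d_buf;
    cudaHostAlloc((void **)&h_buf, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, N * sizeof(float));

    // 2. Async copy on one stream while the kernel runs on the other.
    cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float),
                    cudaMemcpyHostToDevice, copyStream);
    myKernel<<<(N + 255) / 256, 256, 0, kernelStream>>>(d_buf, N);

    cudaDeviceSynchronize();
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(kernelStream);
    return 0;
}
```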
Using the Visual Profiler, I can see that the memory copy does happen on its own stream without blocking the kernels. BUT, the problem is that whenever cudaHostAlloc is called, kernel execution stops until cudaHostAlloc returns.

As far as I could figure out, this is because cudaHostAlloc synchronizes with the default stream. So, one solution might be to create the kernels’ stream with the cudaStreamNonBlocking flag. However, that does not work for me because I DO want the kernels’ stream to be blocking with respect to the default stream.

So, how can I tell CUDA to execute cudaHostAlloc on a specific stream, rather than the default stream?


You can’t, and the reason isn’t that cudaHostAlloc runs on any particular stream.

Operations that modify the GPU memory map are usually synchronizing across the whole device.
You would run into the same problem if you did a cudaMalloc in the same place.

Get your allocations out of the performance-sensitive loop.
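In other words, hoist the allocations above the hot loop and reuse the buffers. A fragment illustrating the shape of this, assuming the streams, buffers, and kernel from the question (`fillInput` is a hypothetical stand-in for whatever produces each batch of host data):

```cpp
// Allocate ONCE, outside the performance-sensitive loop.
// cudaHostAlloc is still synchronizing, but it only happens here.
float *h_buf;
cudaHostAlloc((void **)&h_buf, N * sizeof(float), cudaHostAllocDefault);

for (int iter = 0; iter < numIters; ++iter) {
    fillInput(h_buf, N);  // hypothetical: produce this iteration's input
    // Inside the loop: only async work, no allocations, so the copy
    // and kernel streams can overlap without device-wide syncs.
    cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float),
                    cudaMemcpyHostToDevice, copyStream);
    myKernel<<<blocks, threads, 0, kernelStream>>>(d_buf, N);
}

cudaFreeHost(h_buf);  // freeing is also synchronizing; do it after the loop
```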

Thanks for your prompt response.

You’re right. I tried the non-blocking flag for the stream, and it did not solve the problem.
So I think I’ll have to use some kind of pre-allocated memory pool and manage it myself. I was hoping to avoid that :(
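A self-managed pool can be quite small. One illustrative sketch (not from this thread): pin one large block up front, then hand out aligned slices with a bump pointer, resetting between batches instead of freeing:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Minimal bump-pointer pool over a single pinned block (illustrative only;
// no thread safety, no per-slice free, just reset-and-reuse).
struct PinnedPool {
    char  *base   = nullptr;
    size_t size   = 0;
    size_t offset = 0;

    void init(size_t bytes) {
        // One synchronizing cudaHostAlloc, paid once at startup.
        cudaHostAlloc((void **)&base, bytes, cudaHostAllocDefault);
        size = bytes;
    }
    void *alloc(size_t bytes) {
        bytes = (bytes + 255) & ~size_t(255);  // keep slices 256-byte aligned
        if (offset + bytes > size) return nullptr;  // pool exhausted
        void *p = base + offset;
        offset += bytes;
        return p;
    }
    void reset()   { offset = 0; }      // reuse the whole block next batch
    void destroy() { cudaFreeHost(base); base = nullptr; }
};
```

All allocations inside the loop then become pointer arithmetic, so nothing there touches the GPU memory map.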