Memory copy/set async to kernel execution in different stream


I have a question regarding the ‘async’ in cuMemcpyHtoDAsync and cuMemsetD8Async. From my understanding, it means asynchronous with regard to the host CPU, but also with regard to kernel execution in a different stream. Here is an article that describes what I was hoping to get:

How to Overlap Data Transfers in CUDA C/C++ | NVIDIA Technical Blog

For my test I use an RTX 3060 on Windows 11 with VS2022 17.4.3 and CUDA 12.0.

I use a single context for the application, which I request with:
cuDevicePrimaryCtxRetain(&ctx, dev);

and release with:

Whenever one of the threads in my application needs access to the GPU it first calls:

then does the required CUDA work and restores the old context with:

For each stream in my application I create a new stream handle with cuStreamCreate. I never use the default stream, to avoid any issues with the async executions:
l_cuKernelError = cuStreamCreate( &l_iStreamInfo.cudaStream, CU_STREAM_NON_BLOCKING );
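The calls for releasing the context, making it current on a thread, and restoring the old context are not shown in the post. A minimal sketch of one common driver API pattern follows, assuming (this is a guess, not the poster's confirmed code) that those calls are cuDevicePrimaryCtxRelease, cuCtxPushCurrent and cuCtxPopCurrent. The helper names check, create_stream_in_ctx and run_context_demo are hypothetical.

```cuda
#include <cstdio>
#include <cuda.h>   // driver API; link with -lcuda

// Hypothetical helper: true on success, otherwise prints the error name.
static bool check(CUresult r, const char* what) {
    if (r == CUDA_SUCCESS) return true;
    const char* name = nullptr;
    cuGetErrorName(r, &name);
    std::fprintf(stderr, "%s failed: %s\n", what, name ? name : "unknown");
    return false;
}

// Called from any host thread: make the shared context current,
// create a non-blocking stream, then restore the previous context.
static bool create_stream_in_ctx(CUcontext ctx, CUstream* stream) {
    if (!check(cuCtxPushCurrent(ctx), "cuCtxPushCurrent")) return false;
    bool ok = check(cuStreamCreate(stream, CU_STREAM_NON_BLOCKING), "cuStreamCreate");
    CUcontext popped = nullptr;
    ok = check(cuCtxPopCurrent(&popped), "cuCtxPopCurrent") && ok;
    return ok;
}

int run_context_demo() {
    CUdevice dev = 0;
    CUcontext ctx = nullptr;
    CUstream stream = nullptr;
    if (!check(cuInit(0), "cuInit")) return 1;
    if (!check(cuDeviceGet(&dev, 0), "cuDeviceGet")) return 1;
    // Retaining the primary context does not make it current on this thread.
    if (!check(cuDevicePrimaryCtxRetain(&ctx, dev), "cuDevicePrimaryCtxRetain")) return 1;

    bool ok = create_stream_in_ctx(ctx, &stream);
    if (ok) {
        CUcontext popped = nullptr;
        if (check(cuCtxPushCurrent(ctx), "cuCtxPushCurrent")) {
            ok = check(cuStreamDestroy(stream), "cuStreamDestroy");
            ok = check(cuCtxPopCurrent(&popped), "cuCtxPopCurrent") && ok;
        } else {
            ok = false;
        }
    }
    // Balance the retain once the application no longer needs the context.
    ok = check(cuDevicePrimaryCtxRelease(dev), "cuDevicePrimaryCtxRelease") && ok;
    return ok ? 0 : 1;
}

int main() { return run_context_demo(); }
```

Pushing and popping around each block of GPU work keeps the calling thread's previous context intact, which matters when several host threads share one primary context.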

I need two memory blocks to execute the kernel. One holds read-only information, the other one holds the result of the kernel execution.

In the device code I define both memory blocks like this:

#define CUDA_STREAMS 2

__device__ __constant__ CUDA_KERNEL_INFO dev_CudaKernelInfo[CUDA_STREAMS];
__device__ CUDA_RESULT_INFO dev_CudaResultInfo[CUDA_STREAMS];

From what I understood from this (CUDA Runtime API :: CUDA Toolkit Documentation), I need to allocate ‘pinned’ memory to enable async behavior.

In the host code I allocate the memory block for the constant, read-only data like this:
l_cuKernelError = cuMemHostAlloc( (void**)&l_iStreamInfo.pHostKernelInfo, sizeof( CUDA_KERNEL_INFO ) * CUDA_STREAMS, CU_MEMHOSTALLOC_PORTABLE|CU_MEMHOSTALLOC_WRITECOMBINED );

The block for the results is allocated like this:
l_cuKernelError = cuMemHostAlloc( (void**)&l_iStreamInfo.pHostResultInfo, sizeof( CUDA_RESULT_INFO ) * CUDA_STREAMS, CU_MEMHOSTALLOC_PORTABLE );

The workflow for each kernel execution is quite simple:

  1. copy readonly memory to device
  2. Initialize result memory block
  3. execute kernel
  4. copy results back from device to host
  5. launch host function to notify that the kernel has finished its work

As soon as one stream finishes its work, the host starts analyzing the results and restarts the kernel with new work.

This is how each step is implemented:

  1. copy readonly memory to device
    l_cuKernelError = cuMemcpyHtoDAsync( p_pStreamInfo->devKernelInfoPtr, (uint8_t*)p_pStreamInfo->pHostKernelInfo, sizeof( CUDA_KERNEL_INFO ), p_pStreamInfo->cudaStream );
    Note: p_pStreamInfo->devKernelInfoPtr points to the array item that belongs to this stream

  2. Initialize result memory block
    l_cuKernelError = cuMemsetD8Async( p_pStreamInfo->devResultInfoPtr, 0, (int32_t)sizeof( CUDA_RESULT_INFO ), p_pStreamInfo->cudaStream );
    Note: p_pStreamInfo->devResultInfoPtr points to the array item that belongs to this stream

  3. execute kernel
    void* l_arrayStreamArgs[] = { &p_pStreamInfo->stStreamId, &p_pStreamInfo->ui64VarInfo };
    l_cuKernelError = cuLaunchKernel( p_pKernelOnGpu->cuFunction, p_pKernelOnGpu->BlockConfig.ui32GridSize, 1, 1, p_pKernelOnGpu->BlockConfig.ui32BlockSize, 1, 1, (unsigned int)p_pKernelOnGpu->stSharedMemorySize, p_pStreamInfo->cudaStream, l_arrayStreamArgs, nullptr );

  4. copy results back from device to host
    l_cuKernelError = cuMemcpyDtoHAsync( p_pStreamInfo->pHostResultInfo, p_pStreamInfo->devResultInfoPtr, (int32_t)sizeof( CUDA_RESULT_INFO ), p_pStreamInfo->cudaStream );

  5. launch host function to notify that the kernel has finished its work
    l_cuKernelError = cuLaunchHostFunc( l_pStream->cudaStream, _Callback_EventKernelExecute, (void*)l_pStream );
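The five steps above can be sketched as one self-contained program. This is only an illustration under stated assumptions: it uses the runtime API instead of the driver API from the post so that the kernel can live in the same file, and the kernel worker, the callback on_done, the element count kN and the function run_pipeline are all hypothetical stand-ins.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call)                                                          \
    do {                                                                     \
        cudaError_t e_ = (call);                                             \
        if (e_ != cudaSuccess) {                                             \
            std::fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e_)); \
            return 1;                                                        \
        }                                                                    \
    } while (0)

constexpr int kStreams = 2;   // mirrors CUDA_STREAMS in the post
constexpr int kN = 1 << 20;   // hypothetical element count per stream

// Hypothetical stand-in for the real worker kernel.
__global__ void worker(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] += 2 * in[i];
}

// Host callback for step 5; it must not call any CUDA API function.
void CUDART_CB on_done(void* user) { *static_cast<int*>(user) = 1; }

int run_pipeline() {
    const size_t bytes = kN * sizeof(int);
    cudaStream_t s[kStreams];
    int *h_in[kStreams], *h_out[kStreams], *d_in[kStreams], *d_out[kStreams];
    int done[kStreams] = {0, 0};

    for (int k = 0; k < kStreams; ++k) {
        CHECK(cudaStreamCreateWithFlags(&s[k], cudaStreamNonBlocking));
        // Pinned host buffers, needed for copies that are truly asynchronous.
        CHECK(cudaMallocHost(reinterpret_cast<void**>(&h_in[k]), bytes));
        CHECK(cudaMallocHost(reinterpret_cast<void**>(&h_out[k]), bytes));
        CHECK(cudaMalloc(reinterpret_cast<void**>(&d_in[k]), bytes));
        CHECK(cudaMalloc(reinterpret_cast<void**>(&d_out[k]), bytes));
        for (int i = 0; i < kN; ++i) h_in[k][i] = k + 1;
    }

    for (int k = 0; k < kStreams; ++k) {
        CHECK(cudaMemcpyAsync(d_in[k], h_in[k], bytes,
                              cudaMemcpyHostToDevice, s[k]));      // step 1
        CHECK(cudaMemsetAsync(d_out[k], 0, bytes, s[k]));          // step 2
        worker<<<(kN + 255) / 256, 256, 0, s[k]>>>(d_in[k], d_out[k], kN);  // step 3
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_out[k], d_out[k], bytes,
                              cudaMemcpyDeviceToHost, s[k]));      // step 4
        CHECK(cudaLaunchHostFunc(s[k], on_done, &done[k]));        // step 5
    }

    int bad = 0;
    for (int k = 0; k < kStreams; ++k) {
        CHECK(cudaStreamSynchronize(s[k]));
        if (!done[k]) ++bad;
        for (int i = 0; i < kN; ++i)
            if (h_out[k][i] != 2 * (k + 1)) { ++bad; break; }
        CHECK(cudaFreeHost(h_in[k]));
        CHECK(cudaFreeHost(h_out[k]));
        CHECK(cudaFree(d_in[k]));
        CHECK(cudaFree(d_out[k]));
        CHECK(cudaStreamDestroy(s[k]));
    }
    return bad;   // 0 when both streams produced the expected results
}

int main() { return run_pipeline(); }
```

The sketch only shows the issue order per stream. Whether the copies of one stream actually overlap the kernel of the other has to be checked in a profiler such as Nsight Systems.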

I was hoping that steps 1 and 2 for stream 1 would be executed while the kernel of stream 2 is executing, and vice versa. I checked my assumption with NVIDIA Nsight Systems 2022.5.1 and got these results:

Here is the complete picture of the kernel executions. You can clearly see the alternating kernel executions of stream 1 and stream 2:

A more detailed view of the host calls:

As you can see, the calls from the host to the device for a stream are perfectly asynchronous and do not wait for any execution to finish. So that’s great. Looking at the details of the copy and the memset on the device, however, I cannot see any async behavior.

Showing more details:


Unfortunately, it seems that none of the memory functions execute while a kernel from a different, non-default stream is running.

Is this expected, or have I made a mistake in my code?

Thanks a lot,

A CUDA memset operation will typically launch a kernel under the hood. So it appears to me you are asking why that memset operation doesn’t overlap with other kernels (the way a copy operation would/could).

One possible reason is that those kernels (your “worker” kernels and the memset kernels) cannot run concurrently, due to resource issues.
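A rough way to see whether a worker launch leaves room for another kernel is to compare its grid size with the number of blocks the device can hold at once. The sketch below uses the runtime API occupancy query. The kernel worker, the function max_resident_blocks and the launch configuration (256 threads per block, 4096 blocks) are hypothetical, and a grid that fits is not a guarantee that two kernels will actually run concurrently.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real worker kernel.
__global__ void worker(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Number of `worker` blocks that can be resident on the whole device at once,
// or -1 if one of the queries fails.
int max_resident_blocks(int blockSize, size_t dynamicSmemBytes) {
    int dev = 0, smCount = 0, blocksPerSm = 0;
    if (cudaGetDevice(&dev) != cudaSuccess) return -1;
    if (cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, dev)
            != cudaSuccess) return -1;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSm, worker, blockSize, dynamicSmemBytes) != cudaSuccess)
        return -1;
    return smCount * blocksPerSm;
}

int main() {
    const int blockSize = 256;   // hypothetical launch configuration
    const int gridSize = 4096;
    const int resident = max_resident_blocks(blockSize, 0);
    if (resident < 0) return 1;

    std::printf("grid: %d blocks, device can hold %d resident blocks\n",
                gridSize, resident);
    if (gridSize >= resident)
        std::printf("the grid can fill the device; another kernel may have to "
                    "wait until blocks retire\n");
    else
        std::printf("the grid leaves some SM capacity unused\n");
    return 0;
}
```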


I was not aware that cuMemsetD8Async will start a new kernel. If I remove the call to cuMemsetD8Async, will the two remaining memcpy calls (cuMemcpyDtoHAsync and cuMemcpyHtoDAsync) execute during kernel execution?


It is certainly theoretically possible. I haven’t fully grokked your posting, so there may be some lurking issues I haven’t spotted. I usually prefer to work on things where a complete test case has been provided.

I will point out that a WDDM GPU on Windows may have some extra hurdles to witness the “usual” concurrency as indicated in the blog you linked. This has to do with command batching, and it’s difficult to sort out in source code.

You may also wish to experiment with the on and off settings for hardware GPU scheduling.
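Some of the device properties that bear on copy/kernel overlap can be read at run time. A small sketch, using runtime API attributes that exist in current toolkits: the number of asynchronous copy engines, whether concurrent kernels are supported, and whether the device runs the TCC driver (0 on a WDDM device). The function name query_overlap_caps is hypothetical, and the sketch does not read the Windows hardware GPU scheduling setting.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reads three overlap-related attributes of the current device.
cudaError_t query_overlap_caps(int* copyEngines, int* concurrentKernels, int* tcc) {
    int dev = 0;
    cudaError_t e = cudaGetDevice(&dev);
    if (e != cudaSuccess) return e;
    e = cudaDeviceGetAttribute(copyEngines, cudaDevAttrAsyncEngineCount, dev);
    if (e != cudaSuccess) return e;
    e = cudaDeviceGetAttribute(concurrentKernels, cudaDevAttrConcurrentKernels, dev);
    if (e != cudaSuccess) return e;
    return cudaDeviceGetAttribute(tcc, cudaDevAttrTccDriver, dev);
}

int main() {
    int copyEngines = 0, concurrentKernels = 0, tcc = 0;
    cudaError_t e = query_overlap_caps(&copyEngines, &concurrentKernels, &tcc);
    if (e != cudaSuccess) {
        std::fprintf(stderr, "query failed: %s\n", cudaGetErrorString(e));
        return 1;
    }
    std::printf("async copy engines : %d\n", copyEngines);
    std::printf("concurrent kernels : %d\n", concurrentKernels);
    std::printf("TCC driver         : %d\n", tcc);
    return 0;
}
```

An async engine count of at least 1 means the device can overlap a copy with kernel execution. A higher count additionally allows copies in both directions to run at the same time.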


Thanks for the information. I removed the call to cuMemsetD8Async, but without success. The calls to cuMemcpyDtoHAsync and cuMemcpyHtoDAsync are still executed only when no kernel is active. They are asynchronous with regard to the host CPU, but not with regard to the GPU kernels.

Thanks for your replies anyway.

I forgot to reboot after switching off the MS GPU Scheduler. After the reboot, 2 of the 3 operations work now! I got cuMemcpyHtoDAsync and cuMemsetD8Async to run asynchronously to the kernel!!! That is already fantastic. Is there any chance to get the device-to-host memcpy async as well? If not, would it make sense to try zero-copy memory to come close to an async memcpy?
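For reference, a minimal sketch of what the zero-copy idea could look like with the runtime API: the kernel writes its results straight into mapped pinned host memory, so no separate device-to-host copy is issued. The kernel write_results, the function run_zero_copy and the buffer size are hypothetical, and whether this is faster than an explicit copy depends on the access pattern, since every write travels over PCIe.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that writes its results directly to mapped host memory.
__global__ void write_results(int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2 * i;
}

int run_zero_copy() {
    const int n = 1024;
    int dev = 0, canMap = 0;
    if (cudaGetDevice(&dev) != cudaSuccess) return 1;
    if (cudaDeviceGetAttribute(&canMap, cudaDevAttrCanMapHostMemory, dev)
            != cudaSuccess || !canMap) return 1;

    int* h_out = nullptr;     // host view of the buffer
    int* d_alias = nullptr;   // device view of the same buffer
    if (cudaHostAlloc(reinterpret_cast<void**>(&h_out), n * sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess) return 1;
    if (cudaHostGetDevicePointer(reinterpret_cast<void**>(&d_alias), h_out, 0)
            != cudaSuccess) { cudaFreeHost(h_out); return 1; }

    cudaStream_t stream;
    if (cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) != cudaSuccess) {
        cudaFreeHost(h_out);
        return 1;
    }

    write_results<<<(n + 255) / 256, 256, 0, stream>>>(d_alias, n);
    cudaError_t e = cudaGetLastError();
    // The host may only read the results after the kernel has completed.
    if (e == cudaSuccess) e = cudaStreamSynchronize(stream);

    int bad = (e == cudaSuccess) ? 0 : 1;
    for (int i = 0; i < n && !bad; ++i)
        if (h_out[i] != 2 * i) bad = 1;

    cudaStreamDestroy(stream);
    cudaFreeHost(h_out);
    return bad;   // 0 when the host sees the values written by the kernel
}

int main() { return run_zero_copy(); }
```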