Memory copy/set async to kernel execution in different stream

Hi,

I have a question regarding ‘async’ in cuMemcpyHtoDAsync and cuMemsetD8Async. From my understanding it means asynchronous not only with respect to the host CPU, but also with respect to kernel execution in a different stream. Here is an article describing what I was hoping to achieve:

How to Overlap Data Transfers in CUDA C/C++ | NVIDIA Technical Blog

For my test I use an RTX 3060 on Windows 11 with VS2022 17.4.3 and CUDA 12.0.

I use a single context for the application, which I retain with:
cuDevicePrimaryCtxRetain(&ctx, dev);

and release with:
cuDevicePrimaryCtxRelease(dev);

Whenever one of the threads in my application needs access to the GPU it first calls:
cuCtxPushCurrent(ctx);

then does the required CUDA work and restores the previous context with:
cuCtxPopCurrent(&ctx);
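
For illustration, the per-thread pattern looks roughly like this (a sketch; l_poppedCtx is an illustrative name):

    CUcontext l_poppedCtx = nullptr;
    cuCtxPushCurrent( ctx );           // make the shared primary context current on this thread
    // ... issue the required CUDA calls here ...
    cuCtxPopCurrent( &l_poppedCtx );   // pop it again; l_poppedCtx == ctx on success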

For each stream in my application I create a new stream handle with cuStreamCreate - I never use the default stream, to avoid any issues with async execution:
l_cuKernelError = cuStreamCreate( &l_iStreamInfo.cudaStream, CU_STREAM_NON_BLOCKING );
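
For illustration, with CUDA_STREAMS streams (the define is shown further down) this amounts to something like the following sketch (the array name and error handling are illustrative):

    CUstream l_arrStreams[CUDA_STREAMS];
    for ( int i = 0; i < CUDA_STREAMS; ++i )
    {
        // non-blocking: no implicit synchronization with the default stream
        CUresult l_err = cuStreamCreate( &l_arrStreams[i], CU_STREAM_NON_BLOCKING );
        if ( l_err != CUDA_SUCCESS ) { /* handle error */ }
    }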

I need two memory blocks to execute the kernel. One is for readonly information, the other one is for the result of the kernel execution.

In the device code I define both memory blocks like this:

#define CUDA_STREAMS 2

__device__ __constant__ CUDA_KERNEL_INFO dev_CudaKernelInfo[CUDA_STREAMS];
__device__ CUDA_RESULT_INFO dev_CudaResultInfo[CUDA_STREAMS];
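
For reference, here is a sketch of one way the per-stream device addresses used further down (devKernelInfoPtr / devResultInfoPtr) can be obtained, assuming the device code is loaded as a module and the symbols resolve under these names (e.g. declared extern "C", otherwise the mangled names are needed):

    // i is the stream index (0 .. CUDA_STREAMS-1), hModule the loaded CUmodule
    CUdeviceptr l_devKernelInfoBase = 0, l_devResultInfoBase = 0;
    size_t      l_stSymbolSize      = 0;
    cuModuleGetGlobal( &l_devKernelInfoBase, &l_stSymbolSize, hModule, "dev_CudaKernelInfo" );
    cuModuleGetGlobal( &l_devResultInfoBase, &l_stSymbolSize, hModule, "dev_CudaResultInfo" );

    // per-stream array item of each block
    CUdeviceptr l_devKernelInfoPtr = l_devKernelInfoBase + i * sizeof( CUDA_KERNEL_INFO );
    CUdeviceptr l_devResultInfoPtr = l_devResultInfoBase + i * sizeof( CUDA_RESULT_INFO );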

From what I understood from this (CUDA Runtime API :: CUDA Toolkit Documentation (nvidia.com)), I need to allocate ‘pinned’ memory to enable async behavior.

In the host code I allocate the memory block for the const readonly memory like this:
l_cuKernelError = cuMemHostAlloc( (void**)&l_iStreamInfo.pHostKernelInfo, sizeof( CUDA_KERNEL_INFO ) * CUDA_STREAMS, CU_MEMHOSTALLOC_PORTABLE|CU_MEMHOSTALLOC_WRITECOMBINED );

The block for the results is allocated like this:
l_cuKernelError = cuMemHostAlloc( (void**)&l_iStreamInfo.pHostResultInfo, sizeof( CUDA_RESULT_INFO ) * CUDA_STREAMS, CU_MEMHOSTALLOC_PORTABLE );

The workflow for each kernel execution is quite simple:

  1. copy readonly memory to device
  2. Init result memory block
  3. execute kernel
  4. copy results back from device to host
  5. launch host function to notify that the kernel finishes work

As soon as one stream finishes its work, the host starts analyzing the results and restarts the kernel with new work.

This is how each step is implemented:

  1. copy readonly memory to device
    l_cuKernelError = cuMemcpyHtoDAsync( p_pStreamInfo->devKernelInfoPtr, (uint8_t*)p_pStreamInfo->pHostKernelInfo, sizeof( CUDA_KERNEL_INFO ), p_pStreamInfo->cudaStream );
    rem: p_pStreamInfo->devKernelInfoPtr is a pointer to the correct array-item of the stream

  2. Init result memory block
    l_cuKernelError = cuMemsetD8Async( p_pStreamInfo->devResultInfoPtr, 0, (int32_t)sizeof( CUDA_RESULT_INFO ), p_pStreamInfo->cudaStream );
    rem: p_pStreamInfo->devResultInfoPtr is a pointer to the correct array-item of the stream

  3. execute kernel
    void* l_arrayStreamArgs[] = { &p_pStreamInfo->stStreamId, &p_pStreamInfo->ui64VarInfo };
    l_cuKernelError = cuLaunchKernel( p_pKernelOnGpu->cuFunction,
                                      p_pKernelOnGpu->BlockConfig.ui32GridSize, 1, 1,
                                      p_pKernelOnGpu->BlockConfig.ui32BlockSize, 1, 1,
                                      (unsigned int)p_pKernelOnGpu->stSharedMemorySize,
                                      p_pStreamInfo->cudaStream, l_arrayStreamArgs, nullptr );

  4. copy results back from device to host
    l_cuKernelError = cuMemcpyDtoHAsync( p_pStreamInfo->pHostResultInfo, p_pStreamInfo->devResultInfoPtr, (int32_t)sizeof( CUDA_RESULT_INFO ), p_pStreamInfo->cudaStream );

  5. launch host function to notify that the kernel finishes work
    l_cuKernelError = cuLaunchHostFunc( l_pStream->cudaStream, _Callback_EventKernelExecute, (void*)l_pStream );
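
Put together, the per-stream pipeline is roughly the following (a condensed sketch of steps 1 to 5 with error handling omitted; STREAM_INFO and KERNEL_ON_GPU are placeholder type names, and the callback signature follows cuLaunchHostFunc's CUhostFn):

    void CUDA_CB _Callback_EventKernelExecute( void* p_pUserData )
    {
        // notify the host that this stream's kernel has finished and its results are back
    }

    void EnqueueStreamWork( STREAM_INFO* p_pStreamInfo, KERNEL_ON_GPU* p_pKernelOnGpu )
    {
        // 1. copy the read-only kernel info into constant memory on the device
        cuMemcpyHtoDAsync( p_pStreamInfo->devKernelInfoPtr, p_pStreamInfo->pHostKernelInfo,
                           sizeof( CUDA_KERNEL_INFO ), p_pStreamInfo->cudaStream );

        // 2. clear the result block on the device
        cuMemsetD8Async( p_pStreamInfo->devResultInfoPtr, 0,
                         sizeof( CUDA_RESULT_INFO ), p_pStreamInfo->cudaStream );

        // 3. launch the worker kernel into the same stream
        void* l_arrayStreamArgs[] = { &p_pStreamInfo->stStreamId, &p_pStreamInfo->ui64VarInfo };
        cuLaunchKernel( p_pKernelOnGpu->cuFunction,
                        p_pKernelOnGpu->BlockConfig.ui32GridSize, 1, 1,
                        p_pKernelOnGpu->BlockConfig.ui32BlockSize, 1, 1,
                        (unsigned int)p_pKernelOnGpu->stSharedMemorySize,
                        p_pStreamInfo->cudaStream, l_arrayStreamArgs, nullptr );

        // 4. copy the results back into the pinned host buffer
        cuMemcpyDtoHAsync( p_pStreamInfo->pHostResultInfo, p_pStreamInfo->devResultInfoPtr,
                           sizeof( CUDA_RESULT_INFO ), p_pStreamInfo->cudaStream );

        // 5. host callback fires once all work queued above has completed in this stream
        cuLaunchHostFunc( p_pStreamInfo->cudaStream, _Callback_EventKernelExecute,
                          (void*)p_pStreamInfo );
    }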

I was hoping that steps 1 and 2 for stream 1 are executed while the kernel of stream 2 is executing, and vice versa. I verified my assumption with NVIDIA Nsight Systems 2022.5.1 and got these results:

Here is the complete picture of the kernel executions. You can clearly see the alternating kernel executions of stream 1 and stream 2:

More detailed view to the host executions:

As you can see, the calls from the host to the device for a stream are perfectly async and don’t wait for any execution to finish. So that’s great. Looking now at the details of the copies and the memset on the device, however, I cannot see any overlap with kernel execution.

Showing more details (two zoomed-in views of the memory operations):

Unfortunately, it seems that none of the memory operations execute while a kernel from a different, non-default stream is running.

Is this expected, or have I made a mistake in my code?

Thanks a lot,
Daniel

A CUDA memset operation will typically launch a kernel under the hood. So it appears to me you are asking why that memset operation doesn’t overlap with other kernels (the way a copy operation would/could).

One possible reason is that those kernels (your “worker” kernels and the memset kernels) cannot run concurrently, due to resource issues.

Hi,

I was not aware that cuMemsetD8Async starts a new kernel. If I remove the call to cuMemsetD8Async, will the two remaining memcpys (cuMemcpyDtoHAsync and cuMemcpyHtoDAsync) execute during kernel execution?

Thanks.

It is certainly theoretically possible. I haven’t fully grokked your posting, so there may be some lurking issues I haven’t spotted. I usually prefer to work on things where a complete test case has been provided.

I will point out that a WDDM GPU on Windows may present some extra hurdles to witnessing the “usual” concurrency indicated in the blog you linked. This has to do with command batching, and it’s difficult to sort out in source code.
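
One thing sometimes suggested as a workaround for WDDM command batching (hedged, since the batching heuristics are not documented): querying the stream after enqueuing work tends to push the pending command batch to the GPU. The variable names below are illustrative:

    cuMemcpyHtoDAsync( l_devPtr, l_pHostBuf, l_stBytes, l_stream );   // some queued work
    cuStreamQuery( l_stream );   // returns CUDA_ERROR_NOT_READY while work is pending;
                                 // the call itself can nudge the WDDM batch to be submitted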

You may also wish to experiment with turning hardware-accelerated GPU scheduling on and off.

Hi,

Thanks for the information. I removed the call to cuMemsetD8Async, but without success. The calls to cuMemcpyDtoHAsync and cuMemcpyHtoDAsync are still executed only when no kernel is active - async with regard to the host CPU, but not with regard to the GPU kernels.

Thanks for your replies anyway.

I forgot to reboot after switching off the Windows hardware GPU scheduling. After the reboot, 2 out of 3 now work: cuMemcpyHtoDAsync and cuMemsetD8Async run async to the kernel! That is already fantastic. Is there any chance to get the device-to-host memcpy async as well? If not, would it make sense to try zero-copy memory to come close to an async memcpy (roughly along the lines of the sketch below)?
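
For reference, the zero-copy variant would look roughly like the following sketch (not tested; the kernel would then write its results through the mapped device pointer directly into host memory, so the explicit cuMemcpyDtoHAsync would no longer be needed):

    void*       l_pHostResultInfo = nullptr;
    CUdeviceptr l_devResultPtr    = 0;

    // mapped ("zero-copy") pinned allocation
    cuMemHostAlloc( &l_pHostResultInfo, sizeof( CUDA_RESULT_INFO ) * CUDA_STREAMS,
                    CU_MEMHOSTALLOC_PORTABLE | CU_MEMHOSTALLOC_DEVICEMAP );

    // device-side pointer to the same memory, to be passed to / used by the kernel
    cuMemHostGetDevicePointer( &l_devResultPtr, l_pHostResultInfo, 0 );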

Thanks.