Hi,
I have a question regarding the ‘Async’ in cuMemcpyHtoDAsync and cuMemsetD8Async. From my understanding it means asynchronous with respect to the host CPU, but also with respect to kernel executions in a different stream. Here is an article describing the behavior I was hoping to get:
How to Overlap Data Transfers in CUDA C/C++ | NVIDIA Technical Blog
For my test I use an RTX 3060 on Windows 11 with VS2022 17.4.3 and CUDA 12.0.
I use a single context for the application, which I request with:
cuDevicePrimaryCtxRetain(&ctx, dev);
and release with:
cuDevicePrimaryCtxRelease(dev);
Whenever one of the threads in my application needs access to the GPU it first calls:
cuCtxPushCurrent(ctx);
then does the required CUDA work and restores the previous context with:
cuCtxPopCurrent(&ctx);
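Put together, the per-thread access pattern looks roughly like this (a minimal sketch; GpuThreadEntry and DoCudaWork are hypothetical names, and error handling is omitted):

#include <cuda.h>

void DoCudaWork(); // placeholder for the actual GPU work of the thread

void GpuThreadEntry( CUcontext p_ctx )
{
    CUcontext l_popped = nullptr;
    cuCtxPushCurrent( p_ctx );    // bind the shared primary context to this thread
    DoCudaWork();                 // enqueue async copies / kernel launches
    cuCtxPopCurrent( &l_popped ); // restore whatever context was current before
}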
For each stream in my application I create a new stream handle with cuStreamCreate - I never use the default stream, to avoid any issues with the async execution:
l_cuKernelError = cuStreamCreate( &l_iStreamInfo.cudaStream, CU_STREAM_NON_BLOCKING );
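For completeness, I create one such handle per stream, roughly like this (a sketch; l_arrStreamInfo is a hypothetical name for the array holding my per-stream structs):

for ( int i = 0; i < CUDA_STREAMS; ++i )
    l_cuKernelError = cuStreamCreate( &l_arrStreamInfo[i].cudaStream, CU_STREAM_NON_BLOCKING );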
I need two memory blocks to execute the kernel. One is for read-only information, the other one is for the result of the kernel execution.
In the device code I define both memory blocks like this:
#define CUDA_STREAMS 2
__device__ __constant__ CUDA_KERNEL_INFO dev_CudaKernelInfo[CUDA_STREAMS];
__device__ CUDA_RESULT_INFO dev_CudaResultInfo[CUDA_STREAMS];
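The per-stream device pointers used further below (devKernelInfoPtr / devResultInfoPtr) are resolved from these module symbols; a minimal sketch of that lookup with cuModuleGetGlobal (l_cuModule is a hypothetical name for my module handle, i is the stream index):

CUdeviceptr l_devKernelInfoBase = 0, l_devResultInfoBase = 0;
size_t l_stSize = 0;
// resolve the module-level symbols to device addresses
cuModuleGetGlobal( &l_devKernelInfoBase, &l_stSize, l_cuModule, "dev_CudaKernelInfo" );
cuModuleGetGlobal( &l_devResultInfoBase, &l_stSize, l_cuModule, "dev_CudaResultInfo" );
// each stream uses its own element of the two arrays
l_iStreamInfo.devKernelInfoPtr = l_devKernelInfoBase + i * sizeof( CUDA_KERNEL_INFO );
l_iStreamInfo.devResultInfoPtr = l_devResultInfoBase + i * sizeof( CUDA_RESULT_INFO );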
From what I understood from this (CUDA Runtime API :: CUDA Toolkit Documentation (nvidia.com)) I need to allocate ‘pinned’ memory to enable the async behavior.
In the host code I allocate the memory block for the read-only data like this:
l_cuKernelError = cuMemHostAlloc( (void**)&l_iStreamInfo.pHostKernelInfo, sizeof( CUDA_KERNEL_INFO ) * CUDA_STREAMS, CU_MEMHOSTALLOC_PORTABLE|CU_MEMHOSTALLOC_WRITECOMBINED );
The block for the results is allocated like this:
l_cuKernelError = cuMemHostAlloc( (void**)&l_iStreamInfo.pHostResultInfo, sizeof( CUDA_RESULT_INFO ) * CUDA_STREAMS, CU_MEMHOSTALLOC_PORTABLE );
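Since both blocks are pinned, they are later released with cuMemFreeHost rather than a plain free; a sketch of the matching cleanup:

cuMemFreeHost( l_iStreamInfo.pHostKernelInfo );
cuMemFreeHost( l_iStreamInfo.pHostResultInfo );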
The workflow for each kernel execution is quite simple:
- copy read-only memory to device
- init result memory block
- execute kernel
- copy results back from device to host
- launch host function to notify that the kernel has finished its work
As soon as one stream finishes its work, the host starts analyzing the results and restarts the kernel with new work.
This is how each step is implemented:
- copy read-only memory to device:
l_cuKernelError = cuMemcpyHtoDAsync( p_pStreamInfo->devKernelInfoPtr, (uint8_t*)p_pStreamInfo->pHostKernelInfo, sizeof( CUDA_KERNEL_INFO ), p_pStreamInfo->cudaStream );
(p_pStreamInfo->devKernelInfoPtr points to the stream's item in the device array)
- init result memory block:
l_cuKernelError = cuMemsetD8Async( p_pStreamInfo->devResultInfoPtr, 0, sizeof( CUDA_RESULT_INFO ), p_pStreamInfo->cudaStream );
(p_pStreamInfo->devResultInfoPtr points to the stream's item in the device array)
- execute kernel:
void* l_arrayStreamArgs[] = { &p_pStreamInfo->stStreamId, &p_pStreamInfo->ui64VarInfo };
l_cuKernelError = cuLaunchKernel( p_pKernelOnGpu->cuFunction, p_pKernelOnGpu->BlockConfig.ui32GridSize, 1, 1, p_pKernelOnGpu->BlockConfig.ui32BlockSize, 1, 1, (unsigned int)p_pKernelOnGpu->stSharedMemorySize, p_pStreamInfo->cudaStream, l_arrayStreamArgs, nullptr );
- copy results back from device to host:
l_cuKernelError = cuMemcpyDtoHAsync( p_pStreamInfo->pHostResultInfo, p_pStreamInfo->devResultInfoPtr, sizeof( CUDA_RESULT_INFO ), p_pStreamInfo->cudaStream );
- launch host function to notify that the kernel has finished its work:
l_cuKernelError = cuLaunchHostFunc( l_pStream->cudaStream, _Callback_EventKernelExecute, (void*)l_pStream );
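Putting the five steps together, the whole enqueue sequence per stream looks roughly like this (a condensed sketch of the calls above; STREAM_INFO and KERNEL_ON_GPU stand in for my actual struct types, and error handling is omitted):

void EnqueueWork( STREAM_INFO* p_pStreamInfo, KERNEL_ON_GPU* p_pKernelOnGpu )
{
    // everything below only enqueues work; it is async with respect to the host
    cuMemcpyHtoDAsync( p_pStreamInfo->devKernelInfoPtr, p_pStreamInfo->pHostKernelInfo,
                       sizeof( CUDA_KERNEL_INFO ), p_pStreamInfo->cudaStream );
    cuMemsetD8Async( p_pStreamInfo->devResultInfoPtr, 0,
                     sizeof( CUDA_RESULT_INFO ), p_pStreamInfo->cudaStream );
    void* l_args[] = { &p_pStreamInfo->stStreamId, &p_pStreamInfo->ui64VarInfo };
    cuLaunchKernel( p_pKernelOnGpu->cuFunction,
                    p_pKernelOnGpu->BlockConfig.ui32GridSize, 1, 1,
                    p_pKernelOnGpu->BlockConfig.ui32BlockSize, 1, 1,
                    (unsigned int)p_pKernelOnGpu->stSharedMemorySize,
                    p_pStreamInfo->cudaStream, l_args, nullptr );
    cuMemcpyDtoHAsync( p_pStreamInfo->pHostResultInfo, p_pStreamInfo->devResultInfoPtr,
                       sizeof( CUDA_RESULT_INFO ), p_pStreamInfo->cudaStream );
    cuLaunchHostFunc( p_pStreamInfo->cudaStream, _Callback_EventKernelExecute, (void*)p_pStreamInfo );
}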
I was hoping that steps 1 and 2 (the copy and the memset) for stream 1 are executed while the kernel of stream 2 is executing - and vice versa. I verified my assumption with NVIDIA Nsight Systems 2022.5.1 and got these results:
Here is the complete picture of the kernel executions. You can clearly see the alternating kernel executions of stream 1 and stream 2:
A more detailed view of the host-side calls:
As you can see, the calls from the host to the device for a stream are perfectly async and don't wait for any execution to finish. So that's great. But looking into the details of the copy and the memset on the device, I cannot see any async behavior.
Unfortunately it seems that none of the memory operations execute while a kernel from a different, non-default stream is running.
Is this expected, or have I made a mistake in my code?
Thanks a lot,
Daniel